Easy Extend



                              




                           Author: Kay Schluehr
                           Date: 25.04.2007
                           Version: 0.1
                                                                    
                                                             



Introduction

This is an introductory level tutorial for EasyExtend. It guides you through the creation of an extension language that is simple enough to be understood in any detail but also complex enough to shed light on many aspects of EasyExtend. After reading this tutorial you shall be able to create your own extension languages, called fibers in the EasyExtend context.

Prerequisites: understanding the Python language on tutorial level and you have installed the EasyExtend package on your machine and opened a Python console and an editor.






Create a fiber

A fiber encodes an extension language or a code generator within the EasyExtend framework. It might be helpfull to think of a fiber as a Python dialect that adds some new syntax and semantics to the language. EasyExtend preprocesses source files of the new extension languages using the custom fiber definitions and generates proper Python code that will be compiled to bytecodes.

Physically a fiber is represented by a small set of files which are mandatory for each fiber definition. EasyExtend implements a package level command named new_fiber which does the job of creating these files for you. Our first fiber is called hex_fiber, which implements a modified semantics of a hexadecimal literal.

>>> import EasyExtend
>>> EasyExtend.new_fiber("hex_fiber", prompt = "hx> ")
New fiber 'hex_fiber' created:

  [EasyExtend]+-[fibers]
                 +- [hex_fiber]   
                     +- conf.py
                     +- fiber.py
                     +- PyGrammar.py
                     +- Grammar
                     +- [fiber_mod]
                         +- symbol.py
                         +- token.py

>>> EasyExtend.run("hex_fiber")    # checkout our new fiber
___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> 0x0F        # normal Python commands shall be available
15
hx> quit        # use quit to cancel a fiber - even in Python 2.4 !
___________________________________________________________________________________


The targetted hex_fiber will contain just one simple extension : a modified hexadecimal number literal. Python contains already a way to type hexadecimal numbers indicated by the 0x prefix but those are represented as decimal integers. We want to prevent an immediate rebasing of the number and work with hexadecimals directly. Hence the hex_fiber changes the semantics of hexadecimal literals such that a number 0x7F is represented by a still to be defined Hex object.

A use pattern of our modified Hex literal is shown in the next box:

hx> 0x00               # 0x00 is a Hex object represented by the hex-literal notation
0x00
hx> h = 1 2            # sequences of numbers are allowed and will be converted into Hex objects
hx> h
0x01 0x02
hx> h ++ 0x03 0x04     # new binary operator '++' defines concatenation among Hex objects
0x01 0x02 0x03 0x04

Once being defined we can add methods to our Hex object on our own behalf. The hex_fiber is usefull when bit fiddling is involved in our application a lot and we try to enable a native type feeling of all our hexadecimal operations. The hex_fiber is our first domain specific language - Voilá.

Define grammar rules

EasyExtend acts as a transformer of parse-trees or CSTs ( Concrete Syntax Trees ). CSTs reflect the grammatical structure of a language directly. Once you have added a new grammar rule to the fibers Grammar file ( Grammar is a file in the file tree of your fiber ) the structure of the CST is completely determined by this rule. The terminals of the grammar contain the material content of the source such as names, numbers or punctuation like parens and dots. They don't get expanded anymore and refer to nothing. That's why they are terminal. Non-terminals are rule templates that refer to other non-terminals or terminals. Pythons grammar is of class LL(1) and Python requires a mild preprocessing during tokenization. Don't be scared of semantic whitespace! In the Grammar it has a residual status in the form of the two terminals INDENT and DEDENT which are moral equivalents of braces.

Most fiber definitions start with the addition of a new grammar rule and the modification of an existing one. The modification is necessary because we have to link our new grammar rule with the existing rule tree. Otherwise new rules will stay disconnected and might not be applied by the parser. New grammar rules are always appended at the end of the Grammar file. This is enforced by compliency with Pythons bytecode compiler which expects a strict order of rules. So if you want to drop a rule don't delete it from the Grammar but unlink it so that it can't be found when parsing the source.

Requirements

In case of our hex_fiber exactly two requirements have to be translated into grammar rules:
  1. A sequence of hexadecimal numbers 0xXX.. 0xYY... shall be mapped onto one Hex object.
  2. A concatenation operator ++ is defined among Hex objects.
Together with the new operator ++ we define a new __concat__ protocol ( and also an __rconcat__ protocol, o.k.? ) that can be bound to other objects.

The numbers non-terminal

For handling the first requirement we introduce a numbers non-terminal. Pythons default Grammar handles just one NUMBER terminal as an atom as shown below:

# Grammar
...
...

factor: ('+'|'-'|'~') factor | power
power: atom trailer* ['**' factor]
atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`' testlist1 '`' |
       NAME | NUMBER | STRING+
...

trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME

...

#
# adding new grammar rules here
#


When trying to type a sequence of numbers on our hex_fiber console the runtime thows a SyntaxError because the parser expects a subsequent non-numeral node. ( See how this is different from STRING terminals! ) When trying to type 1 [0] instead a grammatically correct expression is created on the first sight. However this expression is rejected on a deeper level within the compiler machinery and causes a TypeError as well:


hx> 1 3
  File "<stdin>", line 1
    1 3
      ^
SyntaxError: invalid syntax
hx> 1 [0]
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
TypeError: unsubscriptable object...


Our first rule will replace the NUMBER terminal within atom by the numbers non-terminal which is defined at the end of the Grammar file.

# Grammar
...
...

factor: ('+'|'-'|'~') factor | power
power: atom trailer* ['**' factor]
atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`' testlist1 '`' |
       NAME | numbers | STRING+
...

trailer: '(' [arglist] ')' | '[' subscriptlist ']' | '.' NAME

...

#
# adding new grammar rules here
#

numbers: NUMBER+

In order to process the numbers nonterminal we have to eliminate it from the parse tree again. Actually we have to assign meaning to it and this happens by transforming the numbers non-terminal into a valid Python expression. The location for these kinds of transformations is the FiberTransformer class in the fiber.py file of our hex_fiber package.

Inspecting parse trees

Before we proceed with defining transformers for our fiber lets examine how we can inspect token streams and parse trees to gain valuable information. The tools presented here are essential for debugging our fiber transformer. Without them we will likely spend hours with error seeking.

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py
*** Modify PyGrammar.py file ***

___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> 1+2
3
hx> quit
___________________________________________________________________________________


It takes some startup time because a new PyGrammar.py file is generated containing modified parse tables. The message *** Modify PyGrammar.py file *** is displayed when the Grammar is processed.

It might be surprising that typing 1+2 into the console does not cause an error because we still act on an untransformed CST. To watch the CST there are two command line options -b and -a. The option -b stands for before and means that a CST is displayed before transformation. The option -a stands for after and means that a CST is displayed after transformation.

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py -a      
___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> 1+2
[cst-after>

file_input  -- S`257 -- 257
  stmt  -- S`266 -- 266
    simple_stmt  -- S`267 -- 267
      small_stmt  -- S`268 -- 268
        expr_stmt  -- S`269 -- 269
          testlist  -- S`320 -- 320
            test  -- S`298 -- 298
              and_test  -- S`299 -- 299
                not_test  -- S`300 -- 300
                  comparison  -- S`301 -- 301
                    expr  -- S`303 -- 303
                      xor_expr  -- S`304 -- 304
                        and_expr  -- S`305 -- 305
                          shift_expr  -- S`306 -- 306
                            arith_expr  -- S`307 -- 307
                              term  -- S`308 -- 308
                                factor  -- S`309 -- 309
                                  power  -- S`310 -- 310
                                    atom  -- S`311 -- 311
                                      numbers  -- S`334 -- 334       <-------
                                        NUMBER  -- T`2 -- 2     L`1
                                          1
                                          1
                              PLUS  -- T`14 -- 14     L`1
                                +
                                1
                              term  -- S`308 -- 308
                                factor  -- S`309 -- 309
                                  power  -- S`310 -- 310
                                    atom  -- S`311 -- 311
                                      numbers  -- S`334 -- 334       <-------
                                        NUMBER  -- T`2 -- 2     L`1
                                          2
                                          1
      NEWLINE  -- T`4 -- 4     L`1


        1
  ENDMARKER  -- T`0 -- 0     L`2

    2


<cst-after]
3
hx>

The arrows  <---------  indicate nodes that cannot be compiled by the Python parser. So the compiler has to reject the CST and report a compiler error. The reason why the expression 1+2 works nevertheless is that the console has to unparse the CST back to source code before it can be evaluated on the fly. But this is just the consoles behaviour.

The same code snippet would fail as part of a script because the script can be compiled directly from the CST without a preceeding CST to source transformation.

In the listing above the parse tree is somewhat pretty printed. Another representation that is more closely related to the nature of the underlying datastructure is that of a nested list:

     [symbol.fileinput, [symbol.stmt,[simple.stmt,[....]]]

When using the parser module of Pythons standard library we get ( almost ) the same representation when parsing a Python expression ( or statement ) and listify it:

hx> import parser
hx> l = parser.expr("1+2").tolist()
hx> l
[258, [320, [298, [299, [300, [301, [303, [304, [305, [306, [307, [308, [309, [310, [311, [2, '1']]]]], [14, '+'], [308, [309, [310, [
311, [2, '2']]]]]]]]]]]]]]], [4, ''], [0, '']]
hx> pprint(l)

eval_input  -- S`258 -- 258
  testlist  -- S`320 -- 320
    test  -- S`298 -- 298
      and_test  -- S`299 -- 299
        not_test  -- S`300 -- 300
          comparison  -- S`301 -- 301
            expr  -- S`303 -- 303
              xor_expr  -- S`304 -- 304
                and_expr  -- S`305 -- 305
                  shift_expr  -- S`306 -- 306
                    arith_expr  -- S`307 -- 307
                      term  -- S`308 -- 308
                        factor  -- S`309 -- 309
                          power  -- S`310 -- 310
                            atom  -- S`311 -- 311
                              NUMBER  -- T`2 -- 2
                                1
                      PLUS  -- T`14 -- 14
                        +
                      term  -- S`308 -- 308
                        factor  -- S`309 -- 309
                          power  -- S`310 -- 310
                            atom  -- S`311 -- 311
                              NUMBER  -- T`2 -- 2
                                2
  NEWLINE  -- T`4 -- 4

  ENDMARKER  -- T`0 -- 0


hx>

Of course Python won't generate a numbers node in the tree. It reasonably operates on its own grammar.

How to read nodes

In the CST display each nonterminal is formatted like:
 
not_test  -- S`300 -- 300
which has the following general meaning:
   
<nonterminal node name> -- S`<node id> -- <projected node id>
The S` prefix is for Symbol.

Analog we have to read terminals like:
  NUMBER  -- T`2 -- 2   L`1
as
    
<terminal node name> -- T`<node id> -- <projected node id>   L`<Line number>
The T` prefix is for Token and the L` prefix is for Line.

Since EasyExtend permits multiple fibers to coexist in the same context, it assigns a unique range of node id's to the node types of a fiber.
Therefore fiber definitions can be unambigously transformed into Python expressions. The node id range for Python is 1..512. It contains the node id's being recognized by the compiler. After fiber transformation, all node id's of the parse tree are projected onto the Python range. This is easy because each Python node id is the restclass of division by 512. For the node id of the stmt node we have the equation:

           py_smbol.suite = fiber_symbol.suite % 512

The only restriction this imposes on a fiber definition is a rigid segmentation. For terminals ( token.py ) 256 node id's are reserved for Python. The same number is available for non-terminals ( symbol.py ). Currently around 60 terminals and 80 nonterminals are used. So there is still lots of space left. Let me know when you need a larger node id range.

Display the translated Python code

Each CST can be unparsed. You might display the unparsed CST using the option -p for Python. Restart the console and take a look:

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py -p     
___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> 1+2
[python-source>

1 + 2

<python-source]
3
hx>

We have examined the -a option for watching the CST after being transformed and the -p option for watching the Python code of the transformed expression. Additionally there are following options:
  -b
show CST before it gets transformed
 -t
show token sequence
 -m node_name[,node_name]* mark nodes in CST with node(s) name given by the parameters of the -m
option. If multiple node names are used they have to be comma seperated

Check them out and also combine them:

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py -b -m NUMBER,term -t     
___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> 1 2 3
[token>
----------------------------------------------------------------------------------------.
 Line  | Columns | Token Value | Token Name    | Token Id    | Token class | Class Id   |
-------+---------+-------------+---------------+-------------+-------------+------------+
 1     | 0-1     | '1'         | NUMBER        | 1538 -- 2   | NUMBER      | 1538 -- 2  |
 1     | 2-3     | '2'         | NUMBER        | 1538 -- 2   | NUMBER      | 1538 -- 2  |
 1     | 4-5     | '3'         | NUMBER        | 1538 -- 2   | NUMBER      | 1538 -- 2  |
 1     | 5-6     | \n          | NEWLINE       | 1540 -- 4   | NEWLINE     | 1540 -- 4  |
 2     | 0-0     | ''          | ENDMARKER     | 1536 -- 0   | ENDMARKER   | 1536 -- 0  |
----------------------------------------------------------------------------------------'
<token]

[cst-before>

file_input  -- S`1793 -- 257
  stmt  -- S`1802 -- 266
    simple_stmt  -- S`1803 -- 267
      small_stmt  -- S`1804 -- 268
        expr_stmt  -- S`1805 -- 269
          testlist  -- S`1856 -- 320
            test  -- S`1834 -- 298
              and_test  -- S`1835 -- 299
                not_test  -- S`1836 -- 300
                  comparison  -- S`1837 -- 301
                    expr  -- S`1839 -- 303
                      xor_expr  -- S`1840 -- 304
                        and_expr  -- S`1841 -- 305
                          shift_expr  -- S`1842 -- 306
                            arith_expr  -- S`1843 -- 307
                              term  -- S`1844 -- 308       <-------
                                factor  -- S`1845 -- 309
                                  power  -- S`1846 -- 310
                                    atom  -- S`1847 -- 311
                                      numbers  -- S`1870 -- 334
                                        NUMBER  -- T`1538 -- 2       <-------
                                          1
                                          1
                                        NUMBER  -- T`1538 -- 2       <-------
                                          2
                                          1
                                        NUMBER  -- T`1538 -- 2       <-------
                                          3
                                          1
      NEWLINE  -- T`1540 -- 4     L`1


        1
  ENDMARKER  -- T`1536 -- 0     L`2

    2


<cst-before]

Traceback (most recent call last):
  File "C:\lang\Python24\lib\site-packages\EasyExtend\eeconsole.py", line 183, in compile_cst
    _code = compile(src,"<input>","single", PyCF_DONT_IMPLY_DEDENT)
  File "<input>", line 2
    1 2 3
      ^
SyntaxError: invalid syntax
hx>

Obviously a seqence of blank separated numbers 1 2 3 cannot yet be backtransformed into a Python expression. Note however that we parsed accurately. Note also the node ids of the hex_fiber node. These node ids will likely differ in your own display.


The Fiber Transformer

Specify the numbers transformation

A numbers nonterminal has always the structure

     [symbol.numbers, [token.NUMBER, "1", line-number], [token.NUMBER, "0x00", 42],...].

Here 42 is a fictitious line number and can should be ignored by all our transformations. Note that the value of a terminal is always represented by a string.

The transformation of numbers is specified by those two rules:
  • When a single NUMBER terminal is found and its value starts with "0x" it shall be replaced by a Hex object. Otherwise it shall not be translated ( e.g. "1" shall not translated into a Hex ).
  • When a blank separated sequence of NUMBER terminals of at least length = 2 is found each terminal value shall be translated into a Hex object independent of all prefix considerations.

The fiber module

The transformation has to be encoded as a method in the FiberTransformer class of the fiber.py

The canonical fiber.py layout consists of some module imports, an "if __name__ == '__main__'" startup procedure that either executes a module or enters a console and two attributes named __publish__ and FiberTransformer.

from conf import*   # fiber specific modules and settings                  
import EasyExtend   # EasyExtend package
import PyGrammar    # contains fiber specific DFA tables used by the parser
from EasyExtend.csttools import*   # tools used to manipulate CSTs
from EasyExtend.eetransformer import Transformer, transform


class FiberTransformer(Transformer):
    '''
    Defines fiber specific transformations
    '''

__publish__ = []

if __name__ == '__main__':
    import fiber
    (options,args) = opt.parse_args()
    fiber.options  = options.__dict__
    if args:
        py_module = args[-1]
        EasyExtend.run_module(py_module, fiber)
    else:
        console = EasyExtend.create_console("hex_fiber", fiber)
        console.interact()

In the first step we add a transformation rule to our FiberTransformer which is a method named after the terminal or nonterminal that shall be transformed. To mark it as a transformation rule it shall be decorated with the @transform decorator.

class FiberTransformer(Transformer):
    '''
    Defines fiber specific transformations
    '''
    @transform
    def numbers(self, node):
        '''
        numbers: NUMBER+
        '''


Find subnodes

Now we keep the sequence of NUMBER terminals and check the length. If the length equals 1 the prefix of the first NUMBER is inspected.

class FiberTransformer(Transformer):
    '''
    Defines fiber specific transformations
    '''
    @transform
    def numbers(self, node):
        '''
        numbers: NUMBER+
        '''
        _numbers = find_all(node, token.NUMBER)   # find_all returns all NUMBER nodes found as
                                                  # subnodes of node
        if len(_numbers) == 1:
           if _numbers[0][1].startswith("0x"):          # keep a NUMBER and inspect its form
               # implement case of 0xYYYY...           
           else:
               # implement case of decimal or octal number
        else:
           
# implement case of sequence of numbers


Searching for subnodes of a node is done with the functions find_node and find_all.
find_node( cst, node_id, level=10000)
Seeks for a node with id = node_id in cst. It returns the first node found in a search is depth-first search or None.
          The maximal depth of the searched cst is constrained by level which has a high value by default. If level = 1 only
the immediate subnodes of cst will be inspected.
find_all( cst, node_id, level=10000)
Seeks for all nodes with id = node_id in cst. Return a list of these nodes. If no node is found the empty list is returned. The search mode is depth-first. The maximal depth of the searched CST is constrained by level which has a high value by default. If level = 1 only the immediate subnodes of cst will be inspected.
Most of the time we need just a few search operations to keep all important data to start or constrain a transformation.

Create new CSTs

We need to fill the implementation gaps with function calls that create new CST fragments. Everything is based on the cst.py module that contains functions for CST node creation. These functions are named after the nodes they create: we have a function try_stmt to create try-statement nodes, a test function to create test nodes etc. A module that provides abstractions from raw CST node creation is cstgen.py. It defines a more convenient and "intelligent" API. So you don't usually need to create the whole node structure from the bottom up but there are generic wrapper functions like any_test and any_stmt ( defined in csttools.py ).  Functions like CST_CallFunc, CST_Add etc. use these wrapper functions and produce the relevant nodes.


if len(_numbers) == 1:
    if _numbers[0][1].startswith("0x"):          
# keep a NUMBER and inspect its form
        return CST_CallFunc("Hex",[_numbers[0]])  # returns a node of type symbol.power.
    else:
        return atom(_numbers[0])                 
# do not wrap number by Hex and substitute
else:                                             # the numbers nonterminal by an atom
    # implement case of sequence of numbers


This same function can be applied to argument lists.

if len(_numbers) == 1:
    if _numbers[0][1].startswith("0x"):          
# keep a NUMBER and inspect its form
        return CST_CallFunc("Hex",[_numbers[0]])  # returns a node of type symbol.power.
    else:
        return atom(_numbers[0])                 
# do not wrap number by Hex and substitute
else:
   
return CST_CallFunc("Hex",_numbers)


The _numbers are passed as arguments to the Hex constructor.

Exercise: use option -p to inspect the translated code. Also checkout options -b, and -a to get a feeling for raw and transformed CSTs. Python will obviously complain about not finding Hex. That's o.k. at this stage.

The Hex class

We implement the necessary minimum of a Hex class. Besides __init__ and __repr__ only the methods __concat__ and __rconcat__ used to support the "++" operator are defined. Additional methods can be added on your will.

class Hex(object):
    def __init__(self, *args):
        self.bytes = []
        for arg in args:
            if isinstance(arg, (int,long)):
                self.bytes.append(self.create_from_int(arg).bytes)
            elif isinstance(arg, Hex):
                self.bytes.append(arg.bytes)
            else:
                raise TypeError,"cannot convert to Hex object: '%s'"%arg
        self.bytes = sum(self.bytes,[])

    @classmethod
    def create_from_int(cls, i):
        assert isinstance(i,(int,long)), "type of i is %s"%type(i)
        h = Hex()
        h.bytes = []
        v = i
        if v == 0:
            h.bytes = [0]
        else:
            while v:
                v,r = divmod(v,0x100)
                h.bytes.insert(0,r)
        return h

    def __len__(self):
        return len(self.bytes)
    
    def __concat__(self, other):
        if isinstance(other, Hex):
            h = Hex()
            h.bytes = self.bytes+other.bytes
            return h
        else:
            h = Hex()
            h.bytes = self.bytes+Hex(other).bytes
            return h
   
    def __rconcat__(self, other):
        return Hex(other).__concat__(self)

    def __repr__(self):      
        return " ".join(["0x%02X"%item for item in self.bytes])


Define a concatenation operator "++"

The operator '++' is used to concatenate Hex objects as arrays of bytes. This operator must be made public to the tokenizer first.

A generic tokenizer is configured using the EEToken module in eetoken.py. In order to define new token one must subclass EEToken in the token.py module of the fiber. This can be found in the fibermod directory of the hex_fiber package.


class EEToken(object):
    def __new__(cls):
        # --begin of default constants --
        # Warning! Do not change this section
        cls.ENDMARKER   = (0, _, TOKENIZE_PY)
        cls.NAME        = (1, _, TOKENIZE_PY)
        cls.NUMBER      = (2, _, TOKENIZE_PY)
        cls.STRING      = (3, _, TOKENIZE_PY)
        cls.NEWLINE     = (4, _, TOKENIZE_PY)
        cls.INDENT      = (5, _, TOKENIZE_PY)
        cls.DEDENT      = (6, _, TOKENIZE_PY)
        cls.LPAR        = (7, "(", LBRA)
        cls.RPAR        = (8, ")", RBRA)
        cls.LSQB        = (9, "[", LBRA)
        cls.RSQB        = (10, "]", RBRA)
        cls.COLON       = (11, ":", SPECIAL)
        cls.COMMA       = (12, ",", SPECIAL)
        cls.SEMI        = (13, ";", SPECIAL)
        cls.PLUS        = (14, "+", OPERATOR)
        cls.MINUS       = (15, "-", OPERATOR)
        ...

        ...

    @classmethod
    def Initial(cls):
        return map(re.escape, cls.select(LBRA))

    @classmethod
    def Final(cls):
        return map(re.escape, cls.select(RBRA))
    ...

The token ( terminals ) are defined in EEToken as triples.
                                                   
                          ( <node id>, <token value>, <token type> )

The numeral is just the node id of the token. In case of a token defined by Python these node_ids have to be fixed. The second entry is the token value. If no static literal is known an underscore _ can be used as a placeholder instead. The third entry is the token type. These token types are used to extract token from EEToken and form groups of them. For example the Initial and Final classmethods select all initial or final parens using the LBRA and RBRA token types. These methods are called within the gettoken function of the eetokpattern.py module that prepares token as regular expressions for the tokenizer defined in eetokenizer.py.

New custom token are defined in the token.py module of our fiber which resides in the fibermod directory. Opening the token.py file we find following FiberToken class definition.

from EasyExtend.eetoken import*

class FiberToken(EEToken):
    def __new__(cls):
        EEToken.__new__(cls)
        #
        # define new token here
        #
        return cls


We add our new token '++'. We have to select a name and a node id. The node shall not be in the range of the node ids defined in FiberTokens base class EEToken ( unless we intend to overwrite it explicitely). The node id is arbitrery but starting with 100 is never going to fail. Finally we select the token type OPERATOR.

from EasyExtend.eetoken import*

class FiberToken(EEToken):
    def __new__(cls):
        EEToken.__new__(cls)
        cls.PLUSPLUS = (100, '++', OPERATOR)
        return cls


Let's check now if our token definition works

tr> :hex_fiber -t
___________________________________________________________________________________

 hex_fiber

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
___________________________________________________________________________________

hx> x ++ y
[token>
(1537, 'x', (1, 0), (1, 1), 'x ++ y\n')
(1587, '++', (1, 2), (1, 4), 'x ++ y\n')     # yeah !!!
(1537, 'y', (1, 5), (1, 6), 'x ++ y\n')
(1540, '\n', (1, 6), (1, 7), 'x ++ y\n')
(1536, '', (2, 0), (2, 0), '')
<token]

Traceback (most recent call last):
  File "C:\lang\Python24\lib\site-packages\EasyExtend\eeconsole.py", line 218, in try_parse
    parseTree = dfa_parser.parsetok(tokenizer, self.grammar, self.fiber.symbol.file_input)
  File "C:\lang\Python24\lib\site-packages\EasyExtend\parser\DFAParser.py", line 292, in parsetok
    raise SyntaxError("Error in line %d%s" % (lineno, errMsg))
SyntaxError: Error in line 1, unexpected '++'
hx>


Again we get an error, since we need to transform the according terminal of the '++' operator.

Recognizing  "++" in the Grammar

We reopen the Grammar file and add the '++' operator. First we have to clarify the binding strength of ++. Obviously it shall be weaker than arithmetic operations. It shall also be weaker than shift and logical operations but stronger than logical and, or and not. The only place between logical and arithmetical / shift operators is the comparison node and although it is not obvious to subsume the concatenation operator under the comparison node we have no other choice:

...
test: and_test ('or' and_test)* | lambdef
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (('++' expr)* | (comp_op expr)*)    # was
expr (comp_op expr)*
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
expr: xor_expr ('|' xor_expr)*
...


Exercise: use option -b to inspect the CST of "x ++ y". Use also -m PLUSPLUS to display the PLUSPLUS terminal.

Define operator_concat

We define the behaviour of the concatenation operator. The Hex object is already prepared for usage by implementing the __concat__ and __rconcat__ protocols. But we approach the meaning of "++" in a more generic fashion. Each class can implement the __concat__ protocoll just like it might implement __add__ or __mul__.


def operator_concat(*args):
    '''
    Defines behaviour of '++'.  
    '''
    a,b = args[:2]
    if a in (basestring, list, tuple):
       res = a+b
    else:
        if hasattr(a,"__concat__"):
            res = a.__concat__(b)
        elif hasattr(b,"__rconcat__"):
            res = b.__rconcat__(a)
        else:
            raise TypeError, ":unsupported operand type(s) for ++: '%s' and '%s'"%(type(a), type(b))
    if args[2:]:
        return operator_concat(*(res,)+args[2:])   # recursion on sequences
    return res


In the final step we have to implement the transformer of the comparison node we modified in the Grammar. This time we use the macro fiber. Macros are expressions of the macro fiber. Macros contain special variables called node variables written in angle brackets. A node variable is a placeholder for CST nodes and passing a CST node into a macro leads to its expansion in place of the node variable. The macro fiber will be imported like any other module:

import EasyExtend.fibers.macro.fiber as macro_fiber

The macro fiber defines the class macro which implements the expand method.

class FiberTransformer(Transformer):
    '''
    Defines fiber specific transformations
    '''

    ...

    @transform
    def comparison(self, node):
        if find_node(node, token.PLUSPLUS, level = 1):
            _expressions = find_all(node, symbol.expr, level = 1)
            mc =
macro_fiber.macro("operator_concat(*<expr>)")
            return mc.expand({"expr": _expressions}, returns = symbol.comparison)

The target string "operator_concat(*<expr>)" contains the node variable <expr>. Since the interface of operator_concat expects a variable argument list and <expr> is expected to be a list of nodes ( the content of _expressions ), the node variable is declared as a stared arg *<expr>. Nodes are passed as a dictionary into the expand method. The node variables are represented by the keys of the dictionary while the values shall substitute the node vars. We have to care for the node type of the return value. A macro can be executed in any context and returns a top level node, mostly symbol.file_input. Substituting a node like comparison by file_input may have fatal consequences because it substitutes too much, namely the whole surrounding program. So we have to extract the comparison node from the file_input node and return it. This is done by declaring the value of the keyword returns to be symbol.comparison.

Exercise: try to use CST_CallFunc instead of the macro class. Checkout other symbols to be returned like symbol.test.

The __publish__ attribute

Whenever a list of numbers or a hexadecimal number is typed into a hex_fiber script it will be immediately converted into a Hex object and whenever a ++ operator is applied it will be converted into an operator_concat function call. However neither Hex nor the operator_concat function are globally visible unless imported explicitely. A rather simple means for ensuring global visibility is to put Hex and operator_concat into __builtins__. To simplify this even more each fiber has a __publish__ attribute that keeps a list of names. In case of the hex_fiber we have:

                __publish__ = ["Hex", "operator_concat"]

hex_fiber modules

We have completed our fiber definition. Now we want to write scripts using the specific hex_fiber syntax and execute them as fiber modules.

# module demo.py  ( version 1 )
# ------------------------------

def add_padding(arg, pad = 0x00, size = 8, right_pad = True):
    res = arg
    for i in range(size-len(arg)%size):
        if right_pad:
           res = res ++ pad   # right padding
        else:
           res = pad ++ res   # left padding
    return res

if __name__ == '__main__':
    print add_padding( 0x01 0x02 0x03 )
    print add_padding( 0x01 0x02 0x03 , pad = 0x80)
    print add_padding( 0x01 0x02 0x03 , pad = 0x80, right_pad = False)


The function add_padding adds leading or trailing bytes to a Hex object. Now we call demo.py from fiber.py.

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py demo.py
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03

There is something remarkable about the result, which might not be too obvious on the first sight.
The if __name__ == '__main__' :... section of demo.py is called although demo is not __main__ in this case but fiber! The answer can be found when we look at the translated code:

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py demo.py -p
[python-source>


# module demo.py
# --------------

def add_padding(arg, pad = Hex( 0x00 ) , size = 8, right_pad = True):
    res = arg
    for i in range(size-len(arg)%size):
        if right_pad:
            res =operator_concat(( res) ,( pad ) ,)
        else:
            res =operator_concat(( pad) ,( res ) ,)
    return res


def __like_main__ ( ):
    print add_padding( Hex( 0x01, 0x02, 0x03 ))
    print add_padding( Hex( 0x01, 0x02, 0x03 ) , size = 11, pad = Hex( 0x80 ))
    print add_padding( Hex( 0x01, 0x02, 0x03 ) , pad = Hex( 0x80 ) , right_pad = False )
if __name__ == '__main__':
    __like_main__ ( ) ;

<python-source]
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03

During code transformation the __like_main__() function has been created. The idea behind __like_main__ is that it can be called from the fiber and executes all statements in the function body that are otherwise defined in the if __name__ == '__main__':
section of the module.

Cooperation with standard Python

The __main__ section is still present for the purpose of demo being called directly from python. This shall be possible at least for the compiled version since demo.pyc contains the translated code. Unfortunately the module is not self contained yet. When we execute demo.py from fiber.py all fiber definitions are present, in particular Hex and operator_concat. Not so here. We need to import some names and initialize the fiber. Our solution is the import __fiber__ declaration.

# module demo.py  ( version 2 )
# ------------------------------

import __fiber__

def add_padding(arg, pad = 0x00, size = 8, right_pad = True):
    res = arg
    for i in range(size-len(arg)%size):
        if right_pad:
           res = res ++ pad   # right padding
        else:
           res = pad ++ res   # left padding
    return res

if __name__ == '__main__':
    print add_padding( 0x01 0x02 0x03 )
    print add_padding( 0x01 0x02 0x03 , pad = 0x80)
    print add_padding( 0x01 0x02 0x03 , pad = 0x80, right_pad = False)


The import __fiber__ statement will be expanded by the transformer into:


import EasyExtend.eecommon as eecommon; __fiber__ = eecommon.import_and_init_fiber( "hex_fiber" );


Now all usefull names are also present in demo.py when it is called as a standalone module from Python or imported by another Python ( not hex_fiber! ) module. Just take care that you call demo.pyc and not demo.py with Python.

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python demo.pyc
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03

In a second demonstration we create the demo2.py module that uses definitions of demo.py:

# module demo2.py  ( version  )
# ------------------------------

import demo

print demo.add_padding(Hex(0x01, 0x02, 0x03), pad = 0x7)


Note the name Hex is automatically available because it is made __builtin__ by the hex_fiber being initialized in demo.py

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python demo2.py
0x01 0x02 0x03 0x07 0x07 0x07 0x07 0x07