Easy Extend



                              




                            Author: Kay Schluehr  
                            E-Mail: kay@fiber-space.de
                            Date: 2008-03-02
                            Version: Tutorial for EasyExtend 3.0.2                                                                                                     



Introduction

This is an introductory level tutorial for EasyExtend. It will guide you through the creation of an extension language that is simple enough to be understood in any detail but also complex enough to shed light on many aspects of EasyExtend. After reading this tutorial you shall be able to create your own extension languages, called langlets in the EasyExtend context. For brevity the tutorial won't cover all aspects of EasyExtend; these are explained in more detail in the Trail parser document or the project overview document.

Prerequisites:
  • understanding the Python language on tutorial level 
  • understanding the very basics of grammars, lexers and parsers
  • having installed the EasyExtend package on your machine 
  • opened a Python console and an editor.



1. Create a langlet

Basically a langlet encodes an extension language or a code generator within the EasyExtend framework. It might be helpful to think of a langlet as a Python dialect that adds some new syntactical and/or semantical conventions to the language. EasyExtend preprocesses source files written in langlet source code and generates proper Python code that will be compiled to bytecodes.

Physically a langlet is represented by a small set of files on the disk which are mandatory for each langlet definition. EasyExtend implements a package level command named new_langlet which creates a langlet for you. The langlet we create in this tutorial is called hex_langlet. The hex_langlet is a domain specific Python extension used to deal with hexadecimal objects as bitmaps.

We start the creation process from a Python console:

>>> import EasyExtend
>>> EasyExtend.new_langlet("hex_langlet", prompt = "hx> ", source_ext = "phx")
*** Modify C:\lang\Python25\lib\site-packages\EasyExtend\langlets\hex_langlet\lexdef\lex_nfa.py ***
*** Modify C:\lang\Python25\lib\site-packages\EasyExtend\langlets\hex_langlet\parsedef\parse_nfa.py ***
New langlet 'hex_langlet' created:

... [EasyExtend]+-[langlets]
                   +- [hex_langlet]
                       +- __init__.py
                       +- run_hex_langlet.py
                       +- conf.py
                       +- langlet.py
                       +- [lexdef]
                           +- __init__.py
                           +- lex_symbol.py
                           +- lex_token.py
                           +- lex_nfa.py
                           +- Token.ext
                       +- [parsedef]
                           +- __init__.py
                            +- parse_symbol.py
                            +- parse_token.py
                            +- parse_nfa.py
                           +- Grammar.ext
                       +- [reports]
                       +- [tests]
>>>

This takes 3-4 seconds on my machine. 

Three input parameters were set in the new_langlet function call. Those are:

"hex_langlet"         the name of our new langlet. This parameter is mandatory.
prompt = "hx> "       prompt used in hex_langlet consoles ( note the blank after "hx>"! ). If omitted
                      the Python standard prompt ">>> " is used.
source_ext = "phx"    file extension for langlet source files. If omitted the Python standard source
                      file extension py is used.

Hex objects shall be hybrids between numbers and strings.

1.1 A first start

C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python run_hex_langlet.py
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00
0
hx>

When we type in 0x00 0x00 we see a typical error message of the Trail parser in EasyExtend 3.0

hx> 0x00 0x00
Traceback (most recent call last):
  ...
File "C:\lang\Python25\lib\site-packages\EasyExtend\trail\cfaparser.py", line 259, in handle_error
    raise ParserError(s+rule, **{"token": tok, "selection": selection})
SyntaxError: Failed to parse input '0x00' at (line 1 , column 9).

    0x00 0x00
         ^^^^

Expected symbol(s) from applied grammar rule:  simple_stmt: small_stmt ( T_NEWLINE | ';' )
                                                                       ~~~~~~~~~~~~~~~~~~~
hx>

The Trail parser complains that it can't parse '0x00'. It expects either a T_NEWLINE token i.e. a line break '\n' or a semicolon ';' while applying grammar rule simple_stmt. To put it short: after typing a single hex number the parser expects that the statement is finished. Part of our efforts will be to avoid this error message when typing the same input in the hx console.

1.2 Functional requirements 

Hex objects shall provide arithmetic operations for adding ( __add__ ), multiplying ( __mul__ ), etc. Hex objects. They shall also provide list operations for slicing, concatenation, subscripting ( __getitem__ ) etc. On constructor call the Hex object shall accept ints and longs as well as strings that can represent hex numbers.
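
To make these requirements concrete, here is a minimal sketch of what such a runtime class could look like. The class name Hex and all implementation details are assumptions for illustration; the actual hex_langlet runtime may differ.

```python
class Hex(object):
    """Hypothetical sketch of a Hex runtime object: a byte sequence that
    supports arithmetic like a number and subscripting like a list."""
    def __init__(self, value):
        if isinstance(value, str):                 # strings like "0x01 0x02"
            self.bytes = [int(p, 16) for p in value.split()]
        elif isinstance(value, int):               # ints ( and longs on Python 2 )
            self.bytes = []
            n = value
            while True:
                self.bytes.insert(0, n & 0xFF)
                n >>= 8
                if n == 0:
                    break
        elif isinstance(value, Hex):
            self.bytes = list(value.bytes)
        else:                                      # any iterable of byte values
            self.bytes = [int(b) & 0xFF for b in value]

    def __int__(self):                             # numeric value of the byte sequence
        n = 0
        for b in self.bytes:
            n = (n << 8) | b
        return n

    def __add__(self, other):                      # arithmetic on the numeric value
        return Hex(int(self) + int(other))

    def __mul__(self, other):
        return Hex(int(self) * int(other))

    def __getitem__(self, i):                      # list-like byte access and slicing
        sub = self.bytes[i]
        return Hex(sub if isinstance(sub, list) else [sub])

    def __concat__(self, other):                   # protocol behind the new '++' operator
        return Hex(self.bytes + Hex(other).bytes)

    def __repr__(self):
        return " ".join("0x%02x" % b for b in self.bytes)
```

With this sketch, Hex("0x01 0x02") + 11 evaluates to 0x01 0x0d, matching the console session shown later in this tutorial.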

2. Defining Syntax

After having roughly described the functional requirements we care about look & feel. Hex objects shall feel like native datatypes. Use cases are described in the box below.

hx> 0x00               # 0x00 is a Hex object represented by the hex-literal notation
0x00
hx> h = 1 2            # sequences of numbers are allowed and will be converted into Hex objects
hx> h
0x01 0x02
hx> h ++ 0x03 0x04     # new binary operator '++' defines concatenation among Hex objects
0x01 0x02 0x03 0x04
hx> h[0]               # subscripts
0x01
hx> h+11               # a little arithmetic
0x01 0x0d

2.1 Syntax Requirements

We summarize the syntactical requirements as follows:
  1. A sequence of hexadecimal numbers 0xXX.. 0xYY... shall be mapped onto one Hex object. A single hexadecimal literal shall be treated as a Hex object.
  2. A concatenation operator ++ shall be defined among Hex objects.
Together with the new operator ++ we define a __concat__ protocol ( and also an __rconcat__ protocol ) that can be bound to other objects as well.

In order to implement those requirements we have to understand EasyExtend fundamentals.
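
As a sketch of what the __concat__ / __rconcat__ protocol could mean at runtime, here is a hypothetical dispatcher that the '++' operator might be rewritten to. The function and class names are illustrative assumptions, not EasyExtend API.

```python
def concat(a, b):
    """Hypothetical dispatcher for '++': try a.__concat__(b) first,
    then fall back to b.__rconcat__(a)."""
    if hasattr(a, "__concat__"):
        return a.__concat__(b)
    if hasattr(b, "__rconcat__"):
        return b.__rconcat__(a)
    raise TypeError("unsupported operand type(s) for ++")

class Seq(object):
    """Any object can opt into the protocol, not just Hex objects."""
    def __init__(self, items):
        self.items = list(items)
    def __concat__(self, other):
        return Seq(self.items + other.items)
    def __rconcat__(self, other):
        return Seq(other.items + self.items)

s = concat(Seq([1]), Seq([2, 3]))
print(s.items)   # [1, 2, 3]
```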

2.2 Grammars and Tokenizers

EasyExtend describes all syntax in the form of grammar files in EBNF notation. This is a convenient syntax to describe grammar rules for top down parsers. EasyExtend uses plain EBNF i.e. there is no notation for dealing with operator precedences and there are no parser actions: all transformations are post-processing steps on parse CSTs ( concrete syntax trees ) or token streams created by a lexer.

Actually the reader for grammar files in EasyExtend is realized as a langlet, the grammar_langlet. This illustrates another property of EasyExtend. It can be used to define arbitrary languages, not just Python extensions. So the very first language we consider is EBNF:

# Grammars Grammar

file_input: ( rule | NEWLINE )* ENDMARKER
rule: NAME ':' rhs NEWLINE
rhs: alt ( '|' alt )*
alt: item+
item: '[' rhs ']' | atom [ '*' | '+' ]
atom: '(' rhs ')' | NAME | STRING | '_'

Nonterminals ( grammar- or production rules ) like file_input, rule, rhs etc. are usually written in lower case letters while terminals like NAME, NEWLINE etc. are written in uppercase letters. Line comments are stripped. As an exercise you can apply the grammar on itself.
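
That exercise can be sketched with a small hand-written recursive-descent reader for this EBNF, applied to the grammar text itself. This is an illustration only; the actual grammar_langlet works differently.

```python
import re

# Lexer for the "Grammars Grammar": names, quoted strings, punctuation, newlines.
TOKEN_RE = re.compile(r"(?P<NAME>\w+)|(?P<STRING>'[^']*')"
                      r"|(?P<OP>[:|\[\]()*+])|(?P<NEWLINE>\n)"
                      r"|(?P<SKIP>[ \t]+|#[^\n]*)")

def tokenize(text):
    tokens = [(m.lastgroup, m.group()) for m in TOKEN_RE.finditer(text)
              if m.lastgroup != "SKIP"]
    return tokens + [("ENDMARKER", "")]

class GrammarReader(object):
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0
    def peek(self):
        return self.tokens[self.pos]
    def eat(self, kind=None, value=None):
        k, v = self.tokens[self.pos]
        if (kind and k != kind) or (value and v != value):
            raise SyntaxError("expected %s %r, got %s %r" % (kind, value, k, v))
        self.pos += 1
        return v
    def file_input(self):          # file_input: ( rule | NEWLINE )* ENDMARKER
        rules = []
        while self.peek()[0] != "ENDMARKER":
            if self.peek()[0] == "NEWLINE":
                self.eat("NEWLINE")
            else:
                rules.append(self.rule())
        return rules
    def rule(self):                # rule: NAME ':' rhs NEWLINE
        name = self.eat("NAME")
        self.eat("OP", ":")
        body = self.rhs()
        self.eat("NEWLINE")
        return (name, body)
    def rhs(self):                 # rhs: alt ( '|' alt )*
        alts = [self.alt()]
        while self.peek() == ("OP", "|"):
            self.eat("OP", "|")
            alts.append(self.alt())
        return ("rhs", alts)
    def alt(self):                 # alt: item+
        items = [self.item()]
        while self.peek()[0] in ("NAME", "STRING") or self.peek()[1] in ("[", "("):
            items.append(self.item())
        return items
    def item(self):                # item: '[' rhs ']' | atom [ '*' | '+' ]
        if self.peek() == ("OP", "["):
            self.eat("OP", "[")
            r = self.rhs()
            self.eat("OP", "]")
            return ("opt", r)
        a = self.atom()
        if self.peek() in (("OP", "*"), ("OP", "+")):
            return (self.eat("OP"), a)
        return a
    def atom(self):                # atom: '(' rhs ')' | NAME | STRING | '_'
        kind, value = self.peek()
        if (kind, value) == ("OP", "("):
            self.eat("OP", "(")
            r = self.rhs()
            self.eat("OP", ")")
            return r
        if kind not in ("NAME", "STRING"):
            raise SyntaxError("unexpected %r" % value)
        return self.eat(kind)      # '_' is lexed as a NAME and ends here too

GRAMMAR = """\
file_input: ( rule | NEWLINE )* ENDMARKER
rule: NAME ':' rhs NEWLINE
rhs: alt ( '|' alt )*
alt: item+
item: '[' rhs ']' | atom [ '*' | '+' ]
atom: '(' rhs ')' | NAME | STRING | '_'
"""

rules = GrammarReader(tokenize(GRAMMAR)).file_input()
print([name for name, body in rules])
# ['file_input', 'rule', 'rhs', 'alt', 'item', 'atom']
```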

2.3 Terminals and Terminals of Terminals

The rule descriptions above can be found in the directory

     EasyExtend/langlets/grammar_langlet/parsedef/Grammar

A complementary file for terminal definitions resides in the dual branch

     EasyExtend/langlets/grammar_langlet/lexdef/Token

This highlights another property of EasyExtend which is new in version 3.0. There is a symmetry between grammar rule definitions used for parsers and grammar rule definitions used by lexers. EasyExtend 3.0 doesn't use regular expressions to define lexical content. Instead EBNF grammars are used to define the terminals of the parser as well. The logical status of terminals / non-terminals becomes somewhat relative: the non-terminals of the lexical grammar are terminals of the parser.

This leads naturally to the question of a bottom line. In EasyExtend 3.0 the terminals of the terminals are character sets which are stated explicitly. In future versions this might be extended to characteristic functions of sets which leads to full UTF-8 support which is not yet available. So we have roughly the following hierarchy

TokenToken    Character sets and functionally specified terminals
Token         Terminals of Grammar ( EBNF )
Grammar       Production rules of Grammar ( EBNF )

The TokenToken objects usually have the form T_ANY, T_INDENT ... or A_WHITE, A_LINE_END, A_DIGIT ...
The T_Object terminals are functionally specified whereas the A_Object tokens are character sets.

2.4 Grammar.ext  and  Token.ext

A fundamental Grammar file and a Token file can be found in the root directory of EasyExtend. The Grammar file will always be the Grammar file of the current Python distribution i.e. the Grammar file of Python 2.5. The Token file will characterize the lexical structure of Python accordingly. You will have to look them up quite often but will never have to change them.

In order to define new rules for parsers and lexers the parsedef and lexdef directories of your langlet provide files Grammar.ext and Token.ext. Adding a new rule or modifying an existing rule always takes place in one of these files. If you remove e.g. Grammar.ext and place a Grammar file without the extension suffix in your parsedef directory EasyExtend won't create a parser for an extension language but one that is exactly suitable for the specified Grammar.

Just renaming a Grammar.ext file to Grammar might not have an immediate effect, though. That's because EasyExtend will only create a new parser when the Grammar file is more recent than the file parse_nfa.py that contains the parser rule definitions. Otherwise a new parser would be created on each startup. You have to open the Grammar file, apply some irrelevant change and save it. This modifies the timestamp of Grammar and EasyExtend will respond.
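
The timestamp check can be emulated in a few lines, and os.utime offers a way to bump the modification time without editing the file. The file names below are stand-ins created in a temporary directory, not EasyExtend internals.

```python
import os
import tempfile

def needs_regeneration(grammar, parse_nfa):
    """Rebuild the parser tables only if Grammar is newer than parse_nfa.py."""
    return os.path.getmtime(grammar) > os.path.getmtime(parse_nfa)

# Demo with two empty stand-in files in a temporary directory.
d = tempfile.mkdtemp()
grammar = os.path.join(d, "Grammar")
parse_nfa = os.path.join(d, "parse_nfa.py")
open(grammar, "w").close()
open(parse_nfa, "w").close()

os.utime(grammar, (0, 0))            # make Grammar look old
os.utime(parse_nfa, (1000, 1000))    # parse_nfa.py is newer
assert not needs_regeneration(grammar, parse_nfa)

os.utime(grammar, None)              # "touch" Grammar: modification time = now
assert needs_regeneration(grammar, parse_nfa)
```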

2.4.1 Token.ext for hex_langlet

We open Token.ext in the hex_langlet/lexdef directory and add the following rule:

# Token.ext of hex_langlet

PLUSPLUS: '++'


This is almost everything we have to do. However, the PLUSPLUS nonterminal must be hooked somehow into the existing token grammar. This is done by extending an existing rule.

# Token.ext of hex_langlet

OPERATOR: OPERATOR_DEF | PLUSPLUS
PLUSPLUS: '++'


The OPERATOR definition overwrites the definition given in the fundamental Token file.

2.4.2 Grammar.ext for hex_langlet

Next we change Grammar.ext of our langlet. First we define a new numbers non-terminal.

# Grammar.ext of hex_langlet

numbers: NUMBER+


and then we hook it into an existing rule.

# Grammar.ext of hex_langlet

atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`' testlist1 '`' |
       NAME | numbers | STRING+
numbers: NUMBER+


The new numbers non-terminal replaces the former NUMBER terminal. Otherwise the atom rule is just copied from the fundamental Grammar.

2.4.3 Operator precedences

Next we want to place the '++' operator into Grammar.ext. Note that you don't have to use the name PLUSPLUS but can use the content string '++' directly. EasyExtend will convert it into an appropriate numerical encoding automatically.

Currently EasyExtend doesn't produce a warning or an error when it finds a token string which it can't assign to a non-terminal definition. This might change in the future.

For insertion of  '++' we have to deal with operator precedences i.e. binding strengths of operators. In EasyExtend those are directly expressed in the structure of the grammar.

Example: a + b * c

One can start with a rule that does not respect binding strength:

expr: term (( '+' | '*') term)*
term: NAME | NUMBER

and refactor it such that '*' binds stronger than '+'

expr: term ('+' term)*
term: factor ('*' factor)*
factor: NAME | NUMBER

A top-down parser starts with expanding the term non-terminal, which has the internal product structure. Only when the expansion of term is finished does the parser proceed by either terminating the parse ( zero or more repetitions allow termination ) or looking for a '+' operator and a subsequent term.

The following general rule applies: the lower a nonterminal sits in the rule hierarchy, the stronger its operators bind.
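
To see how the grammar structure alone encodes precedence, here is a tiny hand-written recursive-descent parser for the refactored rules ( a sketch for illustration only, unrelated to Trail's actual implementation ):

```python
import re

def tokenize(source):
    # split "a + b * c" into names/numbers, operators, and an end marker
    return re.findall(r"\w+|[+*]", source) + ["$"]

def parse(tokens):
    """Recursive descent following the refactored grammar:
         expr: term ('+' term)*
         term: factor ('*' factor)*
         factor: NAME | NUMBER
    The nesting of the returned tuples shows that '*' binds
    stronger than '+'."""
    pos = [0]
    def peek():
        return tokens[pos[0]]
    def eat():
        tok = tokens[pos[0]]
        pos[0] += 1
        return tok
    def factor():
        return eat()
    def term():
        node = factor()
        while peek() == "*":
            eat()
            node = ("*", node, factor())
        return node
    def expr():
        node = term()
        while peek() == "+":
            eat()
            node = ("+", node, term())
        return node
    return expr()

print(parse(tokenize("a + b * c")))   # ('+', 'a', ('*', 'b', 'c'))
```

Because term fully expands before expr looks for '+', the product ends up nested below the sum, exactly as the general rule above predicts.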

Now let's refocus on '++'. It shall have a lower precedence than any arithmetic operation and a higher precedence than logical operations like and / or / not.

In the concrete Grammar file we can hook it into the comparison rule:

...
test: and_test ('or' and_test)* | lambdef
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison 
comparison: expr (comp_op expr)* | concat_expr
comp_op: '<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not' 'in'|'is'|'is' 'not'
expr: xor_expr ('|' xor_expr)*
...

This looks somewhat hackish but with respect to operator precedence comparison is an appropriate location to hook in our new concat_expr. The complete Grammar.ext file will look like:
# Grammar.ext of hex_langlet

comparison: expr (comp_op expr)* | concat_expr
concat_expr: expr ('++' expr)+

atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`' testlist1 '`' |
       NAME | numbers | STRING+
numbers: NUMBER+


Note that we used the rule form

concat_expr: expr ('++' expr)+

instead of

concat_expr: expr ('++' expr)*

Using the latter wouldn't harm us either, although it would introduce an ambiguity. The concat_expr rule has to be transformed anyway and if there was a single expr only we could wrap it into a comparison node. By using the + repetition we can omit handling this additional case. If you are otherwise worried about left factoring issues, lookaheads, irresponsible backtracking costs etc. I recommend reading the Trail documentation.

2.4.4  A second run

After defining the new grammar rules we repeat our initial session. The trace has slightly changed:

C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python run_hex_langlet.py
*** Modify C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet\lexdef\lex_nfa.py ***
*** Modify C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet\parsedef\parse_nfa.py ***
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00
0
hx> 0x00 0x00
Traceback (most recent call last):
  File "C:\lang\Python25\lib\site-packages\EasyExtend\eeconsole.py", line 256, in compile_cst
    _code = compile(src,"<input>","single", COMPILER_FLAGS)
  File "<input>", line 1
    0x00 0x00
            ^
SyntaxError: invalid syntax
hx>


The SyntaxError will now be reported by the compiler not the parser. This is consistent with our expectations because the parser knows the new syntax rules now and creates a parse tree. The compiler however doesn't know how to deal with the new nodes. This problem shall be addressed in the next section.

3. Defining Behaviour

So far we have just added new syntax to Python. In the next step we have to transform the parse tree to eliminate again all the nodes we have just added. That sounds silly but the Python compiler surely doesn't understand our extended parse trees. Given a CSTExt ( Concrete Syntax Tree = parse tree of our langlet ) we have to produce a valid CSTPy i.e. a Python parse tree. The complexity of the transformation T: CSTExt -> CSTPy depends entirely on the behaviour we want to express. The transformation isn't always simple but fortunately there is already a rich function library you can settle upon when building transformations.

EasyExtend 2.0 also had a high level macro-like transformer API. The implementation was very complex and I dropped it in EasyExtend 3.0. A hopefully much simpler implementation will likely re-enter EE in a later release. But even then it is worthwhile to learn transforming CSTs using just CST builder functions. This approach is lower level but more flexible and more general. Note also that transforming CSTs is quite comparable to transforming ASTs ( Abstract Syntax Trees ) given the right API i.e. EasyExtend's API :)

Like adding new grammar rules, transforming CSTs requires a good understanding of Python's grammar. The Grammar file is our central script and data sheet. Almost everything is derived from it: the parser, the CST builder functions and the CST validators. We are done very soon with token definitions but we always have to deal with the Grammar. After using EasyExtend for a while you will become very fluent in thinking in those grammar rules. It's a bit like learning a little programming language of its own. Well, the grammar is a declarative programming language.

3.1 Inspecting CSTs

You won't get anything done in EasyExtend without CST inspections. Unfortunately CSTs are huge and verbose even for quite small expressions. You can always unparse CSTs i.e. convert a CST back from tree structure into source code. You can mark certain nodes in CSTs for recovery purposes, and you can display CSTs before and after transformations. You can check your final CSTPy against the Python Grammar before compilation. Despite all this, debugging CSTs is not fun. Just don't say I didn't warn you.

3.1.1 Token Stream inspections

When you get an error message from the parser stating that a certain parse rule can't be applied, a good way to start is to examine the token stream. For displaying token streams, start your session with the -t option:

hex_langlet>python run_hex_langlet.py -t
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00 ++ 0x01
[token>
----------------------------------------------------------------------.
 Line  | Columns | Token Value         | Token Name    | Token Id     |
-------+---------+---------------------+---------------+--------------+
 1     | 0-4     | '0x00'              | NUMBER        | 36354 -- 2   |
 1     | 5-7     | '++'                | PLUSPLUS      | 36437 -- 85  |
 1     | 8-12    | '0x01'              | NUMBER        | 36354 -- 2   |
 1     | 12-12   | '\n'                | NEWLINE       | 36356 -- 4   |
 2     | 0-0     | ''                  | ENDMARKER     | 36352 -- 0   |
----------------------------------------------------------------------'
<token]

1
hx>


The token stream is perfect. Just note a curiosity here. The Python interpreter will execute 0x00 ++ 0x01 as 0x00 + (+0x01) which evaluates to 1. This happens because EasyExtend can't compile CSTs in interactive sessions. This isn't so much a limitation of EasyExtend but of Python ( unless you prove the opposite ). So EasyExtend has to unparse the transformed expression / statement first and send the unparsed source to the compiler. The unparsed source is a valid Python expression and will therefore be executed.

This is the reason why you can't rely on the console alone. When you pass a file test.phx to run_hex_langlet.py, EasyExtend will pass the CST directly to the bytecode compiler, which will raise an error. On the console, however, you can't rely on this positive behaviour. Just check out the expressions 1 ++ 1, 1+++1, 1++++1, etc. on an ordinary Python prompt and on the hex_langlet prompt with the -t option activated.
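
You can verify what the plain Python compiler does with such unparsed source yourself:

```python
# In plain Python '++' is not an operator: the source is parsed as a
# binary plus followed by one or more unary pluses.
assert eval("0x00 ++ 0x01") == 0x00 + (+0x01) == 1
assert eval("1 ++ 1") == 1 + (+1) == 2
assert eval("1 +++ 1") == 1 + (+(+1)) == 2
```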

Another important remark about the displayed table. In the Token Id column you see two numbers separated by two hyphens. These are node ids. A node id is a numerical value that characterizes a terminal or non-terminal uniquely within a langlet. The node id for NUMBER is 36354 in my hex_langlet. The node id of NUMBER in your hex_langlet might be entirely different. It doesn't matter: the absolute value has no meaning at all. Two things matter instead. The first is the range of node ids. The range always has a width of 512: 256 node ids are reserved for terminals and 256 for non-terminals. The range is characterized by an offset. You can determine the offset by typing LANGLET_OFFSET in your hx> console. In my case I get

hx> LANGLET_OFFSET
36352


This number is always divisible by 512. The other number that matters is the remainder of the division by 512. This is the second number in the Token Id column: 36354 % 512 = 2, 36437 % 512 = 85 etc. These numbers are so important because they are the ones the Python compiler actually understands. Suppose you have a parse tree for a normal Python statement e.g. a = 2+5 in the hex_langlet. The Python compiler can't do anything with the parse tree because it can't dispatch the node ids to actions in its compilation loop. What EasyExtend does is to project the CSTHex onto a CSTPy by taking the remainder of the division by 512 for each node id in CSTHex. That's like shifting the CSTHex by a constant amount. It also explains the constraints on the range: 256 + 256 is the partition the Python compiler uses for terminals + non-terminals.
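
The projection can be sketched in a few lines. The nested-list CST shape ( node id first, children after ) follows the numbers structure shown later in this tutorial; the node id values are taken from my session above and may differ in yours.

```python
# Sketch of the projection: every node id of the langlet CST is mapped
# into Python's own range by taking the remainder of the division by 512.
LANGLET_OFFSET = 36352          # value from the session above; yours may differ

def project(node):
    """Recursively replace each node id nid by nid % 512."""
    head = node[0] % 512
    rest = [project(item) if isinstance(item, list) else item
            for item in node[1:]]
    return [head] + rest

# numbers subtree in nested-list form, with node ids from this tutorial
numbers = [LANGLET_OFFSET + 340, [LANGLET_OFFSET + 2, "0x00", 1],
                                 [LANGLET_OFFSET + 2, "0x01", 1]]
print(project(numbers))   # [340, [2, '0x00', 1], [2, '0x01', 1]]
```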

3.1.2  CSTs before transformation

Inspecting token streams is a very first step but usually not the only one. The second step is to examine the parse tree just as it is produced by the Trail parser. The option for this is -b; b stands for before transformation.

hex_langlet>python run_hex_langlet.py -b
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00 ++ 0x01
[cst-before>

file_input  -- S`36609 -- 257
  stmt  -- S`36618 -- 266
    simple_stmt  -- S`36619 -- 267
      small_stmt  -- S`36620 -- 268
        expr_stmt  -- S`36621 -- 269
          testlist  -- S`36678 -- 326
            test  -- S`36655 -- 303
              ...
              concat_expr  -- S`36693 -- 341
                expr  -- S`36661 -- 309
                  ...
                  numbers  -- S`36692 -- 340
                    NUMBER  -- T`36354 -- 2     L`1
                      0x00
                  PLUSPLUS  -- T`36437 -- 85     L`1
                    ++
                  expr  -- S`36661 -- 309
                    ...
                    numbers  -- S`36692 -- 340
                      NUMBER  -- T`36354 -- 2     L`1
                        0x01
      NEWLINE  -- T`36356 -- 4     L`1
        '\n'
  ENDMARKER  -- T`36352 -- 0     L`2
    ''


<cst-before]
1
hx>

The tree is actually a little bigger but some nodes are omitted in the display because they would obscure the relevant information. We see the nodes we have defined: numbers, PLUSPLUS and concat_expr.

Those nodes can be highlighted using the -m option. The -m option has an argument that is a comma separated list of node names.

hex_langlet>python run_hex_langlet.py -b -m PLUSPLUS,numbers
_________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (
_________________________________________________________________________

hx> 0x00 ++ 0x01
[cst-before>

file_input  -- S`36609 -- 257
  stmt  -- S`36618 -- 266
    simple_stmt  -- S`36619 -- 267
      small_stmt  -- S`36620 -- 268
        expr_stmt  -- S`36621 -- 269
          testlist  -- S`36678 -- 326
            test  -- S`36655 -- 303
              ...
              concat_expr  -- S`36693 -- 341
                expr  -- S`36661 -- 309
                  ...
                  numbers  -- S`36692 -- 340       <-------
                    NUMBER  -- T`36354 -- 2     L`1
                      0x00
                  PLUSPLUS  -- T`36437 -- 85       <-------
                    ++
                  expr  -- S`36661 -- 309
                    ...
                    numbers  -- S`36692 -- 340       <-------
                      NUMBER  -- T`36354 -- 2     L`1
                        0x01
      NEWLINE  -- T`36356 -- 4     L`1
        '\n'
  ENDMARKER  -- T`36352 -- 0     L`2
    ''


<cst-before]
1
hx>

If you want to display the elided node symbols as well, use the --full-cst option in addition to the -b option.

3.1.3  CSTs after transformation

The next option we demonstrate is the option -a. a stands for after transformation.

C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python run_hex_langlet.py -a
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00 ++ 0x01
[cst-after>

file_input  -- S`257 -- 257
  stmt  -- S`266 -- 266
    simple_stmt  -- S`267 -- 267
      small_stmt  -- S`268 -- 268
        expr_stmt  -- S`269 -- 269
          testlist  -- S`326 -- 326
            test  -- S`303 -- 303
              ...
              concat_expr  -- S`341 -- 341       <-------
                expr  -- S`309 -- 309
                  ...
                  numbers  -- S`340 -- 340       <-------
                    NUMBER  -- T`2 -- 2     L`1
                      0x00
                  CONCAT  -- T`85 -- 85       <-------
                    ++
                  expr  -- S`309 -- 309
                    ...
                    numbers  -- S`340 -- 340       <-------
                      NUMBER  -- T`2 -- 2     L`1
                        0x01
      NEWLINE  -- T`4 -- 4     L`1
        '\n'
  ENDMARKER  -- T`0 -- 0     L`2
    ''


<cst-after]
1
hx>

All symbols that were not transformed are highlighted. This indicates that the CST generally won't compile; you should read the arrow markers as warnings.

3.1.4  CST validation

The next option is the -v option. Activate the -v option when you want to validate the final CSTPy against the Python grammar. Option -a is useful when you want to see all nodes that are yet untransformed. Option -v does not display all offending nodes but it can detect more subtle errors in the target CST like wrong tree structures.

C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python run_hex_langlet.py -v
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00 ++ 0x01
[cst-validation-test>

file_input  -- S`257 -- 257
  stmt  -- S`266 -- 266
    simple_stmt  -- S`267 -- 267
      small_stmt  -- S`268 -- 268
        expr_stmt  -- S`269 -- 269
          testlist  -- S`326 -- 326
            test  -- S`303 -- 303
              or_test  -- S`304 -- 304
                and_test  -- S`305 -- 305
                  not_test  -- S`306 -- 306
                    comparison  -- S`307 -- 307
                      concat_expr  -- S`341 -- 341   <====  No grammar conformance!

  ---->  Expected node(s): { expr -- S'309 }

                        expr  -- S`309 -- 309
                          xor_expr  -- S`310 -- 310
                            and_expr  -- S`311 -- 311
                              shift_expr  -- S`312 -- 312
                                arith_expr  -- S`313 -- 313
                                  term  -- S`314 -- 314
                                    factor  -- S`315 -- 315
                                      power  -- S`316 -- 316
                                        atom  -- S`317 -- 317
                                          numbers  -- S`340 -- 340
                                            NUMBER  -- T`2 -- 2     L`1
                                              0x00
                        CONCAT  -- T`85 -- 85     L`1
                          ++
                        expr  -- S`309 -- 309
                          xor_expr  -- S`310 -- 310
                            and_expr  -- S`311 -- 311
                              shift_expr  -- S`312 -- 312
                                arith_expr  -- S`313 -- 313
                                  term  -- S`314 -- 314
                                    factor  -- S`315 -- 315
                                      power  -- S`316 -- 316
                                        atom  -- S`317 -- 317
                                          numbers  -- S`340 -- 340
                                            NUMBER  -- T`2 -- 2     L`1
                                              0x01
      NEWLINE  -- T`4 -- 4     L`1
        '\n'
  ENDMARKER  -- T`0 -- 0     L`2
    ''


csttools.py:320: GrammarConformanceWarning: CST doesn't match target langlet grammar.
<cst-validation-test]
1
hx>

This is no news to us. We haven't yet transformed the CST and don't expect the projected CSTPy to conform to the grammar. Note that the validation process is based on Trail. Using the Python grammar and the input node comparison, the follow set of comparison contains only the node { expr -- S'309 }.

Check out the CST validator when typing 1, '1' etc.

3.1.5  Python source code after transformation

Each CST can be unparsed. You might display the unparsed CST using the option -p for python. Restart the console and take a look:

EasyExtend\langlets\hex_langlet>python run_hex_langlet.py -p
__________________________________________________________________________________

 hex_langlet

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

hx> 0x00 ++ 0x01
[python-source>
0x00 ++ 0x01

<python-source]
1
hx>

3.1.6  Summary

We briefly summarize the options we have used. Of course you can combine any of them arbitrarily.

 -a                          show CST after transformation
 -b                          show CST before transformation
 -p                          show unparsed python code after transformation
 -t                          show token sequence
 -v                          CST validation
 -m node_name[,node_name]*   mark nodes in the CST with the node name(s) given as parameters of the
                             -m option. Multiple node names have to be comma separated.

Exercise:

Replace the numbers non-terminal by a more specific hexnumbers non-terminal
# Grammar.ext of hex_langlet
...
atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`' testlist1 '`' |
       NAME | hexnumbers | NUMBER | STRING+
hexnumbers: Hexnumber+


and checkout what happens. Is hexnumbers always selected when building expressions 0x00, 0x00 ++ 0x01, 0x00 0x01 ... ?


4. The LangletTransformer

4.1  Specify the numbers transformation

A numbers nonterminal always has the structure

     [symbol.numbers, [token.NUMBER, "0x00", line-no], [token.NUMBER, "0x01", line-no],...].

The transformation of numbers is specified by those two rules:
  • When a single NUMBER terminal is found and the value starts with "0x" it shall be replaced by a Hex object. Otherwise it shall not be translated ( e.g. "1" shall not be translated into a Hex ).
  • When a blank separated sequence of NUMBER terminals of at least length = 2 is found each terminal value shall be translated into a Hex object independent of all prefix considerations.
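
The two rules can be expressed as a small pure function on the NUMBER token values of one numbers node. The emitted Hex(...) source form is an assumption for illustration; the actual transformer builds CST nodes rather than source strings.

```python
def translate_numbers(values):
    """Sketch of the two numbers rules above, applied to the list of
    NUMBER token values found in one numbers node."""
    if len(values) == 1:
        v = values[0]
        # rule 1: a single literal becomes a Hex object only with the 0x prefix
        return "Hex('%s')" % v if v.startswith("0x") else v
    # rule 2: a sequence of two or more literals always becomes one Hex object
    return "Hex('%s')" % " ".join(values)

print(translate_numbers(["1"]))             # 1
print(translate_numbers(["0x00"]))          # Hex('0x00')
print(translate_numbers(["0x00", "0x01"]))  # Hex('0x00 0x01')
print(translate_numbers(["1", "2"]))        # Hex('1 2')
```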

4.2  The langlet.py module

The canonical langlet.py layout is rather sparse. The conf module contains fixed langlet specific settings and all imports other than csttools. The transformation of CSTHex  is encoded in the LangletTransformer class of  langlet.py. We ignore the other transformers. Finally there is the __publish__ attribute. Each name found in the __publish__ list is the name of a langlet attribute. Each of the corresponding attributes is added to the builtin namespace. This makes them truly visible globally.

from conf import *                 # langlet specific modules and settings
from EasyExtend.csttools import *  # tools used to manipulate CSTs

class LangletImporter(eeimporter.Importer):
    '''
    Defines langlet specific import behaviour.
    '''

class LangletTokenizer(eetokenizer.Tokenizer):
    '''
    Defines langlet specific token stream processor.
    '''

class LangletTransformer(Transformer):
    '''
    Defines langlet specific transformations
    '''

__publish__ = []


In the first step we add a transformation rule to our LangletTransformer, which is a method named after the terminal or nonterminal that shall be transformed. To mark it as a transformation rule it has to be decorated with the @transform decorator; commenting out the @transform decorator makes the method invisible to node dispatch. We also call methods decorated with @transform node handlers. Each node handler takes a single argument.

class LangletTransformer(Transformer):
    '''
    Defines langlet specific transformations
    '''
    @transform
    def numbers(self, node):
        '''
        numbers: NUMBER+
        '''


As an additional aid we insert the grammar rule corresponding to the node handler as a docstring. This is just a convention and has no functional meaning.

4.3  Find subnodes

Next we check the sequence length of NUMBER terminals in numbers. If the length equals 1 the prefix of the first NUMBER will be  inspected according to our spec.

class LangletTransformer(Transformer):
    '''
    Defines langlet specific transformations
    '''
    @transform
    def numbers(self, node):
        '''
        numbers: NUMBER+
        '''
        _numbers = find_all(node, token.NUMBER)   # find_all returns all NUMBER nodes found as
                                                  # subnodes of node
        if len(_numbers) == 1:
            if _numbers[0][1].startswith("0x"):   # take NUMBER and inspect the shape
                pass                              # implement case of 0xYYYY...
            else:
                pass                              # implement case of decimal or octal number
        else:
            pass                                  # implement case of a sequence of numbers


Searching for subnodes of a node is done with the functions find_node and find_all.

find_node( cst, node_id, level=10000)
    Seeks a node with id = node_id in cst. The search is depth-first; the first node found is returned, or None if there is no match. The maximal depth of the searched CST is constrained by level, which has a high value by default. If level = 1 only the immediate subnodes of cst will be inspected.

find_all( cst, node_id, level=10000)
    Seeks all nodes with id = node_id in cst and returns them as a list. If no node is found the empty list is returned. The search is depth-first. The maximal depth of the searched CST is constrained by level, which has a high value by default. If level = 1 only the immediate subnodes of cst will be inspected.

Most of the time just a few search operations suffice to collect all the data needed to start or constrain a transformation.
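To make the search semantics concrete, here is a sketch of find_all over the list-shaped CSTs used throughout this tutorial. EasyExtend's real implementation lives in csttools; this is a simplified reconstruction:

```python
def find_all(cst, node_id, level=10000):
    # depth-first collection of all subnodes of cst whose id matches node_id;
    # level = 1 restricts the search to the immediate subnodes
    found = []
    if level < 1:
        return found
    for sub in cst[1:]:
        if isinstance(sub, list):
            if sub[0] == node_id:
                found.append(sub)
            found.extend(find_all(sub, node_id, level - 1))
    return found

# a toy CST: a node 259 containing one nested and one immediate terminal node
cst = [259, [260, [2, "0x00", 1]], [2, "0x01", 1]]
find_all(cst, 2)           # both terminal nodes, depth-first order
find_all(cst, 2, level=1)  # only the immediate subnode
```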

4.4  Create new CSTs

Building new CST nodes using elementary CST builder functions is possible; in fact all CST synthesis relies on them. Using them isn't concise though. EasyExtend therefore contains a higher level API of functions whose names start with "CST_". They are defined in cstgen.py and simplify the creation of new nodes.

if len(_numbers) == 1:
    if _numbers[0][1].startswith("0x"):
        return CST_CallFunc("Hex", [_numbers[0]])  # returns a node of type symbol.power
    else:
        return atom(_numbers[0])                   # do not wrap the number by Hex; substitute
                                                   # the numbers nonterminal by an atom
else:
    pass                                           # implement case of a sequence of numbers


The CST_CallFunc function can also be applied to argument lists.

if len(_numbers) == 1:
    if _numbers[0][1].startswith("0x"):            # take a NUMBER and inspect its form
        return CST_CallFunc("Hex", [_numbers[0]])  # returns a node of type symbol.power
    else:
        return atom(_numbers[0])                   # do not wrap the number by Hex; substitute
                                                   # the numbers nonterminal by an atom
else:
    return CST_CallFunc("Hex", _numbers)


The _numbers are passed as arguments to the Hex constructor.
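The net effect: a blank-separated sequence like 0x01 0x02 0x03 becomes a single call Hex(0x01, 0x02, 0x03) in the generated source. A toy renderer makes this visible (illustrative only; the real code generation builds CST nodes, not strings):

```python
def render_call(func_name, number_tokens):
    # number_tokens are [token_id, value, lineno] triples as found in a numbers node
    args = ", ".join(tok[1] for tok in number_tokens)
    return "%s(%s)" % (func_name, args)

tokens = [[2, "0x01", 1], [2, "0x02", 1], [2, "0x03", 1]]
render_call("Hex", tokens)  # 'Hex(0x01, 0x02, 0x03)'
```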

Exercise: use the option -p to inspect the translated code. Also check out the options -b and -a to get a feeling for raw and transformed CSTs. Python will obviously complain about not finding Hex. That's o.k. at this stage since we have not yet added the functionality to our transformation. This will be done in the next step.

4.5  The Hex class

We implement a rudimentary Hex class. Besides __init__ and __repr__ only the methods __concat__ and __rconcat__, which implement the functionality of the PLUSPLUS terminal ( the "++" operator ), are defined. Additional methods can be added on your own behalf.

class Hex(object):
    def __init__(self, *args):
        self.bytes = []
        for arg in args:
            if isinstance(arg, (int,long)):
                self.bytes.append(self.create_from_int(arg).bytes)
            elif isinstance(arg, Hex):
                self.bytes.append(arg.bytes)
            else:
                raise TypeError,"cannot convert to Hex object: '%s'"%arg
        self.bytes = sum(self.bytes,[])

    @classmethod
    def create_from_int(cls, i):
        assert isinstance(i,(int,long)), "type of i is %s"%type(i)
        h = Hex()
        h.bytes = []
        v = i
        if v == 0:
            h.bytes = [0]
        else:
            while v:
                v,r = divmod(v,0x100)
                h.bytes.insert(0,r)
        return h

    def __len__(self):
        return len(self.bytes)
    
    def __concat__(self, other):
        if isinstance(other, Hex):
            h = Hex()
            h.bytes = self.bytes+other.bytes
            return h
        else:
            h = Hex()
            h.bytes = self.bytes+Hex(other).bytes
            return h
   
    def __rconcat__(self, other):
        return Hex(other).__concat__(self)

    def __repr__(self):      
        return " ".join(["0x%02X"%item for item in self.bytes])
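The core of the class is the base-256 decomposition in create_from_int. Stripped of the class machinery (and written so it also runs on modern Pythons), it is just:

```python
def int_to_bytes(i):
    # repeatedly split off the least significant byte; most significant byte first
    if i == 0:
        return [0]
    out = []
    while i:
        i, r = divmod(i, 0x100)
        out.insert(0, r)
    return out

int_to_bytes(0x010203)  # [1, 2, 3]
int_to_bytes(0)         # [0]
```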


4.5.1  Define operator_concat

The Hex object implements only the protocol methods __concat__ and __rconcat__. Now we have to define the runtime function that is called whenever a '++' operator is detected.


def operator_concat(*args):
    '''
    Defines behaviour of '++'.  
    '''
    a,b = args[:2]
    if isinstance(a, (basestring, list, tuple)):
        res = a+b
    else:
        if hasattr(a,"__concat__"):
            res = a.__concat__(b)
        elif hasattr(b,"__rconcat__"):
            res = b.__rconcat__(a)
        else:
            raise TypeError, "unsupported operand type(s) for ++: '%s' and '%s'"%(type(a), type(b))
    if args[2:]:
        return operator_concat(*(res,)+args[2:])   # recursion on sequences
    return res
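Leaving the EasyExtend specifics aside, the dispatch logic can be exercised standalone. The snippet below is a Python 3 flavoured restatement (str in place of Python 2's basestring) so you can experiment with it in any interpreter:

```python
def operator_concat(*args):
    a, b = args[:2]
    if isinstance(a, (str, list, tuple)):
        res = a + b                    # builtin sequences: '++' behaves like '+'
    elif hasattr(a, "__concat__"):
        res = a.__concat__(b)          # the left operand drives the protocol
    elif hasattr(b, "__rconcat__"):
        res = b.__rconcat__(a)         # the right operand as fallback
    else:
        raise TypeError("unsupported operand type(s) for ++")
    if args[2:]:
        return operator_concat(res, *args[2:])   # recursion on longer argument lists
    return res

operator_concat("ab", "cd")       # 'abcd'
operator_concat([1], [2], [3])    # [1, 2, 3]
```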


In the final transformation step we have to implement the transformer of the concat_expr node.

class LangletTransformer(Transformer):
    '''
    Defines langlet specific transformations
    '''


    @transform
    def concat_expr(self, node):
        '''
        comparison: expr (comp_op expr)* | concat_expr
        concat_expr: expr ('++' expr)+
        '''
        exprs = find_all(node, symbol.expr, level = 1)       # extract expressions
        concat_call = CST_CallFunc("operator_concat", exprs) # operator_concat(exp1, ..., expk)
        return any_node(concat_call, symbol.comparison)      # wrap concat_call back into a
                                                             # comparison node



Exercise: we used any_node to wrap a node of type symbol.power into symbol.comparison. Check out other wrapper functions like any_test and any_expr and observe what happens.

4.6  The __publish__ attribute

Whenever a list of numbers or a hexadecimal number is typed into a hex_langlet script it will be immediately converted into a Hex object, and whenever a "++" operator is applied it will be converted into an operator_concat function call. But neither Hex nor the operator_concat function is globally visible unless imported explicitly. To ensure global visibility without a module import in the language we intend to use, we can put Hex and operator_concat into the root namespace __builtins__. To simplify this even more, each langlet has a __publish__ attribute that keeps a list of names. In the case of the hex_langlet we define:

                __publish__ = ["Hex", "operator_concat"]

This is not unlike the optional __all__ module attribute that regulates the publicly visible names on module imports.
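The underlying mechanism can be imitated in a few lines. The sketch below (the helper names are made up; EasyExtend's actual bookkeeping differs) injects published names into the builtin namespace, so every module sees them without an import:

```python
import builtins  # __builtin__ on the Python 2.x this tutorial was written for

def publish(namespace, names):
    # copy each published attribute into the builtin namespace
    for name in names:
        setattr(builtins, name, namespace[name])

def Hex(*args):          # stand-in for the real Hex class
    return list(args)

publish({"Hex": Hex}, ["Hex"])
# from now on Hex is reachable everywhere, just like len or max
```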

5.  hex_langlet modules

We have completed our langlet definition. Now we want to write scripts using the specific hex_langlet syntax and execute them as langlet modules.

# module demo.phx 
# ----------------

def add_padding(arg, pad = 0x00, size = 8, right_pad = True):
    res = arg
    for i in range(size-len(arg)%size):
        if right_pad:
           res = res ++ pad   # right padding
        else:
           res = pad ++ res   # left padding
    return res

if __name__ == '__main__':
    print add_padding( 0x01 0x02 0x03 )
    print add_padding( 0x01 0x02 0x03 , pad = 0x80)
    print add_padding( 0x01 0x02 0x03 , pad = 0x80, right_pad = False)


The function add_padding adds leading or trailing bytes to a Hex object. Now we call demo.phx from fiber.py.

c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python fiber.py demo.phx
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03

There is something remarkable about the result, which might not be too obvious at first sight:
the if __name__ == '__main__': ... section of demo.phx is executed although in this case demo is not __main__ but the langlet is! The explanation can be found when we look at the translated code:

hex_langlet>python run_hex_langlet.py -p --re-compile demo.phx
[python-source>

import EasyExtend.langlets.hex_langlet as __langlet__
import EasyExtend.eecommon ;
EasyExtend.eecommon.init_langlet(__langlet__)
 
def add_padding(arg, pad = Hex( 0x00 ) , size = 8, right_pad = True):
    res = arg
    for i in range(size-len(arg)%size):
        if right_pad:
            res =operator_concat(( res) ,( pad ) ,)
        else:
            res =operator_concat(( pad) ,( res ) ,)
    return res


def __like_main__ ( ):
    print add_padding( Hex( 0x01, 0x02, 0x03 ))
    print add_padding( Hex( 0x01, 0x02, 0x03 ) , size = 11, pad = Hex( 0x80 ))
    print add_padding( Hex( 0x01, 0x02, 0x03 ) , pad = Hex( 0x80 ) , right_pad = False )

if __name__ == '__main__':
    __like_main__ ( ) ;

<python-source]
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03

During code transformation the __like_main__() function has been created. The idea of __like_main__ is that it can be called from the langlet and executes all statements in the function body that are otherwise defined in the if __name__ == '__main__': block.
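The trick itself is ordinary Python: once the statements live in a function instead of the guarded block, an importing framework can trigger the "script" behaviour explicitly. A minimal imitation (names illustrative):

```python
executed = []

def __like_main__():
    # statements that would normally sit in the "if __name__ == '__main__'" block
    executed.append("main block ran")

if __name__ == '__main__':
    __like_main__()

# an importer (here: the langlet runtime) can also call it on demand:
__like_main__()
```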

Two additional remarks.

1.  We introduced a new command line option --re-compile. EasyExtend doesn't translate the phx module again if the compiled pyc version is more recent. So we have to enforce re-compilation to highlight aspects of the transformation process.

2. The top level import statements are generated as well. Only with them is it possible to run demo.pyc directly from the command line:

              hex_langlet> python demo.pyc

     Since demo.pyc can be executed directly by Python you can also import demo from arbitrary Python modules.

6.  Epilogue

Walking through the hex_langlet example was rather straightforward, at least if one knows the system. Struggling with syntax definitions or wrong transformations is the norm though. I want to mention just briefly a few more command line options that aid the development process, and an alternative to the @transform decorator we used in the LangletTransformer.

Two additional command line options are

  --dbg-lexer  show lexer trace
  --dbg-parser show parser trace

My recommendation is to use them as a last resort in a normal development session. They show some aspects of the internal workings of the lexer and parser, and they helped me identify errors in Trail's NFA expansion mechanism. So they are somewhat equivalent to core dumps or assembly inspection.

More useful in casual work is the replacement of the @transform decorator by the @transform_dbg decorator. @transform_dbg takes a string argument that contains two-letter commands describing various display options.

      @transform_dbg("ci co si")   # ci = input cst
                                   # co = output cst
                                   # si = input source

The complete overview of transform_dbg can be found either in the API docs, in the function definition ( EasyExtend/eetransformer.py ), or here.