Easy Extend

Author: Kay Schluehr
Mail: easyextend@fiber-space.de
Date of creation: 16.05.2006
Last update: 23.03.2007
Version: 2.0 - alpha1

Introduction

This document introduces EasyExtend, a metaprogramming system for Python. It sketches the main lines of thought behind EasyExtend and its design ideas.

What is EasyExtend?
  • EasyExtend (EE) is a preprocessor generator and metaprogramming framework written in pure Python and integrated with Python. The main purpose of EasyExtend is the creation of extension languages, i.e. adding custom syntax and semantics to Python. EasyExtend comes out of the box as a single package with no dependencies other than the Python standard library.
  • EasyExtend respects separation of concerns. This virtue is in fact enforced by EE's architecture. You can't use EasyExtend to define syntax/code transformations from within usual application code, although this might be possible by creating a special extension language that permits this kind of transformation. Programming programming languages is a distinct activity.
  • EasyExtend is grammar based like Lex/Yacc and ANTLR. This makes EasyExtend quite heavyweight but also very flexible. In principle you can define arbitrary languages using EasyExtend, not just Python extensions. Extension languages created using EasyExtend shall be considered as external domain specific languages (DSLs), in contrast to embedded DSLs as in Lisp or Ruby.
  • If you are more conservatively minded and believe that non-standard extensions of the Python language are evil, but static analysis, refactoring tools, code coverage etc. are useful, you might consider code analysis + generation using EasyExtend.
  • If you are more experimentally minded or even visionary and think about unifying existing languages and paradigms within one or a small set of languages that play well together, you might be interested in the concepts of a fiberspace, of languages as components and source level integration.


Contents:

  • Source tree overview and installation advice.
  • About tokenizers, parsers and syntax tree constructors.
  • Extension languages - how to create and transform them. The fiberspace architecture allows combining them.
  • Caveats and prospects.


1. Getting Started

1.1 Installation

EasyExtend is distributed as a Python package using distutils. It will be installed in the site-packages directory of your Python distribution using a distutils aware setup.py script. The EasyExtend package can be downloaded here. The documentation as well as the tests are distributed separately.

Installing EasyExtend requires typing 

python setup.py install

on your Unix shell or DOS box prompt. Under Windows you might use the Windows installer executable instead.

Documentation and tests are packed together in a separate zip or tar.gz file.

[DocsAndTests 2.0-alpha1]+--[doc]
                         +--[EasyExtend]

The docs are a somewhat stripped-down offline version of this site. Copy the tests directly into the site-packages directory, over the just installed EasyExtend package.

After installation you should see roughly the following source tree:

  [site-packages]
      +- [EasyExtend]
          +- [fibers]
              +- [coverage]
              +- [gallery]
                  +- __init__.py
                  +- conf.py
                  +- fiber.py
                  +- PyGrammar.py
                  +- Grammar
                  +- [fibermod]
                      +- __init__.py
                      +- symbol.py
                      +- token.py
              +- [macro]
              +- [Py25Lite]
              +- [teuton]
              +- [transit]
          +- [parser]
          +- [util]
          +- fs
          +- __init__.py
          +- cst.py
          ...

1.2 First run

Switch to the directory fibers\gallery below the EasyExtend package ( see the file tree above ) and type python fiber.py in your console.

C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python fiber.py
___________________________________________________________________________________

 Gallery

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
 Fiber documentation: www.fiber-space.de/EasyExtend/doc/gallery/gallery.html
___________________________________________________________________________________

gal> a = 1
gal> switch a:                       # first example - the switch statement
....    case 0:                      # for more information visit Gallery
....        print "nok"
....    case 1:
....        print "ok"
....
ok
gal>



Each fiber defines its own language encoded mainly in the fiber.py module of a fiber package. The fiber.py module is also the entry point for calling extension language modules.

C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python fiber.py test\test_gallery.py

test_chainlet (test.test_gallery.TestGallery) ... ok
test_importer (test.test_gallery.TestGallery) ... ok
test_mixed (test.test_gallery.TestGallery) ... ok
test_on_stmt_1 (test.test_gallery.TestGallery) ... ok
test_on_stmt_2 (test.test_gallery.TestGallery) ... ok
test_repeat_until_1 (test.test_gallery.TestGallery) ... ok
test_repeat_until_2 (test.test_gallery.TestGallery) ... ok

----------------------------------------------------------------------
Ran 7 tests in 0.000s

OK

1.3 Learning EasyExtend

The best way to start doing something useful with EasyExtend might be reading the EasyExtend tutorial and working through the presented example. The following documentation partially overlaps with the content of the tutorial. This is due to the fact that most documents you can read here are actually meant as design documents written in an introductory style.

2. Grammars and CSTs


2.1 Concrete vs abstract syntax

EasyExtend is grammar based. This means that any proper language extension requires a modification of Python's Grammar file. Each fiber has its own copy of Python's Grammar, and each modification of it causes a re-generation of the fiber's PyGrammar.py module, which contains the parser tables derived from the grammar.
         
When you parse an expression in Python 2.4 using the parser module of the std-lib and convert the parser.st output object into a list, you usually get long chains of nested lists, each one starting with a number representing a node.

>>> import parser
>>> parser.expr("1+2")
<parser.st object at 0x0099D0F0>
>>> parser.expr("1+2").tolist()
[258, [320, [298, [299, [300, [301, [303, [304, [305, [306, [307, [308, [309, [310, [311, [2, '1']]]]], [14, '+'], [308, [309, [310, [311, [2, '2']]]]]]]]]]]]]]], [4, ''], [0, '']]



A number of this kind is called a node id within this document. Node ids are the types of the nodes used in the parse tree. For Python these node ids can be found in the automatically generated modules symbol.py and token.py. They are fixed for a particular version of Python and must never be changed, since the compiler relies on them. A parse tree created by the parser module is called a concrete syntax tree or CST within this document. Other than an abstract syntax tree (AST), a CST directly reflects the structure of the underlying grammar and also the syntax of an expression. A CST is equivalent to concrete syntax and you can reconstruct the source code almost faithfully from it ( line comments are skipped, and whitespace that is not relevant for indentation is ignored ).
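The structure above can be explored with a short, self-contained sketch. It rebuilds the CST of "1+2" as nested lists (using a small helper instead of typing all the brackets out) and extracts the terminal leaves. The terminal ids used here (NUMBER = 2, PLUS = 14, NEWLINE = 4, ENDMARKER = 0) match the values in the stdlib token module:

```python
import token

def chain(ids, leaf):
    # wrap `leaf` in single-child nodes with the given ids, outermost first
    for i in reversed(ids):
        leaf = [i, leaf]
    return leaf

# Rebuild the nested-list CST of "1+2" shown above.
one   = chain([308, 309, 310, 311], [2, '1'])
two   = chain([308, 309, 310, 311], [2, '2'])
arith = [307, one, [14, '+'], two]
cst   = [258, chain([320, 298, 299, 300, 301, 303, 304, 305, 306], arith),
         [4, ''], [0, '']]

def terminals(node):
    """Yield (token_name, token_string) for every terminal leaf of the CST."""
    if node[0] < 256:                 # terminal node ids are below 256
        yield token.tok_name[node[0]], node[1]
    else:
        for child in node[1:]:
            for leaf in terminals(child):
                yield leaf

print(list(terminals(cst)))
```

Running this lists the token stream NUMBER '1', PLUS '+', NUMBER '2' followed by NEWLINE and ENDMARKER, i.e. exactly the leaves of the long nested list above.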

Example:


class A:pass
class A:
    pass

These two statements have the same semantics but different concrete syntax. The difference between both will be present in the CST but not in the AST description.

EasyExtend is based on the CST description. The main reason for this decision is the simple fact that abstract syntax requires an additional transformation step starting from a CST. But the whole point of EasyExtend is to define transformations based on a modified Grammar, so we would have to define two transformations for each fiber: one from a CST to an AST and a second from an AST to another AST that can finally be compiled. EasyExtend enables only CST -> CST transformations and hides most of the nitty-gritty details of CSTs from the developer using an operational approach. In the rare occasions where the difference between abstract and concrete syntax really matters, as in the example above, one may apply normalize(), which is defined in csttools.py. The normal form is always the most general form: in case of a suite it is a list of statements, not an expression within a single line.

2.2 Grammar extensions

Python's syntax is not completely defined by Grammar; some aspects are factored into the tokenization process. This has the advantage of keeping Python's Grammar very simple - it is in fact LL(1). The strongest obstacle for parsing Python, namely its indentation rules, is thus only residually present in the Grammar, as INDENT and DEDENT tokens that can be considered moral equivalents of opening and closing braces. On the other hand, this split means that at least two files need to be changed when both new nonterminals and new terminals of the grammar shall be introduced.

2.2.1  grammarObj generation

The EBNF grammar used by CPython and EasyExtend can be found in the file Grammar. Each fiber has its own version of this Grammar file. It contains a number of grammar rules for the LL(1) parser.

 .
 .
 .
fplist: fpdef (',' fpdef)* [',']
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: expr_stmt | print_stmt  | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | exec_stmt | assert_stmt
expr_stmt: testlist (augassign (yield_expr|testlist) | ('=' (yield_expr|testlist))*)
augassign: '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '<<=' | '>>=' | '**=' | '//='
# For normal assignments, additional restrictions enforced by the interpreter
print_stmt: 'print' ( [ test (',' test)* [','] ] | '>>' test [ (',' test)+ [','] ] )
del_stmt: 'del' exprlist
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'
return_stmt: 'return' [testlist]
 .
 .
 .

These rules form a declarative syntax description language: a kind of domain specific language for language syntax. The Grammar reader is implemented in parser/PgenParser.py. The output of PgenParser.py serves as input for parser/PyPgen.py, which creates the parser table used by parser/DFAParser.py to parse source code. The parser table is a Python tuple data structure grammarObj, stored as plain text in the PyGrammar.py module belonging to the Grammar. So we always have a mapping Grammar -> PyGrammar.py, and whenever it is noticed that Grammar has changed, a new grammarObj is created and stored in PyGrammar.py.

2.2.2  Rule Sequence

The sequence of rules as they are placed in Grammar may not be changed. Whenever you want to add a new grammar rule you have to append it at the end of the file. You can also add content to an existing rule and delete existing clauses, but you may never insert a rule into or delete a rule from the middle of Grammar.

The reason for this restriction is compiler compliance. The CPython compiler deals exclusively with the numerical node id corresponding to a rule. This id can be considered as an index + offset into an array of rules. Changing the index has fatal consequences and causes a SyntaxError when detected during compilation.
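The index + offset scheme can be illustrated in a few lines of plain Python (the rule list is abbreviated; this is not EasyExtend code):

```python
# Nonterminal ids are assigned by position: 256 + index in the Grammar file.
rules = ["single_input", "file_input", "eval_input", "decorator"]
ids = {name: 256 + i for i, name in enumerate(rules)}
assert ids["single_input"] == 256

# Inserting a rule in the middle shifts every later id -- which is why new
# rules may only be appended at the end of the file.
rules_bad = rules[:1] + ["my_new_rule"] + rules[1:]
ids_bad = {name: 256 + i for i, name in enumerate(rules_bad)}
assert ids_bad["file_input"] == 258     # no longer 257: the compiler would break
```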

            
single_input = 256
file_input = 257
eval_input = 258
decorator = 259
...



Sometimes, however, you want just a minimalist grammar. As an example consider the grammar of Grammar:

# Grammars grammar

single_input: NEWLINE
file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING

This file describes the grammar of any Grammar file. Note that only rules from the end of the Grammar file are deleted. The single_input rule is a dummy, a placeholder used to preserve the minimal sequence.

            
single_input = 256
file_input = 257

As you might suspect, source code parsed with this particular Grammar is mapped directly to Python objects for further treatment, not compiled to bytecode first.

2.2.3  Appending rules, inserting clauses

A grammar is just another declarative little language. The following example shows a modified Grammar script where four rules are appended at the end of the file.

# Grammar for Gallery

...

# A modified Python grammar rule
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | funcdef | classdef | on_stmt |
               repeat_stmt | switch_stmt

...

testlist1: test (',' test)*

# not used in grammar, but may appear in "node" passed from Parser to Compiler
encoding_decl: NAME


######## ----- new grammar definitions   ------- #########

on_stmt: 'on' NAME '=' test ':' suite ['else' ':' suite]
repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT | test NEWLINE )

switch_stmt: 'switch' expr ':' NEWLINE INDENT case_stmt DEDENT ['else' ':' suite]
case_stmt: 'case' expr ':' suite ('case' expr ':' suite)*


Each of these rules has to be connected to the rest of the grammar. This is done by inserting their identifiers as clauses into compound_stmt. You are free to add them anywhere this makes sense. If you construct an ambiguous grammar, the DFAParser will complain about it ( printing the message "XXX ambiguity!" ).

2.3 CST manipulation

Transformation of CSTs involves analysis and synthesis of CSTs. Analysis requires just a small set of generic functions plus knowledge about the structure of a grammar rule, which can be read directly from the Grammar file. CST synthesis is the manual creation of valid CSTs from CST fragments. While there is a canonical base to start with for synthesis, namely functions associated with node types - for each node type there is a function that creates the node accordingly - using this base directly can be tedious and error-prone. It is also hard to read. In version 1.0 of EasyExtend this base was everything supplied for CST synthesis, and the result can still be examined in EasyExtend/fibers/gallery/fiber.py. In the meantime we investigated several approaches that shall be introduced in subsequent sections. But let's do some groundwork first.

2.3.1 Elementary node constructors

Each node of the Python Grammar has a corresponding function in the module cst.py. For example the grammar rule

    simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

has a corresponding function

    def simple_stmt(*args):
        ....

which creates exactly nodes of the type simple_stmt. We call these functions elementary node constructors. The arguments passed into the simple_stmt node constructor are nodes of type small_stmt and only those. Do not pass semicolons or NEWLINE into simple_stmt, since these are added automatically. As a general rule, for any cst function the following holds:

Whenever a terminal symbol can be derived from the context you must not pass it into a cst function.

One can easily see when to pass colons, commas, parens etc. Those rules have the following structure:

    rule: T1 NT1 ... | T2 NT1 .. | T3 NT1 ...

where the Ti are terminals and NT1 is a non-terminal. You need to pass a terminal to select the concrete clause of the rule.

Example:

    atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}'....

Here braces, brackets or parens must be supplied. The pairs '(' ')', '[' ']' and '{' '}' already characterize the node completely, because the enclosed node is optional.

Example of usage:

from EasyExtend.cst import *

n_42 = atom(Number(42))
n_2  = factor(power(atom(Number(2))))
simple_stmt(small_stmt(expr_stmt(test(...(arith_expr(term(factor(power(n_42, n_2)))))...))))

# o.k. stop here ...


As said above, do not use elementary node constructors in your fiber definition directly unless you can't avoid them - and most of the time you can.
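For illustration, here is a rough sketch of how such an elementary node constructor could be written. The numeric ids are made up; the real ones come from the generated symbol.py and token.py modules, and this is not EasyExtend's actual code:

```python
# Illustrative ids only -- real ids come from the generated modules.
SIMPLE_STMT, SMALL_STMT, SEMI, NEWLINE = 266, 267, 13, 4

def simple_stmt(*small_stmts):
    """simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

    Callers pass only the small_stmt nodes; the ';' separators and the
    trailing NEWLINE are derivable from the context and added automatically.
    """
    node = [SIMPLE_STMT]
    for i, stmt in enumerate(small_stmts):
        if i:
            node.append([SEMI, ';'])
        node.append(stmt)
    node.append([NEWLINE, '\n'])
    return node

small = [SMALL_STMT, [1, 'pass']]     # stand-in child node
print(simple_stmt(small, small))
```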

2.3.1.1  Decorating CST functions

Each elementary CST function is decorated with py_cst. The reason for this is that the node ids for a particular node are not unique! Indeed a whole spectrum of possible node ids is assigned to the same node - one for each fiber ( see Fibers + Fiberspace for more information ).

2.3.2 Node wrapper

The test node represents a kind of upper limit of expressions in Python. Each node subsequent to the test node is guaranteed to be a terminal or an expression. This implies that almost all nodes that use an expression also use a test node or a node subsequent to it. For wrapping a node into a test node the function any_test is defined in csttools.py:

any_test( node)
    Returns a test node whenever the input is one of the following nodes: test, and_test, lambdef, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom, Number, Name, String.

Similarly any_stmt is defined for wrapping statements.

any_stmt( node)
    Returns a correctly built stmt node whenever the input is one of the following nodes: stmt, simple_stmt, compound_stmt, small_stmt, if_stmt, for_stmt, while_stmt, try_stmt, break_stmt, continue_stmt, return_stmt, raise_stmt, yield_stmt, expr_stmt, print_stmt, del_stmt, pass_stmt, flow_stmt, import_stmt, global_stmt, exec_stmt, assert_stmt.

Finally we can combine both wrapper functions and provide an additional selector:

any_node( node, node_id)
    Combines any_test and any_stmt but returns a node of type node_id instead of test or stmt.

Nodes that fall into neither the test nor the stmt category can't be wrapped. In that case one has to use elementary node constructors.
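The wrapping idea can be sketched as follows. The chain corresponds to the single-child expression rules of the Python 2.4 grammar listed above, but the numeric ids and the code are illustrative, not EasyExtend's:

```python
# Single-child expression chain of the Python 2.4 grammar; ids are made up.
CHAIN = ["test", "and_test", "not_test", "comparison", "expr", "xor_expr",
         "and_expr", "shift_expr", "arith_expr", "term", "factor", "power",
         "atom"]
IDS = {name: 256 + i for i, name in enumerate(CHAIN)}
NAMES = {v: k for k, v in IDS.items()}

def any_test(node):
    """Wrap `node` in single-child parent nodes until it is a test node."""
    pos = CHAIN.index(NAMES[node[0]])
    for name in reversed(CHAIN[:pos]):
        node = [IDS[name], node]
    return node

wrapped = any_test([IDS["atom"], [2, "42"]])
print(wrapped[0] == IDS["test"])
```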

Example:

from EasyExtend.csttools import *               # wrapper functions are defined in csttools
import parser

n_42 = atom(Number(42))
n_2  = any_node(2, py_symbol.factor)            # we use the alias py_symbol instead
                                                # of symbol within cst and csttools
input = eval_input( testlist( any_test( power(n_42, n_2) ) ) )
eval(parser.sequence2st(input).compile())       # -> 1764


2.3.3  The cstgen module

It is hard to compress the previous expression sequence further in terms of node definitions. However, it is possible using the cstgen module. The cstgen module is the moral equivalent of an AST builder, but the expressions returned from cstgen are again CSTs.

from EasyExtend.csttools import *               # wrapper functions are defined in csttools
from EasyExtend.cstgen import *                 # CST_xxx builders are defined in cstgen
import parser

input = CST_Eval( CST_Power(42, 2) )
eval(parser.sequence2st(input).compile())       # -> 1764


The set of cstgen functions is still incomplete. We see here how they can be used to hide details of CST composition by node adaption. The logical next step would be some fiber that drops the functional syntax and uses symbols to express relationships between CST nodes:

    (e_(42) ** 2).cst                  -> CST_Power(42, 2)
    (e_(42) ** (e_(2) + e_("a"))).cst  -> CST_Power(42, CST_Add(2, Name("a")))
    e_("F")(42, 2).cst                 -> CST_CallFunc("F", [Number(42), Number(2)])
    ...


This is the upper limit of adaption towards normal Python syntax using CST node synthesis only. The macro fiber works analytically, using real but slightly enhanced Python source code, analysis and node variable substitution.
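The operator syntax sketched above can be modelled in a few lines. CST_Power, CST_Add and CST_CallFunc are represented here as plain tagged tuples rather than real CSTs; the wrapper class is illustrative, not EasyExtend's e_ implementation:

```python
# Operators on a small wrapper build CST_* style structures instead of values.
class e_:
    def __init__(self, cst):
        self.cst = cst
    def __pow__(self, other):
        return e_(("CST_Power", self.cst, _lift(other)))
    def __add__(self, other):
        return e_(("CST_Add", self.cst, _lift(other)))
    def __call__(self, *args):
        return e_(("CST_CallFunc", self.cst, [_lift(a) for a in args]))

def _lift(x):
    # unwrap e_ instances, pass plain values through
    return x.cst if isinstance(x, e_) else x

print((e_(42) ** 2).cst)
print((e_(42) ** (e_(2) + e_("a"))).cst)
```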

2.3.4  CST Analysis

While CST synthesis can be understood in many different ways and spans different levels, CST analysis is relatively straightforward.

find_node( tree, node_id, level=10000, exclude=[])
    Searches for a node with id = node_id in tree. It returns the first node found in a depth-first search, or None. The maximal depth of the searched tree is constrained by level, which has a high value by default. If level is set to 0 only the direct subnodes of tree will be inspected. The optional exclude parameter is a list containing nodes excluded from the search. When e.g. exclude = [ symbol.funcdef ] then find_node does not search for node_id in subnodes of funcdef.

find_all( tree, node_id, level=10000, exclude=[])
    Returns a generator that searches for and yields every node with id = node_id in tree. The search mode is depth-first. The level and exclude parameters behave as in find_node.

Variants of find_node and find_all are the functions find_one_of and find_all_of:

find_one_of( tree, node_ids=[], level=10000, exclude=[])
    Like find_node but accepts several node ids as search parameters, passed in the list node_ids.

find_all_of( tree, node_ids=[], level=10000, exclude=[])
    Like find_all but accepts several node ids as search parameters, passed in the list node_ids.
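A sketch of the find_node/find_all semantics over nested-list CSTs, following the signatures documented above (illustrative code, not EasyExtend's implementation):

```python
# Depth-first search over nested-list CSTs with level bound and exclude list.
def find_all(tree, node_id, level=10000, exclude=()):
    for sub in tree[1:]:
        if isinstance(sub, list):
            if sub[0] == node_id:
                yield sub
            elif level > 0 and sub[0] not in exclude:
                for hit in find_all(sub, node_id, level - 1, exclude):
                    yield hit

def find_node(tree, node_id, level=10000, exclude=()):
    return next(find_all(tree, node_id, level, exclude), None)

tree = [300, [301, [2, 'a']], [302, [301, [2, 'b']]]]
print(find_node(tree, 2))                   # first hit, depth-first
print([n[1] for n in find_all(tree, 2)])    # all hits
```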
 

2.4  Tokenization

In section 2.2 we described how new grammar rules are inserted into the Grammar file of a fiber; this way we can define new nonterminals. The tokenizer used by EasyExtend is an extended version of the tokenizer found in tokenize.py of the standard library. It is split into two modules, eetokpattern.py and eetokenizer.py. The token.py module of Python's stdlib is enhanced to eetoken.py.

2.4.1  Token definitions

When you activate the -t option and run a fiber on the console, the token stream for each input will be shown:

C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python fiber.py -t
___________________________________________________________________________________

 Gallery

 Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310 32 bit (Intel)]
 Fiber documentation: www.fiber-space.de/EasyExtend/doc/gallery/gallery.html
___________________________________________________________________________________

gal> def f():
....     pass
[token>
----------------------------------------------------------------------------------------.
 Line  | Columns | Token Value | Token Name    | Token Id    | Token class | Class Id   |
-------+---------+-------------+---------------+-------------+-------------+------------+
 1     | 0-3     | 'def'       | NAME          | 3585 -- 1   | NAME        | 3585 -- 1  |
 1     | 4-5     | 'f'         | NAME          | 3585 -- 1   | NAME        | 3585 -- 1  |
 1     | 5-6     | '('         | LPAR          | 3591 -- 7   | OP          | 3635 -- 51 |
 1     | 6-7     | ')'         | RPAR          | 3592 -- 8   | OP          | 3635 -- 51 |
 1     | 7-8     | ':'         | COLON         | 3595 -- 11  | OP          | 3635 -- 51 |
 1     | 8-9     | \n          | NEWLINE       | 3588 -- 4   | NEWLINE     | 3588 -- 4  |
 2     | 0-4     | '    '      | INDENT        | 3589 -- 5   | INDENT      | 3589 -- 5  |
 2     | 4-8     | 'pass'      | NAME          | 3585 -- 1   | NAME        | 3585 -- 1  |
 2     | 8-9     | \n          | NEWLINE       | 3588 -- 4   | NEWLINE     | 3588 -- 4  |
 3     | 0-0     | ''          | DEDENT        | 3590 -- 6   | DEDENT      | 3590 -- 6  |
 3     | 0-0     | ''          | ENDMARKER     | 3584 -- 0   | ENDMARKER   | 3584 -- 0  |
----------------------------------------------------------------------------------------'
<token]

....
gal>


Each non-alphanumeric character is expressed by a token with its own characteristic token id. Whitespace is dropped unless it indicates indentation; however, it can easily be reconstructed from the column information.
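The same behaviour can be observed with the stdlib tokenize module, which EasyExtend's tokenizer extends: every operator character becomes a token of its own, and the dropped whitespace is recoverable from the column positions:

```python
import io
import tokenize

src = "def f(): pass\n"
toks = list(tokenize.generate_tokens(io.StringIO(src).readline))
for t in toks:
    # (line, col) spans plus token type and value, as in the table above
    print(t.start, t.end, tokenize.tok_name[t.type], repr(t.string))
```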


An excerpt from the token pattern definitions in eetokpattern.py:

...

Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
                 r"//=?",
                 r"[+\-*/%&|^=<>]=?",
                 r"~")

Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
Funny = group(Operator, Bracket, Special)

PlainToken = group(Number, Funny, String, Name)

...





                 

3. Fibers + Fiberspace

                                                                                                             

3.1  Fiber transformers

We have described how to define new syntax and how to analyze and synthesize CSTs. Now we consider the semantic and transformational aspects of EasyExtend.

3.2  The fiberspace

We have already called an extension language in EasyExtend a fiber. A fiber is a Python package consisting of a minimal collection of modules. In the previous section we learned about the relationship between Grammar and PyGrammar.py and how to use the basic tools for CST synthesis to create new syntax trees.

  +- [any-fiber]   
      +- __init__.py
      +- conf.py
      +- fiber.py
      +- PyGrammar.py
      +- Grammar
      +- [fibermod]
          +- __init__.py
          +- symbol.py
          +- token.py

Next we will consider the conf.py and symbol.py modules in more detail.


3.2.1  Fiber offsets

When you open the symbol.py module of an arbitrary fiber like Gallery you will notice a copy of Python's symbol.py module, just with different node ids:

#  This file is automatically generated; please don't muck it up!

#--begin constants--

single_input = 3840
file_input = 3841
eval_input = 3842
decorator = 3843
decorators = 3844
...
...
testlist1 = 3916
encoding_decl = 3917
on_stmt = 3918
repeat_stmt = 3919
switch_stmt = 3920
case_stmt = 3921

#--end constants--


These node ids are not chosen by chance. When you open conf.py of the same fiber, the file starts with the following text:

########################################################################
#
#                 Language configuration
#
########################################################################

# --- unique fiber offset ---
# Warning! do not touch this. It will be automatically generated by fs.py

FIBER_OFFSET = 3584

# modules and functions

import os
import sys
...

When you add 256 to 3584 you get 3840, which is exactly the node id of the first node single_input of symbol.py. Moreover, FIBER_OFFSET is a multiple of 512: 3584 = 7*512.

All these data are generated automatically and you don't have to care about them much, unless you debug fiber transformers. Each fiber is characterized by a unique FIBER_OFFSET, which is a multiple of 512 and used to identify its nodes in an arbitrary CST. The unique segment of each fiber contains 512 possible node ids. The lower half is reserved for token, the upper half for symbols.

 
    X in [FIBER_OFFSET, FIBER_OFFSET+255]      =>  X is a token
    X in [FIBER_OFFSET+256, FIBER_OFFSET+511]  =>  X is a symbol

Currently an average fiber uses ~60 tokens and ~80 symbols, so there is still much space left.
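The segment arithmetic, spelled out as a short sketch:

```python
# Each fiber owns a 512-wide segment of node ids; the lower half holds
# token ids, the upper half symbol ids.
FIBER_OFFSET = 7 * 512              # 3584, the Gallery fiber's offset

def kind(node_id):
    base = node_id % 512
    return "token" if base < 256 else "symbol"

print(FIBER_OFFSET)                 # 3584
print(kind(3585), kind(3840))       # NAME is a token, single_input a symbol
```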

3.1.2  Projection

Among the numbers that matter here is also MAX_PY_SYMBOL, defined in cst.py:

    MAX_PY_SYMBOL = 333     # maximal node_id of Python 2.4 symbols.

This number is the maximal node id found in Python's symbol.py and is release specific. For example, Python 2.5 defines some new grammar rules and its MAX_PY_SYMBOL is 339. MAX_PY_SYMBOL is so important because it is related to node id projection. The node ids of a fiber are meaningless to the Python compiler; the compiler does not know how to deal with a node id 3840. Before we compile a fiber CST we have to project it onto a Python CST. This projection is very simple: we just keep the remainder of a node id after division by 512.

                  projection
    CST_Fiber --------------> CST_Python

Projection can also be used to indicate transformation errors: if a projected CST contains node ids exceeding MAX_PY_SYMBOL, the transformation is not yet complete.
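Projection, spelled out as a sketch over nested-list CSTs:

```python
# Taking each node id modulo 512 recovers the Python node id; applied
# recursively this maps a fiber CST onto a Python CST.
def project(node):
    projected = [node[0] % 512]
    for sub in node[1:]:
        projected.append(project(sub) if isinstance(sub, list) else sub)
    return projected

# Gallery's single_input (3840) and NAME (3585) project to Python's 256 and 1:
print(project([3840, [3585, 'x']]))
```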

3.2.3  The idea of a fiberspace

When each fiber defines its own distinctive set of node ids, transformations can be assigned uniquely to these nodes. When trying to mix different fibers within the same transformation context, we need a central transformer that holds all the particular fiber transformers and dispatches nodes to the specified transformations of these FiberTransformer instances. This central transformer is called FiberspaceTransformer, or just FSTransformer.

The eetransformer.py module creates a single instance of FSTransformer called fs_transformer. The run() method of any FiberTransformer simply delegates the call to fs_transformer. There is also some registration and scheduling of FiberTransformers involved but the essential idea is that the fs_transformer coordinates each node transformation.

3.2.4  Ranging

When a module gets parsed into a CST, the domain of the node ids is always completely determined by the main fiber Fmain. Now let another fiber F' enter the fiberspace. How can a node be dispatched to a node transformer of F' when F' defines a different set of node ids? The answer is called ranging. For each node id n we define the base n0 = n % 512; the range [n0, n0 + FIBER_OFFSET_F1, n0 + FIBER_OFFSET_F2, ...] defines all admissible node ids for which a transformer may be defined. For two node ids n1 and n2 we write n1 ~ n2 when they are in the same range.

If a node id n is matched exactly, the corresponding node transformer will be selected when available; otherwise ranging will be applied to select the first transformer available for a node in the range. Ranging is also important on another occasion, when some CST fragment has to be inserted into an existing CST. In this case all node ids in its range are considered equivalent, and if n1 is the root node of the CST fragment it may replace any other fragment with root n2 where n1 ~ n2.

Caution: the above transformer selection does not deal with competition among transformers applying to the same node id; it is a simple greedy algorithm. A more advanced method to deal with competition still has to be investigated.
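The ranging relation itself is a one-liner:

```python
# Two node ids are equivalent (n1 ~ n2) when they share the same base
# modulo 512, i.e. they denote the same rule in different fibers.
def same_range(n1, n2):
    return n1 % 512 == n2 % 512

print(same_range(256, 3840))    # Python's and Gallery's single_input
print(same_range(256, 257))
```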

3.3 The framework

If you consider a framework as an "application with holes", the holes of the EE framework are the particular fibers that contain the language definitions. While the EasyExtend package is a class and function library used by any fiber, the complete fiber is passed back to EasyExtend as a module! So the fiber module serves as the language interface, which has to satisfy a few conventions. A minimal fiber looks like this:

# minimal fiber that provides EE framework interface
import EasyExtend   # EasyExtend package
import conf         # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar    # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''


Since we have to pass the fiber to EE we need to import it somehow. This can be done in a separate main.py, for instance, or we import the fiber within the fiber itself and pass it on:

# minimal fiber that provides EE framework interface
import EasyExtend   # EasyExtend package
import conf         # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar    # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''

if __name__ == '__main__':
    import sys
    import minimal_fiber              # import this fiber
    if len(sys.argv) > 1:
        py_module = sys.argv[1]
        EasyExtend.run_module(py_module, minimal_fiber)
    else:
        console = EasyExtend.create_console("MinimalFiber", minimal_fiber)
        console.interact()

As long as no deviant Grammar is defined, the behaviour of the minimal fiber is the same as that of the Python interpreter.
  

3.2.1  eetransformer

The Transformer module provides mappings between the CST of a fiber F and a Python CST. For this purpose the F-CST is traversed and the F-specific nodes are replaced. Each node that shall be visited is made public using a handler method like this one defined in the Gallery fiber:
handle_repeat_stmt(node)
Transforms the content of node, where node is the repeat_stmt node of the fiber CST. According to the transformation rule defined for Gallery, the repeat statement has to be replaced by a while_stmt node of a Python CST.
An essential part, and the main difficulty, of the transformer algorithm is to move the returned nodes to the right place in the CST. This aspect is not solved analytically for every kind of node, but a stable heuristic turned out to be applicable.
Depending on the node type that shall be replaced, return either:
  • A node of type test which represents the most general form of a Python expression in the CST.
  • One or more nodes of type stmt that represent arbitrary Python statements in the CST.
For convenience the surgery module offers the methods any_test and any_stmt that wrap expressions or statements of any kind so that test and stmt nodes result:

any_test(node)
Returns a correctly built test node whenever the input is one of the following nodes: test, and_test, lambdef, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom. Additionally, each node that is part of the atom grammar rule at the bottom of the chain above, such as Name, Number etc., fits as well.
any_stmt(node)
Returns a correctly built stmt node whenever the input is one of the following nodes: stmt, simple_stmt, compound_stmt, small_stmt, if_stmt, for_stmt, while_stmt, try_stmt, break_stmt, continue_stmt, return_stmt, raise_stmt, yield_stmt, expr_stmt, print_stmt, del_stmt, pass_stmt, flow_stmt, import_stmt, global_stmt, exec_stmt, assert_stmt.
Example: in Gallery the repeat loop is transformed as follows:

class FastTransformer(Transformer):
    def handle_repeat_stmt(self, node):
        "repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT |
                                                      test NEWLINE )"
        _suite = find_node(node, symbol.suite)
        _test  = find_node(node, symbol.test, level=0)
        _until = if_stmt(_test, suite(any_stmt(break_stmt())))
        _suite.insert(-1, any_stmt(_until))
        return any_stmt(while_stmt(any_test(Number(1)),_suite))
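The CST format such handlers operate on can be pictured as nested lists in the style of Python's parser module: [node_id, child1, child2, ...] with terminals as [token_id, "text"]. The following find_node is a simplified stand-in for the surgery helper of the same name, not its actual implementation; the node ids in the demo tree are made up:

```python
# Simplified stand-in for surgery.find_node: depth-first search for
# the first subtree with a given root id in a nested-list CST.

def find_node(cst, node_id, level=-1):
    """Return the first subtree whose root is node_id.
    level limits the search depth; level=0 inspects only the
    immediate children (mirroring the handler above)."""
    for child in cst[1:]:
        if not isinstance(child, list):
            continue
        if child[0] == node_id:
            return child
        if level != 0:
            found = find_node(child, node_id, level - 1)
            if found:
                return found
    return None

# toy CST: a statement node with two expression children
tree = [400, [310, [1, "x"]], [311, [1, "y"]]]
print(find_node(tree, 1))           # deep search: [1, 'x']
print(find_node(tree, 1, level=0))  # direct children only: None
```

This also illustrates why the handler above passes level=0 when it wants the test node of the until clause rather than one buried inside the suite.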


3.2.1.1  Being __like_main__

With EasyExtend it is the combination python main.py that represents a specialized interpreter, not python alone. But running e.g.
python main.py gallery_module.py has the drawback that the convenient block statement of any gallery_module.py

if __name__ == '__main__':
      ...


if present in gallery_module.py, will not be executed, since not gallery_module.py but main.py is the __main__ module! The EE solution is a generic transformation of the block statement into a function called __like_main__. This function is not meant to be defined explicitly by the user ( please don't define it. Otherwise I select a weirder name ) but is created by the eecompiler during parse-tree transformation:


if __name__ == '__main__':
    BLOCK  

 ==> 
def __like_main__():
    BLOCK

if __name__ == '__main__':
    __like_main__()

Running the compiled module gallery_module.pyc from python causes execution of __like_main__() because gallery_module.py is now the __main__ module. If, on the other hand, gallery_module.py is called from python main.py, the __like_main__ function is called by main.py:


# main.py

...

if __name__ == '__main__':
    mod_name = sys.argv[1]
    mod = create_module(mod_name)
    try:
        mod.__like_main__()
    except AttributeError:
        pass
      



3.3  The eeimporter

The eeimporter module and the Importer class defined in this module play an important role in defining the boundaries of the fiber. Let m = m(F) be a module of the fiber F; then we need to know how m has to import dependent modules. There are some basic alternatives:
  1. All files belong to a fiber F. This is basically what Python asserts.
  2. Define a FIBERPATH environment variable. Each module in the specified or subsequent directories will be imported as a module of fiber F.
  3. Define a regular expression to identify naming characteristics. That's how the coverage fiber defines dependencies between a module test_A.py and a module A.py.
  4. Use a general criterion. Gallery assumes that Fibers\gallery is the root directory where gallery related modules are defined. Some of those modules are excluded from import.
  5. Others to be defined.
An instance of an Importer class or a derived class is added to sys.meta_path to create an "import hook". This is a bit of an undocumented Python feature which I'm not going to explain here. What I know about the behaviour of those import hooks I derived from experimentation and from reading the test code of test_importhooks.py.
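A minimal sys.meta_path hook in the spirit of eeimporter can be sketched like this. Note that the sketch is written against the modern importlib API; the Python 2.4-era code in EasyExtend uses the older PEP 302 find_module protocol. The 'demo_' prefix rule and the module body are made up for the demonstration:

```python
# Sketch: a meta_path importer that accepts only modules matching a
# fiber-specific rule and synthesizes their content itself.
import sys
import importlib.abc
import importlib.util

class FiberImporter(importlib.abc.MetaPathFinder, importlib.abc.Loader):
    def accept_module(self, fullname):
        # fiber-specific acceptance rule (illustrative)
        return fullname.startswith("demo_")

    def find_spec(self, fullname, path, target=None):
        if self.accept_module(fullname):
            return importlib.util.spec_from_loader(fullname, self)
        return None  # let the regular import machinery handle it

    def create_module(self, spec):
        return None  # use default module creation

    def exec_module(self, module):
        # a real fiber would compile the transformed source here
        module.answer = 42

finder = FiberImporter()
sys.meta_path.insert(0, finder)
import demo_fiber_module   # resolved by our hook, not the filesystem
print(demo_fiber_module.answer)  # 42
sys.meta_path.remove(finder)
```

The hook only ever sees names that are not yet in sys.modules, which is exactly the caveat stated in the Note below this paragraph.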

Note: If you try to import a module that was already imported at another place Python takes the module from sys.modules. The import hook won't work in this case.

A number of modules are already imported by the time the coverage fiber Importer registers itself at sys.meta_path. Ironically, these include all of the framework modules, so the quality of the EE unit tests themselves cannot be checked with coverage - at least not directly.

The Importer module provides two methods that are relevant if you want to define your own custom imports:
register_importer()
Registers this importer at sys.meta_path.
accept_module(fs_path)
Defines fiber specific acceptance rules for the module at fs_path, where fs_path is the file-system path of the module.
The accept_module method replaces find_module as the method that shall be overwritten in Importer subclasses.

2.2.3  The conf module

The conf module is used to define some basic fiber related settings. Currently only the prompt attribute and the ext attribute are supported, where prompt refers to the sys.ps1 prompt shown in the interactive console and ext refers to the bytecode file extension.

>>> import gallery
>>> gallery.conf.prompt
'gal> '



2.2.4  fibermod

Fibers that define their own grammar rules / tokens need specialized versions of Python's stdlib modules symbol.py and token.py. The symbol.py module is newly generated whenever a Grammar file changes. The token.py and tokenize.py modules have to be adapted by hand when new punctuation is added. I used to define a fibermod package in the root package of the fiber that contains copies of those modules. The modules are imported in the fiber's __init__.py and substitute the standard imports.

2.4  Fiber compiled modules

We have seen how to create a fiber and how to execute modules with specific language definitions in the fiber's context. But although we needed the fiber to compile those modules, they still compile to ordinary Python bytecode, and we should be able to mix the compiled modules freely. Hence both of these execution calls should work in principle:

                 (A)  python fiber.py gallery_module.py

                 (B)  python gallery_module.pyc

In case (A) the whole EE framework and the fiber module can be used all along. Functions can be defined in the fiber and function calls can be generated into the module. But those functions/objects don't reside in the module but somewhere else. Since they are language specific they should be accessible by all modules of the language.
For this purpose the EE framework offers the __publish__ fiber attribute.
__publish__ : list
The fiber attribute __publish__ is a list of names. It is read at startup, and for each name in __publish__ the following assignment is made: __builtin__.__dict__[name] = getattr(fiber, name).
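The publishing step amounts to a small loop over the builtin namespace. The fiber stand-in below is made up for illustration, and Python 3 spells the builtin module 'builtins' where the document's Python 2 code uses '__builtin__':

```python
# Sketch of the __publish__ protocol: names listed in the fiber's
# __publish__ attribute are copied into the builtin namespace so
# every module of the language can use them without an import.
import builtins  # '__builtin__' in Python 2

class fiber:  # hypothetical fiber module stand-in
    __publish__ = ["fiber_helper"]

    @staticmethod
    def fiber_helper():
        return "provided by the fiber"

for name in fiber.__publish__:
    builtins.__dict__[name] = getattr(fiber, name)

print(fiber_helper())  # now visible without qualification
```

This is exactly why case (B) needs the import __fiber__ declaration discussed next: something has to trigger this loop when the fiber startup code is not involved.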
Now in case (A) the execution of the __publish__ protocol can simply be triggered by the startup functions involving the fiber. The defined objects are immediately accessible in gallery_module.py. But in (B) the names are not present and execution will fail. Thus we would have to import the fiber explicitly in the module. In EE we can abstract from the import of the concrete fiber and use the statement

       import __fiber__

instead. This import statement is more a declaration than an actual execution. It will be transformed by EE according to:


import __fiber__  ==> 
import <fiber_name>.<fiber_mod> as __fiber__

The <fiber_name> and <fiber_mod> placeholders are replaced by the actual fiber values. The new module-valued __fiber__ variable can be used to reflect on the fiber. Using import __fiber__ in place of a more concrete import statement has another advantage. Suppose a new fiber F2 extends a given fiber F1 and we want to use names defined in F2 in a module M that was compiled with F1. Then we simply have to recompile M using F2 without touching the fiber import.

Note: The import __fiber__ declaration shall be placed in the module namespace. Don't use it within a class or function definition.

You might check this out by importing e.g. the test_gallery module after being compiled to bytecode.

2.5  Suffixes

There is currently no support for alternative file extensions. This limitation is due to the fact that file suffixes like "py" or "pyc" are hardcoded within the CPython runtime.

3. The console is your friend

The eeconsole module of the EE package implements a general interactive console that can be adapted to particular fibers. The eeconsole is a somewhat redesigned as well as enhanced version of Python's InteractiveConsole object. If you type a fiber command at the console prompt it has to be unparsed to Python source. The result may be one or more Python commands. So eeconsole essentially realizes an interplay between eetransformer, which creates Python CSTs from fiber definitions, and cst2source, which restores Python source code from CSTs.
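The cst2source step can be pictured with a toy unparser over the nested-list CST convention. The whitespace handling and the node ids below are simplifying assumptions; real unparsing also has to restore indentation and token-accurate spacing:

```python
# Naive sketch of CST unparsing in the spirit of cst2source:
# walk the nested-list tree [node_id, child, ...] and join the
# terminal strings [token_id, "text"].

def unparse(cst):
    if isinstance(cst[1], str):  # terminal node
        return cst[1]
    return " ".join(unparse(child) for child in cst[1:])

# toy CST for the expression  x + 1  (made-up node ids)
expr = [300, [1, "x"], [14, "+"], [2, "1"]]
print(unparse(expr))  # x + 1
```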
   

3.1  Launching a fiber specific console


You can define a prompt variable in the conf.py module. This variable is accessed by the eeconsole and its value replaces
the standard prompt sys.ps1. The second prompt sys.ps2 is rendered as a dotted line whose length corresponds
to that of sys.ps1. Starting the gallery console under Windows may look like:

c:\Python24\EE\Fibers\gallery>python fiber.py
(GenericConsole)
gal> x = 0
gal> repeat:
....    x+=1
....    print x
.... until: x==3
....
1
2
3
gal>

3.2  Debugging

Debugging using the pdb module should work as usual, but don't forget to pass globals() explicitly into pdb.run().

Example: Using module gallery.test.funcs and function

def f_repeat_until(x):
    repeat:
        x-=1
    until:
        x<7
    return x

we get the following session trace on the Gallery console:

gal> import gallery.test.funcs as funcs
gal> import pdb
gal> pdb.run("funcs.f_repeat_until(7.5)",globals())
> <string>(1)?()
(Pdb) s
--Call--
> c:\python24\lang\gallery\test\funcs.py(6)f_repeat_until()
->
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(8)f_repeat_until()
-> repeat:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(9)f_repeat_until()
-> x-=1
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(10)f_repeat_until()
-> until:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(12)f_repeat_until()
-> return x
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()
(Pdb) n
--Return--
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()->6.5
(Pdb) n
--Return--
> <string>(1)?()->None
(Pdb) n
gal>


4. Discussion

4.1 Caveats

EasyExtend is hooked into an existing software architecture, namely that of the CPython interpreter suite. From this point of view it may be more surprising that the integration is quite smooth than that it also has a few pitfalls. Still, there are some you should be aware of.
  • File suffixes. It can be annoying to name DSL files *.py just to satisfy the import machinery, which uses some hard-coded suffixes. Besides being confusing, this can also cause conflicts with tools like distutils that have greedy compiling policies.
  • lnotab. Some code transformations confuse line recovery. The creation of the line-number table ( lnotab ) is based on constraints that can occasionally be violated ( monotonicity violation ). This problem affects any kind of code generator for CPython, not only that of EasyExtend. As a consequence line number information is not reliable, and sometimes the runtime reports completely bogus line number references for an exception. The deeper cause is that the ( Python 2.4 ) compiler uses a mixture of line number information provided by the parser in the parse tree and its own reconstruction rules that apply while compiling to bytecode. All this is fed into a very clever and dense encoding scheme for lnotab that assigns line numbers to bytecodes. This problem is rather serious and the only known workaround is to unparse the transformed CST into Python source code and debug that.

4.2 Prospectives

EasyExtend is still in its infancy. One can extend the system in every direction, in breadth as well as in depth. One might
  • target EasyExtend at other Python implementations, in particular IronPython and PyPy.
  • use a more powerful LALR parser generator like PLY to replace or supplement the current LL(1) parser.
  • develop a fiber fusion technology and migrate fibers automatically to more recent versions of Python.
  • create the basic cst.py module directly from Python's Grammar file.
  • create more, and more interesting, fibers than are available in the current collection of samples.
  • consider other target languages than Python.