Introduction
This document introduces EasyExtend, a metaprogramming system for
Python. It sketches the main lines of thought about EasyExtend and its
design ideas.
What is EasyExtend?
- EasyExtend (EE) is a preprocessor generator and metaprogramming framework written in pure Python and integrated with Python. The main purpose of EasyExtend is the creation of extension languages, i.e. adding custom syntax and semantics to Python. EasyExtend comes out of the box as a single package with no dependencies other than those to Python's standard library.
- EasyExtend respects separation of concerns. This virtue is in fact enforced by EE's architecture. You can't use EasyExtend to define syntax/code transformations from within usual application code, although this might be possible by creating a special extension language that permits this kind of transformation. Programming programming languages is a distinct activity.
- EasyExtend is grammar based like Lex/Yacc and ANTLR. This makes EasyExtend quite heavyweight but also very flexible. In principle you can define arbitrary languages using EasyExtend, not just Python extensions. Extension languages created using EasyExtend shall be considered as external domain specific languages (DSLs), in contrast to embedded DSLs as in Lisp or Ruby.
- If you are more conservatively minded and believe that non-standard extensions of the Python language are evil, but static analysis, refactoring tools, code coverage etc. are useful, you might consider code analysis + generation using EasyExtend.
- If you are more experimentally minded or even visionary and think about unifying existing languages and paradigms within one or a small set of languages that play well together, you might be interested in the concepts of a fiberspace, of languages as components and of source level integration.

Source tree overview and installation advice.

About tokenizers, parsers and syntax tree
constructors.

Extension languages - how to create and transform them. The fiberspace architecture allows combining them.

Caveats and prospects.
1. Getting Started
1.1 Installation
EasyExtend is a Python package distributed using distutils. It will be installed in the site-packages directory of your Python distribution using a distutils aware setup.py script. The EasyExtend package can be downloaded here. The documentation as well as the tests are distributed separately.
Installing EasyExtend requires typing

    python setup.py install

at your Unix shell or DOS box prompt. Under Windows you might use the Windows installer executable instead.
Documentation and tests are packed together in a separate zip or tar.gz file.
[DocsAndTests 2.0-alpha1]
  +--[doc]
  +--[EasyExtend]
The docs are a somewhat stripped down offline version of this site.
Copy the tests directly into the site-packages directory over the just
installed EasyExtend package.
After installation you should see roughly following source tree:
[site-packages]
  +- [EasyExtend]
      +- [fibers]
          +- [coverage]
          +- [gallery]
              +- __init__.py
              +- conf.py
              +- fiber.py
              +- PyGrammar.py
              +- Grammar
              +- [fibermod]
                  +- __init__.py
                  +- symbol.py
                  +- token.py
          +- [macro]
          +- [Py25Lite]
          +- [teuton]
          +- [transit]
      +- [parser]
      +- [util]
      +- fs
      +- __init__.py
      +- cst.py
      ....
1.2 First run
Switch to the directory \fibers\gallery, which lies beneath EasyExtend ( see file-tree above ), and type python fiber.py in your console.
C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python
fiber.py
___________________________________________________________________________________
Gallery
Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310
32 bit (Intel)]
Fiber documentation:
www.fiber-space.de/EasyExtend/doc/gallery/gallery.html
___________________________________________________________________________________
gal> a = 1
gal> switch a:          # first example - the switch statement
....     case 0:        # for more information visit Gallery
....         print "nok"
....     case 1:
....         print "ok"
....
ok
gal>
Each fiber defines its own language encoded mainly in the fiber.py
module of a fiber package. The fiber.py
module is also the entry point
for calling extension language modules.
C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python
fiber.py test\test_gallery.py
test_chainlet (test.test_gallery.TestGallery) ... ok
test_importer (test.test_gallery.TestGallery) ... ok
test_mixed (test.test_gallery.TestGallery) ... ok
test_on_stmt_1 (test.test_gallery.TestGallery) ... ok
test_on_stmt_2 (test.test_gallery.TestGallery) ... ok
test_repeat_until_1 (test.test_gallery.TestGallery) ... ok
test_repeat_until_2 (test.test_gallery.TestGallery) ... ok
----------------------------------------------------------------------
Ran 7 tests in 0.000s
OK
1.3 Learning EasyExtend
The best way to start doing something useful with EasyExtend might be reading the EasyExtend tutorial and working through the presented example. The following documentation partially overlaps with the content of the tutorial. This is due to the fact that most documents you can read here are actually design documents written in an introductory style.
2. Grammars and CSTs
2.1 Concrete vs abstract syntax
EasyExtend is grammar based. This means that any proper language extension requires a modification of Python's Grammar file. Each fiber has its own copy of Python's Grammar, and each modification of it causes a re-generation of the fiber's PyGrammar.py module, which contains the parser tables according to the grammar.
When you parse an expression in Python 2.4 using the parser module of the std-lib and convert the parser.st object output into a list, you usually get long chains of nested lists, each one starting with a number representing a node.
>>> import parser
>>> parser.expr("1+2")
<parser.st object at 0x0099D0F0>
>>> parser.expr("1+2").tolist()
[258, [320, [298, [299, [300, [301, [303, [304, [305, [306, [307, [308,
[309, [310, [311, [2, '1']]]]], [14, '+'], [308, [309, [310, [311, [2,
'2']]]]]]]]]]]]]]], [4, ''], [0, '']]
A number of this kind is called a node id within this document. Node ids are the types of nodes used in the parse tree. For Python these node ids can be found in the symbol.py and token.py modules, which are automatically generated. They are fixed for a particular version of Python and must never be changed since the compiler relies on them. A parse tree created by the parser module is called a concrete syntax tree or CST within this document. Unlike an abstract syntax tree (AST), a CST directly reflects the structure of the underlying grammar and also the syntax of an expression. A CST is equivalent to concrete syntax and you can reconstruct the source code almost faithfully from it ( line comments are skipped, and whitespace that is not relevant for indentation is ignored ).
Example:

    class A:pass

    class A:
        pass
These two statements have the same semantics but different concrete
syntax. The difference between both will be present in the CST but not
in the AST description.
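The reconstruction claim can be illustrated by a small sketch. Assuming the nested-list CST representation shown in the parser example above, source text is just the concatenation of the terminal token strings; the helper name to_source is ours and not part of EasyExtend:

```python
def to_source(node):
    # A CST node is a list [node_id, child, ...]; terminal nodes carry
    # their token string as second element. Concatenating all terminal
    # strings in order recovers the source (minus comments / layout).
    if isinstance(node[1], str):
        return node[1]
    return "".join(to_source(child) for child in node[1:])

# a flattened CST of "1+2" with the long inner chain shortened
tree = [258, [320, [2, '1'], [14, '+'], [2, '2']], [4, ''], [0, '']]
```

Applied to tree above, to_source yields the original expression text.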
EasyExtend is based on the CST description. The main reason for this decision is the simple fact that abstract syntax requires an additional transformational step originating from a CST. But the whole point of EasyExtend is to define transformations based on a modified Grammar. So we would have to define two transformations for each fiber: one from a CST to an AST and a second from an AST to another AST that can finally be compiled. EasyExtend enables only CST -> CST transformations and hides most of the nitty-gritty details of CSTs from the developer using an operational approach. In the rare occasions where the difference between abstract and concrete syntax really matters, as in the example above, one might apply normalize(), which is defined in csttools.py. The normal form is always the most general form. In case of a suite it is a list of statements, not an expression within a single line.
2.2 Grammar extensions
Python's syntax is not completely defined by Grammar. Indeed some aspects are factored into the tokenization process. This has the advantage of keeping Python's Grammar very simple - it is indeed LL(1). Hence the strongest obstacle to parsing Python, namely its indentation rules, has only a residual presence in the Grammar as INDENT and DEDENT tokens, which can be considered moral equivalents of opening and closing braces. On the other hand this split leads to at least two files that need to be changed when both new nonterminals and new terminals of the grammar shall be introduced.
2.2.1 grammarObj generation
The EBNF grammar used by CPython and EasyExtend can be found in the file Grammar. Each fiber has its own version of this Grammar file. It contains a number of grammar rules for our LL(1) parser.
...
fplist: fpdef (',' fpdef)* [',']
stmt: simple_stmt | compound_stmt
simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE
small_stmt: expr_stmt | print_stmt | del_stmt | pass_stmt | flow_stmt | import_stmt | global_stmt | exec_stmt | assert_stmt
expr_stmt: testlist (augassign (yield_expr|testlist) | ('=' (yield_expr|testlist))*)
augassign: '+=' | '-=' | '*=' | '/=' | '%=' | '&=' | '|=' | '^=' | '<<=' | '>>=' | '**=' | '//='
# For normal assignments, additional restrictions enforced by the interpreter
print_stmt: 'print' ( [ test (',' test)* [','] ] | '>>' test [ (',' test)+ [','] ] )
del_stmt: 'del' exprlist
pass_stmt: 'pass'
flow_stmt: break_stmt | continue_stmt | return_stmt | raise_stmt | yield_stmt
break_stmt: 'break'
continue_stmt: 'continue'
return_stmt: 'return' [testlist]
...
These rules form a declarative syntax description language; a kind of domain specific language for language syntax. The Grammar reader is implemented in parser/PgenParser.py. The output of PgenParser.py serves as input for parser/PyPgen.py, which creates the parser table used by parser/DFAParser.py to parse source code. The parser table, a Python tuple data structure called grammarObj, is stored as plain text in the PyGrammar.py module corresponding to the Grammar. So we always have a mapping Grammar -> PyGrammar.py, and whenever it has been noticed that Grammar has changed, a new grammarObj is created and stored in PyGrammar.py.
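The staleness check behind this regeneration can be pictured with a hypothetical helper; EasyExtend's actual change detection may work differently, here we simply assume a timestamp comparison:

```python
import os

def pygrammar_stale(grammar_path, pygrammar_path):
    # PyGrammar.py must be (re)generated if it does not exist yet or
    # if the Grammar file was modified after PyGrammar.py was written.
    if not os.path.exists(pygrammar_path):
        return True
    return os.path.getmtime(grammar_path) > os.path.getmtime(pygrammar_path)
```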
2.2.2 Rule Sequence
The sequence of rules as they are placed in Grammar may not be changed. Whenever you want to add a new grammar rule you have to append it at the end of the file. You can also add content to an existing rule and delete existing clauses, but you may never insert into or delete a rule from the middle of Grammar.
The reason for this restriction is compiler compliance. The CPython compiler deals exclusively with the numerical token corresponding to a rule. This token can be considered as an index + offset into an array of rules. Changing the index has fatal consequences and causes a SyntaxError when detected during compilation.
single_input  =  256
file_input    =  257
eval_input    =  259
decorator     =  260
...
Sometimes however you want just a minimalist grammar. As an example
consider the grammar of Grammar:
# Grammars grammar
single_input: NEWLINE
file_input: ( RULE | NEWLINE )* ENDMARKER
RULE: NAME ':' RHS NEWLINE
RHS: ALT ( '|' ALT )*
ALT: ITEM+
ITEM: '[' RHS ']' | ATOM [ '*' | '+' ]
ATOM: '(' RHS ')' | NAME | STRING
This file describes the grammar of any Grammar file. Note that only rules from the end of the Grammar file are deleted. The single_input rule is a dummy, a placeholder used to preserve the minimal sequence.
single_input  =  256
file_input    =  257
As you might suspect, source code parsed with this particular Grammar is mapped directly to Python objects for further treatment, rather than being compiled to bytecode first.
2.2.3 Appending rules, inserting clauses
A grammar is just another declarative little language. The following example shows a modified Grammar script where four rules are appended at the end of the file.
# Grammar for Gallery
...
# A modified Python grammar rule
compound_stmt: if_stmt | while_stmt | for_stmt | try_stmt | funcdef | classdef | on_stmt | repeat_stmt | switch_stmt
...
testlist1: test (',' test)*
# not used in grammar, but may appear in "node" passed from Parser to Compiler
encoding_decl: NAME
######## ----- new grammar definitions ------- #########
on_stmt: 'on' NAME '=' test ':' suite ['else' ':' suite]
repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT | test NEWLINE)
switch_stmt: 'switch' expr ':' NEWLINE INDENT case_stmt DEDENT ['else' ':' suite]
case_stmt: 'case' expr ':' suite ('case' expr ':' suite)*
Each of these rules has to be connected to the rest of the grammar. This is done by inserting their identifiers as clauses in compound_stmt. You are free to add them anywhere where this makes sense. If you construct an ambiguous grammar the DFAParser will complain about it ( printing the message "XXX ambiguity!" ).
2.3 CST manipulation
Transformation of CSTs involves analysis and synthesis of CSTs. Analysis
requires just a small set of generic functions and knowledge about the
structure of a grammar rule which can be read directly from the Grammar
file. CST synthesis is the manual creation of valid CSTs from CST
fragments. While there is a canonical base to start with on synthesis,
namely functions associated with node types - for each node type there
is a function that creates the node accordingly - using this base
directly can be tedious and error prone. It is also hard to read. In
version 1.0 of EasyExtend this base was everything supplied for CST
synthesis and the result can still be examined in
EasyExtend/fibers/gallery/fiber.py. In the meantime we investigated
several approaches that shall be introduced in subsequent sections. But
let's do some groundwork first.
2.3.1 Elementary node constructors
Each node of the Python Grammar has a corresponding function in the module cst.py. For example the grammar rule

    simple_stmt: small_stmt (';' small_stmt)* [';'] NEWLINE

has a corresponding function

    def simple_stmt(*args):
        ....

which creates exactly nodes of the type simple_stmt. We call these functions elementary node constructors. The arguments passed into the simple_stmt node constructor are nodes of type small_stmt and only those. You are required not to pass semicolons or NEWLINE into simple_stmt since these are added automatically. As a general rule for any cst function the following holds:

    Whenever a terminal symbol can be derived from the context you must not pass it into a cst function.
One can easily see when to pass colons, commas, parens etc. Those rules have the following structure:

    rule: T1 NT1 ... | T2 NT1 ... | T3 NT1 ...

where Ti are terminals and NT1 is a non-terminal. You need to pass a terminal to select the concrete clause of the rule.
Example:

    atom: '(' [testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' ...

Here braces, brackets or parens must be supplied. The pairs '(' ')', '[' ']' and '{' '}' are already complete characterizations of the node because the enclosed node is optional.
Example of usage:

from EasyExtend.cst import *
n_42 = atom(Number(42))
n_2 = factor(power(atom(Number(2))))
simple_stmt(small_stmt(expr_stmt(test(...(arith_expr(term(factor(power(n_42, n_2)))))...))))
# o.k. stop here ...
As said above do not use elementary node constructors in your fiber
definition directly unless you can't avoid them - you can most of the
time.
2.3.1.1 Decorating CST functions
Each elementary CST function is decorated with py_cst. The reason for this is that node ids for a particular node are not unique! We have indeed a whole spectrum of possible node ids assigned to the same node - one for each fiber ( see Fibers + Fiberspace for more information ).
2.3.2 Node wrapper
The test node represents a kind of upper limit of expressions in Python. Each node subsequent to the test node is guaranteed to be a terminal or an expression. This implies that almost all nodes that use an expression also use a test node or a subsequent node. For wrapping a node into a test node the function any_test is defined in csttools.py:

any_test( node )
    Returns a test node whenever the input is one of the following nodes: test, and_test, lambdef, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom, Number, Name, String.
Similarly any_stmt is defined for wrapping statements:

any_stmt( node )
    Returns a correctly built stmt node whenever the input is one of the following nodes: stmt, simple_stmt, compound_stmt, small_stmt, if_stmt, for_stmt, while_stmt, try_stmt, break_stmt, continue_stmt, return_stmt, raise_stmt, yield_stmt, expr_stmt, print_stmt, del_stmt, pass_stmt, flow_stmt, import_stmt, global_stmt, exec_stmt, assert_stmt.
Finally we can combine both wrapper functions and provide an additional selector:

any_node( node, node_id )
    Combines any_test and any_stmt but returns a node of type node_id instead of test or stmt.
Nodes that fall into neither the category of test nor stmt can't be wrapped. In that case one has to use elementary node constructors.
Example:

from EasyExtend.csttools import *       # wrapper functions are defined in csttools
import parser
n_42  = atom(Number(42))
n_2   = any_node(2, py_symbol.factor)   # we use the alias py_symbol instead of
                                        # symbol within cst and csttools
input = eval_input(testlist(any_test(power(n_42, n_2))))
eval(parser.sequence2st(input).compile()) -> 1764
2.3.3 The cstgen module
It is hard to compress the previous expression sequence in terms of node definitions. However it is actually possible using the cstgen module. The cstgen module is the moral equivalent of an AST builder. But the expressions returned from cstgen are again CSTs.

from EasyExtend.csttools import *   # wrapper functions are defined in csttools
from EasyExtend.cstgen import *
import parser
input = CST_Eval(CST_Power(42, 2))
eval(parser.sequence2st(input).compile()) -> 1764
The set of cstgen functions is still incomplete. We see here how those can be used to hide details of CST composition by node adaption. The logical next step would be some fiber that drops the functional syntax and uses symbols to express relationships between CST nodes:

(e_(42) ** 2).cst                  -> CST_Power(42, 2)
(e_(42) ** (e_(2) + e_("a"))).cst  -> CST_Power(42, CST_Add(2, Name("a")))
e_("F")(42, 2).cst                 -> CST_CallFunc("F", [Number(42), Number(2)])
...

This is the upper limit of adaption towards normal Python syntax using CST node synthesis only. The macro fiber works analytically, using real but slightly enhanced Python source code together with analysis and node variable substitution.
2.3.4 CST Analysis
While CST synthesis can be understood in many different ways and spans different levels, CST analysis is relatively straightforward.

find_node( tree, node_id, level=10000, exclude=[] )
    Seeks a node with id == node_id in tree. It returns the first node found in a depth-first search, or None. The maximal depth of the searched tree is constrained by level, which has a high value by default. If level is set to 0 only the direct subnodes of tree will be inspected. The optional exclude parameter is a list containing nodes excluded from search. When e.g. exclude = [ symbol.funcdef ] then find_node does not seek node_id in subnodes of funcdef.
find_all( tree, node_id, level=10000, exclude=[] )
    Returns a generator that seeks and yields any node with id == node_id in tree. The search mode is depth-first. The maximal depth of the searched tree is constrained by level, which has a high value by default. If level is set to 0 only the direct subnodes of tree will be inspected. The optional exclude parameter is a list containing nodes excluded from search. When e.g. exclude = [ symbol.funcdef ] then find_all does not seek node_id in subnodes of funcdef.
Variants of find_node and find_all are the functions find_one_of and find_all_of.
find_one_of( tree, node_ids=[], level=10000, exclude=[] )
    Like find_node but accepts several node ids as search parameters, passed in the list node_ids.
find_all_of( tree, node_ids=[], level=10000, exclude=[] )
    Like find_all but accepts several node ids as search parameters, passed in the list node_ids.
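The behaviour of find_node described above can be sketched in a few lines, assuming the nested-list CST representation from section 2.1; the real implementation in csttools.py may differ in details:

```python
def find_node(tree, node_id, level=10000, exclude=()):
    # Depth-first search for the first subnode whose id matches node_id.
    # level bounds the recursion depth (level=0 inspects only the direct
    # subnodes); exclude lists node ids whose subtrees are skipped.
    if level < 0:
        return None
    for sub in tree[1:]:
        if isinstance(sub, list):
            if sub[0] == node_id:
                return sub
            if sub[0] not in exclude:
                found = find_node(sub, node_id, level - 1, exclude)
                if found is not None:
                    return found
    return None
```

find_all would be the generator variant of the same traversal, yielding every match instead of returning the first.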
2.4 Tokenization
In section 2.2 we described how to create new grammar rules inserted into the Grammar file of the fiber. This way we could define new nonterminals. The tokenizer used by EasyExtend is an extended version of the tokenizer found in tokenize.py of the standard library. It is split into two modules, eetokpattern.py and eetokenizer.py. The token.py module of Python's stdlib is enhanced to eetoken.py.
2.4.1 Token definitions
When you activate the -t option and run a fiber on the console the
token stream for each input will be shown:
C:\lang\Python24\Lib\site-packages\EasyExtend\fibers\gallery>python
fiber.py -t
___________________________________________________________________________________
Gallery
Running on Python 2.4.3 (#69, Mar 29 2006, 17:35:34) [MSC v.1310
32 bit (Intel)]
Fiber documentation:
www.fiber-space.de/EasyExtend/doc/gallery/gallery.html
___________________________________________________________________________________
gal> def f():
.... pass
[token>
----------------------------------------------------------------------------------------.
 Line | Columns | Token Value | Token Name | Token Id   | Token Class | Class Id   |
------+---------+-------------+------------+------------+-------------+------------+
  1   | 0-3     | 'def'       | NAME       | 3585 -- 1  | NAME        | 3585 -- 1  |
  1   | 4-5     | 'f'         | NAME       | 3585 -- 1  | NAME        | 3585 -- 1  |
  1   | 5-6     | '('         | LPAR       | 3591 -- 7  | OP          | 3635 -- 51 |
  1   | 6-7     | ')'         | RPAR       | 3592 -- 8  | OP          | 3635 -- 51 |
  1   | 7-8     | ':'         | COLON      | 3595 -- 11 | OP          | 3635 -- 51 |
  1   | 8-9     | \n          | NEWLINE    | 3588 -- 4  | NEWLINE     | 3588 -- 4  |
  2   | 0-4     | '    '      | INDENT     | 3589 -- 5  | INDENT      | 3589 -- 5  |
  2   | 4-8     | 'pass'      | NAME       | 3585 -- 1  | NAME        | 3585 -- 1  |
  2   | 8-9     | \n          | NEWLINE    | 3588 -- 4  | NEWLINE     | 3588 -- 4  |
  3   | 0-0     | ''          | DEDENT     | 3590 -- 6  | DEDENT      | 3590 -- 6  |
  3   | 0-0     | ''          | ENDMARKER  | 3584 -- 0  | ENDMARKER   | 3584 -- 0  |
----------------------------------------------------------------------------------------'
<token]
....
gal>
Each non-alphanumeric character is expressed by a token with its own characteristic token id. Whitespace is dropped unless it indicates indentation. However it can easily be reconstructed from column information.
The following excerpt shows some of the token patterns:
...
Operator = group(r"\*\*=?", r">>=?", r"<<=?", r"<>", r"!=",
                 r"//=?",
                 r"[+\-*/%&|^=<>]=?",
                 r"~")
Bracket = '[][(){}]'
Special = group(r'\r?\n', r'[:;.,`@]')
Funny = group(Operator, Bracket, Special)
PlainToken = group(Number, Funny, String, Name)
...
3. Fibers + Fiberspace
3.1 Fiber transformers
We have described how to define new syntax and how to analyze and synthesize CSTs. Now we consider the semantic and transformational aspects of EasyExtend.
3.2 The fiberspace
We have already called an extension language in EasyExtend a fiber. A fiber is a Python package
defined by a minimal collection of modules. In the previous section we
learned about the relationship between
Grammar and PyGrammar.py and how to use the basic tools for CST
synthesis to create new syntax trees.
+- [any-fiber]
    +- __init__.py
    +- conf.py
    +- fiber.py
    +- PyGrammar.py
    +- Grammar
    +- [fibermod]
        +- __init__.py
        +- symbol.py
        +- token.py
Next we will consider some details of the conf.py as well as the symbol.py module.
3.1.1 Fiber offsets
When you open the symbol.py module of an arbitrary fiber like Gallery you will notice a copy of Python's symbol.py module, just with different node ids:
#
This file is automatically generated; please don't muck it up!
#--begin constants--
single_input = 3840
file_input = 3841
eval_input = 3842
decorator = 3843
decorators = 3844
...
...
testlist1 = 3916
encoding_decl = 3917
on_stmt = 3918
repeat_stmt = 3919
switch_stmt = 3920
case_stmt = 3921
#--end constants--
These node ids are not selected by chance. When you open conf.py of the same fiber, the file starts with the following text:

########################################################################
#
#                     Language configuration
#
########################################################################

# --- unique fiber offset ---
# Warning! do not touch this. It will be automatically generated by fs.py

FIBER_OFFSET = 3584

# modules and functions
import os
import sys
...
When you add 256 to 3584 you get 3840, which is exactly the node id of the first node single_input of symbol.py. Moreover the FIBER_OFFSET is a multiple of 512: 3584 = 7*512.
All these data are generated automatically and you don't have to care about them a lot, unless you debug fiber transformers. Each fiber is characterized by a unique FIBER_OFFSET, which is a multiple of 512 and used to identify its nodes in an arbitrary CST. The unique segment for each fiber contains 512 possible nodes. The lower half is reserved for tokens, while the upper half is reserved for symbols.
X in [FIBER_OFFSET, FIBER_OFFSET+255]        X is a token
X in [FIBER_OFFSET+256, FIBER_OFFSET+511]    X is a symbol

Currently an average fiber uses ~60 tokens and ~80 symbols, so there is still much space left.
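This segment layout can be sketched with a small illustrative helper (classify is our name, not an EasyExtend function):

```python
FIBER_OFFSET = 3584  # the Gallery offset shown above, 7*512

def classify(node_id):
    # The position of a node id within its 512-wide fiber segment decides
    # whether it denotes a token (lower half) or a symbol (upper half).
    base = node_id % 512
    return "token" if base < 256 else "symbol"
```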
3.1.2 Projection
Among the numbers that matter here is also MAX_PY_SYMBOL defined in cst.py:

MAX_PY_SYMBOL = 333   # maximal node_id of Python 2.4 symbols

This number is the maximum node id found in Python's symbol.py and is release specific. For example Python 2.5 defines some new grammar rules and its MAX_PY_SYMBOL is 339. MAX_PY_SYMBOL is so important because it is related to node id projection. The node ids of a fiber are meaningless to the Python compiler. The compiler does not know how to deal with a node id 3840. Before we compile a fiber CST we have to project it onto a Python CST. This projection is very simple because we just need to keep the remainder of a node id under division by 512.
projection
CSTFiber -------------> CSTPython
Projection can also be used to indicate transformation errors. If a
projected CST contains node ids exceeding MAX_PY_SYMBOL the
transformation is still not accomplished.
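The projection itself is a one-liner; project is an illustrative name, not necessarily the one used in EasyExtend:

```python
MAX_PY_SYMBOL = 333  # maximal node id of Python 2.4 symbols, see above

def project(node_id):
    # Keep the remainder under division by 512, mapping a fiber node id
    # back onto the corresponding Python node id. A result greater than
    # MAX_PY_SYMBOL indicates an untransformed fiber node.
    return node_id % 512
```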
3.1.3 The idea of a fiberspace
When each fiber defines its own distinctive set of node ids,
transformations can be assigned uniquely to these nodes. When trying to
mix different fibers within the same transformation context we need a
central transformer that holds all particular fiber transformers and
dispatches nodes to specified transformations of these FiberTransformer
instances. This central transformer is called FiberspaceTransformer or just
FSTransformer.
The eetransformer.py module creates a single instance of FSTransformer
called fs_transformer. The run()
method of any FiberTransformer simply delegates the call to
fs_transformer. There is also some registration and scheduling of
FiberTransformers involved but the essential idea is that the
fs_transformer coordinates each node transformation.
3.1.4 Ranging
When a module gets parsed into a CST the domain of the node ids is always completely determined by the main fiber Fmain. Let now another fiber F' enter the fiberspace. How can a node be dispatched to a node transformer of F' when F' defines a different set of node ids? The answer is called ranging. For each node id n we define the base n0 = n % 512, and the range

[n0, n0 + FIBER_OFFSET_F1, n0 + FIBER_OFFSET_F2, ...]

defines all admissible node ids for which a transformer may be defined. For two node ids n1 and n2 we write n1 ~ n2 when they are in the same range.
If a node id n gets matched exactly, the corresponding node transformer will be selected when available; otherwise ranging will be applied to select the first transformer available for a node in the range. Ranging is also important on another occasion, where some CST fragment has to be inserted into an existing CST. In this case all node ids in its range are considered equivalent, and if n1 is the root node of the CST fragment it may replace any other fragment with root n2 with n1 ~ n2.
Caution: The above transformer selection does not deal with competition among transformers applying to the same node id but is a simple greedy algorithm. A more advanced method to deal with competition has still to be investigated.
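The relation n1 ~ n2 can be sketched as follows (the helper names are ours):

```python
def base(node_id):
    # the fiber independent base of a node id
    return node_id % 512

def same_range(n1, n2):
    # n1 ~ n2 holds when both ids share the same base, i.e. they denote
    # the same grammar rule in possibly different fibers
    return base(n1) == base(n2)
```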
3.2 The framework
If you consider a framework as an "application with holes", the holes of the EE framework are the particular fibers that contain the language definition. While the EasyExtend package is a class and function library used by any fiber, the complete fiber is passed back to EasyExtend as a module! So the fiber module serves as the language interface, which has to satisfy a few conventions. A minimal fiber has to look like this:
# minimal fiber that provides EE framework interface
import EasyExtend    # EasyExtend package
import conf          # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar     # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''
Since we have to pass the fiber to EE we need to import it somehow.
This can be done in a separate main.py for instance or we import the
fiber within the fiber and pass it afterwards:
# minimal fiber that provides EE framework interface
import EasyExtend    # EasyExtend package
import conf          # defines some fiber properties e.g. the console prompt, paths etc.
import PyGrammar     # contains fiber specific DFA tables used by the parser

class FastTransformer(EasyExtend.eetransformer.Transformer):
    '''
    Defines fiber specific transformations
    '''

if __name__ == '__main__':
    import sys
    import minimal_fiber    # import this fiber
    if len(sys.argv) > 1:
        py_module = sys.argv[1]
        EasyExtend.run_module(py_module, minimal_fiber)
    else:
        console = EasyExtend.create_console("MinimalFiber", minimal_fiber)
        console.interact()
As long as no deviant Grammar is defined, the behaviour of the minimal fiber shall be the same as that of the Python interpreter.
3.2.1 eetransformer
The Transformer module provides mappings between the CST of a fiber F and a Python CST. For this purpose the F-CST will be traversed and the F specific nodes will be replaced. Each node that shall be visited has to be made public using a method signature like this one, defined in the Gallery fiber:

handle_repeat_stmt( self, node )
    Transformation of the content of node, where node is the repeat_stmt node of the fiber CST. According to the transformation rule for the repeat statement defined for Gallery it has to be replaced by a while_stmt node of a Python CST.

An essential part and the main difficulty of the transformer algorithm is to move the returned nodes to the right place in the CST. This aspect is not solved analytically for any kind of node, but it turned out that a stable heuristics is applicable.
Depending on the node type that shall be replaced, return either:
- A node of type test, which represents the most general form of a Python expression in the CST.
- One or more nodes of type stmt that represent arbitrary Python statements in the CST.
For convenience the surgery module offers the methods any_test and any_stmt that wrap expressions or statements of any kind such that test and stmt nodes result:

any_test( node )
    Returns a correctly built test node whenever the input is one of the following nodes: test, and_test, lambdef, not_test, comparison, expr, xor_expr, and_expr, shift_expr, arith_expr, term, factor, power, atom. Additionally each node that is part of the atom grammar rule, at the bottom of the chain above, such as Name, Number etc. fits as well.

any_stmt( node )
    Returns a correctly built stmt node whenever the input is one of the following nodes: stmt, simple_stmt, compound_stmt, small_stmt, if_stmt, for_stmt, while_stmt, try_stmt, break_stmt, continue_stmt, return_stmt, raise_stmt, yield_stmt, expr_stmt, print_stmt, del_stmt, pass_stmt, flow_stmt, import_stmt, global_stmt, exec_stmt, assert_stmt.
Example: in Gallery the repeat loop is transformed as follows:

class FastTransformer(Transformer):
    def handle_repeat_stmt(self, node):
        "repeat_stmt: 'repeat' ':' suite 'until' ':' (NEWLINE INDENT test NEWLINE DEDENT | test NEWLINE)"
        _suite = find_node(node, symbol.suite)
        _test  = find_node(node, symbol.test, level=0)
        _until = if_stmt(_test, suite(any_stmt(break_stmt())))
        _suite.insert(-1, any_stmt(_until))
        return any_stmt(while_stmt(any_test(Number(1)), _suite))
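To see what this rule produces at source level, here is the repeat...until loop expressed in plain Python, following the transformation just described: the suite is wrapped in a while 1 loop and the until-condition becomes a terminating if/break. This is an illustrative sketch, not output generated by EE.

```python
# Plain-Python equivalent of Gallery's repeat...until, following the
# transformation rule above: wrap the suite in "while 1:" and insert
# "if <test>: break" before the end of the suite.
def f_repeat_until(x):
    while 1:
        x -= 1
        if x < 7:   # the until-condition becomes a terminating if/break
            break
    return x
```

For example, f_repeat_until(7.5) yields 6.5, matching the pdb session trace shown in the debugging section.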
3.2.1.1 Being __like_main__
With EasyExtend the combination python main.py represents
a specialized interpreter, not python alone. But running e.g.

python main.py gallery_module.py

has the drawback that the convenient block statement

if __name__ == '__main__':
    ...

if present in gallery_module.py, will not be executed, since
gallery_module.py is not the __main__ module but main.py is! The
EE solution is a generic transformation of the block
statement into a function called __like_main__. This
function is not meant to be defined explicitly by the user (
please don't define it, otherwise I select a weirder name ) but
is created by the eecompiler during parse-tree transformation:

if __name__ == '__main__':
    BLOCK

==>

def __like_main__():
    BLOCK
if __name__ == '__main__':
    __like_main__()

Running the compiled module gallery_module.pyc from python will
cause execution of __like_main__(), because gallery_module.py is now
the __main__ module. If gallery_module.py is instead called from
python main.py, the __like_main__ function is called by main.py:
# main.py
...
if __name__ == '__main__':
    mod_name = sys.argv[1]
    mod = create_module(mod_name)
    try:
        mod.__like_main__()
    except AttributeError:
        pass
3.3 The eeimporter
The eeimporter module and the Importer
class defined in this module play an important role in defining the
boundaries of a fiber. Let m = m(F) be a module
of the fiber F; then
we need to know how m has to import
dependent modules. There are some basic alternatives:
- All files belong to a fiber F. This is
basically what Python asserts.
- Define a FIBERPATH environment variable. Each module in the
specified directories or their subdirectories will be imported as a module of
fiber F.
- Define a regular expression to identify naming
characteristics. That's how the coverage
fiber defines dependencies between a module test_A.py
and a module A.py.
- Use a general criterion. Gallery assumes that
Fibers\gallery is the root directory where gallery related modules are
defined. Some of those modules are excluded from import.
- Others to be defined.
An instance of an Importer
class or a derived class will be added to sys.meta_path
to create an "import hook". This is a bit of an undocumented Python
feature that I'm not going to explain here. What I know about the behaviour
of those import hooks was derived from experimentation and from reading
the test code of test_importhooks.py.
Note: if you try to import a module that was
already imported elsewhere, Python takes the module from
sys.modules, and the import hook won't be triggered. A number of
modules have already been imported by the time the coverage fiber's Importer
registers itself at sys.meta_path.
Ironically, all of the framework modules are among them, so the quality
of the EE unittests cannot be checked with coverage - at least not
directly.
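The mechanics of such a meta_path hook can be sketched with a minimal finder that records import attempts without handling any of them. The names here are illustrative, not EasyExtend's API; note also that modern Python calls find_spec, whereas the Python 2.4-era protocol the text describes used find_module.

```python
import sys

class TracingFinder:
    """Records every import attempt, accepts none, so the default
    machinery still performs the actual import."""
    def __init__(self):
        self.seen = []
    def find_spec(self, fullname, path=None, target=None):
        self.seen.append(fullname)
        return None  # decline; let the default finders handle it

finder = TracingFinder()
sys.meta_path.insert(0, finder)  # register the "import hook"

# As the note above says, already-imported modules bypass the hook
# because Python serves them from sys.modules, so clear the cache first.
sys.modules.pop("colorsys", None)
import colorsys
```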
The Importer
class provides two methods that are relevant if you want to define
your own custom imports: one registers the importer at sys.meta_path;
the other, accept_module, defines fiber specific acceptance rules for
a module at fs_path, where fs_path is the file system path of
the module. The accept_module
method replaces the find_module
method as the method that shall be overwritten in Importer subclasses.
2.2.3 The conf module
The conf module is used to define some basic fiber related settings.
Currently only the prompt attribute and the ext attribute are
supported, where prompt refers to the sys.ps1 prompt shown in the
interactive console and ext refers to the bytecode file extension.

>>> import gallery
>>> gallery.conf.prompt
'gal> '
2.2.4 fibermod
Fibers that define their own grammar rules / tokens need specialized versions
of Python's stdlib modules symbol.py and token.py.
The symbol.py
module is newly generated whenever a Grammar file changes.
The token.py
and tokenize.py
modules have to be adapted by hand when new punctuation is added. I
used to define a fibermod
package in the root package of the fiber that contains copies of those
modules. The modules are imported in the fiber's __init__.py and
substitute the standard imports.
2.4 Fiber compiled modules
We have seen how to create a fiber and how to execute modules with
specific language definitions in the fiber's context. But although we
needed the fiber to compile those modules, they still compile to
ordinary Python bytecode, and we should be able to mix the compiled
modules freely. Hence both of these execution calls should work in
principle:

(A) python fiber.py gallery_module.py
(B) python gallery_module.pyc

In case (A) the whole EE framework and the fiber
module can be used all along. Functions can be defined in the
fiber and function calls can be generated into the module. But those
functions/objects don't reside in the module but somewhere else. Since
they are language specific they should be accessible by all modules of
the language.
The EE framework offers the __publish__
fiber attribute.
- __publish__
- The fiber attribute __publish__ is a list of names. It is
read at startup, and for each name
in __publish__ the following assignment is made:
__builtin__.__dict__[name] = fiber.name.
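The protocol can be sketched as follows. FakeFiber and publish are hypothetical names for illustration, not EE's API, and builtins is the Python 3 spelling of the __builtin__ module named in the text.

```python
import builtins  # Python 2, as in the original text, spells this __builtin__

class FakeFiber:
    """Hypothetical stand-in for a fiber module object."""
    __publish__ = ["fiber_helper"]
    @staticmethod
    def fiber_helper(x):
        return x * 2

def publish(fiber):
    # For each name listed in __publish__, bind the fiber attribute as
    # a builtin, so every module of the language can reach it by name.
    for name in fiber.__publish__:
        builtins.__dict__[name] = getattr(fiber, name)

publish(FakeFiber)
```

After publish runs, fiber_helper is callable from any module without an import, which is exactly the effect case (A) relies on.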
Now in case of (A) the execution of the __publish__
protocol can simply be triggered by the startup functions involving the
fiber. The defined objects are
immediately accessible in gallery_module.py.
But in case (B) the names are not present and execution will fail.
Thus we have to import the fiber explicitly in the module. In EE we
can abstract from the import of the concrete fiber and use the statement
import __fiber__
instead. This import statement is more a
declaration than an actual execution. It will be transformed by EE
according to:

import __fiber__   ==>   import <fiber_name>.<fiber_mod> as __fiber__

The <fiber_name> and <fiber_mod>
placeholders will be replaced by the actual fiber values. The new
module-valued __fiber__
variable can be used to reflect on the fiber. Using import __fiber__ in
place of a more concrete import advice has another advantage. Suppose
a new fiber F2 extends a given fiber F1 and we want to
use names defined in F2 in a module M that was compiled
with F1. Then we simply have
to recompile M using F2 without touching the specific
fiber import.
Note: the import __fiber__ declaration shall be placed in the
module namespace. Don't
use it within a class or function definition.
You might check this out by importing e.g. the test_gallery
module after it has been compiled to bytecode.
2.5 Suffixes
There is currently no support for alternative file extensions. This
limitation is due to the fact that file suffixes like "py" or "pyc" are
hardcoded within the CPython runtime.
3. The console is your friend
The eeconsole
module of the EE
package implements a general interactive console for all kinds of
fibers. The eeconsole is
a somewhat redesigned as well as enhanced version of Python's InteractiveConsole
object. If you type a fiber command at the
console prompt it has to
be unparsed to Python source. The result may be one or more Python
commands. So eeconsole
essentially realizes an interplay between the eetransformer, which
creates Python CSTs from fiber definitions, and cst2source, which restores
Python source code from CSTs.
3.1 Launching a fiber specific console
You can define a prompt
variable in the conf.py
module. This variable is accessed by the eeconsole,
and its value replaces
the standard prompt sys.ps1.
The second prompt sys.ps2
will be rendered as a dotted line whose length corresponds
with that of sys.ps1.
Starting the gallery console under
Windows may look like:
c:\Python24\EE\Fibers\gallery>python
fiber.py
(GenericConsole)
gal> x = 0
gal> repeat:
.... x+=1
.... print x
.... until: x==3
....
1
2
3
gal>
3.2 Debugging
Debugging using the pdb module
should work as usual, but don't forget to pass globals()
explicitly into pdb.run().
Example: using the module gallery.test.funcs
and the function

def f_repeat_until(x):
    repeat:
        x -= 1
    until:
        x < 7
    return x

we get the following session trace on the Gallery console:
gal> import gallery.test.funcs as
funcs
gal> import pdb
gal> pdb.run("funcs.f_repeat_until(7.5)",globals())
> <string>(1)?()
(Pdb) s
--Call--
> c:\python24\lang\gallery\test\funcs.py(6)f_repeat_until()
->
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(8)f_repeat_until()
-> repeat:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(9)f_repeat_until()
-> x-=1
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(10)f_repeat_until()
-> until:
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(12)f_repeat_until()
-> return x
(Pdb) n
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()
(Pdb) n
--Return--
> c:\python24\lang\gallery\test\funcs.py(266)f_repeat_until()->6.5
(Pdb) n
--Return--
> <string>(1)?()->None
(Pdb) n
gal>
4. Discussion
4.1 Caveats
EasyExtend is hooked into an existing software architecture, namely that
of the CPython interpreter suite. From this point of view it might be
more surprising that the integration is still quite smooth than that it
has a few pitfalls. But there are some you should be aware of.
- File suffixes. It can be annoying to name DSL file
extensions *.py just to please the import machinery, which uses some
hard coded suffixes. Besides being confusing, this also causes conflicts
with tools like distutils that have greedy compiling policies.
- lnotab. Some code transformations confuse line recovery.
The creation of
the line-number table ( lnotab ) is based on constraints that can
occasionally be violated ( monotonicity violation ). This problem affects
any kind of code generator for CPython, not only that of EasyExtend. As
a consequence line number information is not reliable, and sometimes the
runtime reports completely bogus line number references for an
exception. The deeper cause is that the ( Python 2.4 ) compiler
uses a mixture of line number information provided by the parser in the
parse tree and its own reconstruction rules that apply while compiling to
bytecodes. All this is fed into a very clever and dense encoding
scheme of lnotab that assigns line numbers to bytecodes. This problem
is rather serious, and the only known workaround is to unparse the
transformed CST into Python source code and debug that.
4.2 Prospects
EasyExtend is still in its infancy. One can extend the system in every
direction, both in breadth and in depth. One might
- target EasyExtend at other Python implementations, in
particular IronPython and PyPy.
- use a more powerful LR parser like PLY to replace or
supplement the current LL(1) parser.
- develop a fiber fusion technology and migrate fibers
automatically to more recent versions of Python.
- create the basic cst.py module directly from Python's
Grammar.
- create more, and also more interesting, fibers than are available
in the current collection of samples.
- consider other target languages than Python.