Introduction
This is an introductory level tutorial for EasyExtend. It will guide
you
through the creation of an extension language that is simple enough to
be understood in any detail but also complex enough to shed light on
many aspects of EasyExtend. After reading this tutorial you shall be
able to create
your own extension languages, called langletsin
the EasyExtend context. For matters of brevity the tutorial won't cover
all aspects of EasyExtend that are explained in more detail in the
Trail parser document or the project overview document.
Prerequisites:
- understanding the Python language on
tutorial level
- understanding the very
basics of grammars, lexers and parsers
- having installed
the
EasyExtend package on your machine
- opened
a Python console and an
editor.
1.
Create a langlet
Basically
a langlet
encodes an
extension language or a code generator within the EasyExtend framework.
It might be helpful to think of a langlet as a Python dialect that
adds some new syntactical and/or semantical conventions to the
language.
EasyExtend
preprocesses source files written in langlet source code and generates
proper Python code that will be
compiled to bytecodes.
Physically a langlet is represented by a small set of files on the disk
which are
mandatory for each langlet definition. EasyExtend implements a package
level command named new_langlet
which creates a langlet for you. The langlet we create in this
tutorial is called hex_langlet.
The hex_langlet is a domain specific Python extension used to deal with
hexadecimal objects as bitmaps.
We start the
creation process from a Python console:
>>>
import EasyExtend
>>>
EasyExtend.new_langlet("hex_langlet", prompt = "hx> ",
source_ext = "phx")
*** Modify
C:\lang\Python25\lib\site-packages\EasyExtend\langlets\hex_langlet\lexdef\lex_nfa.py
***
*** Modify
C:\lang\Python25\lib\site-packages\EasyExtend\langlets\hex_langlet\parsedef\parse_nfa.py
***
New langlet 'hex_langlet' created:
...
[EasyExtend]+-[langlets]
+- [hex_langlet]
+- __init__.py
+- run_hex_langlet.py
+- conf.py
+- langlet.py
+- [lexdef]
+- __init__.py
+- lex_symbol.py
+- lex_token.py
+- lex_nfa.py
+- Token.ext
+- [parsedef]
+- __init__.py
+- parse_symbol.py
+-
parse_token.py
+- parse_nfa.py
+- Grammar.ext
+- [reports]
+- [tests]
>>> |
This
takes 3-4 seconds on my machine.
Three
input parameters were set in the
new_langlet function call. Those are
"hex_langlet"
|
|
the
name
of our new langlet. This parameters is mandatory |
prompt
= "hx> " |
|
prompt
used in hex_langlet consoles ( Remark the blank after "hx>"! ).
If
omitted the Python standard prompt
">>> " is used |
source_ext
= "phx" |
|
file
extension for langlet source files. If omitted the Python standard
source file extension py is used. |
Hex
objects shall be hybrids between numbers and strings.
1.1
A first start
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python
run_hex_langlet.py
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00
0
hx>
|
When we type in
0x00 0x00 we see a typical error message of the Trail parser in
EasyExtend 3.0
hx>
0x00 0x00
Traceback (most recent call last):
...
File
"C:\lang\Python25\lib\site-packages\EasyExtend\trail\cfaparser.py",
line 259, in handle_error
raise
ParserError(s+rule, **{"token": tok, "selection": selection})
SyntaxError:
Failed to parse input '0x00' at (line 1 , column 9).
0x00 0x00
^^^^
Expected symbol(s) from
applied grammar rule: simple_stmt: small_stmt ( T_NEWLINE |
';' )
~~~~~~~~~~~~~~~~~~~
hx> |
The
Trail parser complains that it can't parse '0x00'. It expects either a
T_NEWLINE token i.e. a line break '\n' or a semicolon ';' while
applying grammar rule simple_stmt. To put it short: after typing a
single hex number the parser expects that the statement is finished.
Part of our efforts will be the avoidance of this Error message while
typing the same input in the hx-console.
1.2
Functional requirements
Hex
objects shall provide arithmetic operations for adding ( __add__ ) ,
multiplying ( __mul__) ,
etc. Hex objects. They shall also provide list operations for slicing,
concatenation, subscripting ( __getitem__ ) etc. The Hex object shall
convert ints and longs but also appropriate strings that can represent
hex numbers on constructor call.
2.
Defining Syntax
After
having roughly described functional requirements we care for a look
& feel. Hex objects shall feel like being native datatypes. Use
cases are described in the box below.
hx>
0x00
# 0x00 is a Hex object represented by the hex-literal notation
0x00
hx> h = 1
2
#
sequences of numbers are allowed and will be converted into Hex objects
hx> h
0x01 0x02
hx> h ++ 0x03 0x04 # new binary operator
'++' defines
concatenation among Hex
objects
0x01 0x02 0x03 0x04
hx> h[0]
#
subscripts
0x01
hx>
h+11
# a little arithmetics
0x01
0x0d
|
2.1
Syntax Requirements
We summarize the syntactical
requirements as
- A sequence of
hexadecimal numbers 0xXX.. 0xYY... shall be
mapped onto one Hex object. A single hexadecimal literal shall be
treated as a Hex object.
- A concatenation
operator ++ shall be defined among Hex objects.
Together with the new operator ++ we define a __concat__
protocol (
and
also an __rconcat__ protocol ) that can be bound to other
objects as well.
In order to implement those
requirements we have to understand EasyExtend fundamentals.
2.2
Grammars and Tokenizers
EasyExtend
descibes all syntax in the form of grammar files in EBNF notation. This
is convenient syntax to describe grammar rules for top down parsers.
EasyExtend uses plain EBNF i.e. there is no notation for dealing with
operator precedences and there are no parser actions: all
transformations are post processings on parse CSTs ( concrete
syntax trees ) or token streams created by a lexer.
Actually
the reader for grammar files in EasyExtend is realized as a
langlet, the grammar_langlet.
This illustrates another property of EasyExtend. It can be used to
define arbitrary languages, not just Python extensions. So the very
first language we consider is EBNF:
#
Grammars Grammar
file_input: ( rule | NEWLINE )*
ENDMARKER
rule: NAME ':' rhs NEWLINE
rhs: alt ( '|'
alt )*
alt: item+
item: '[' rhs ']' | atom [ '*' |
'+' ]
atom: '(' rhs ')' | NAME | STRING | '_'
|
Nonterminals
( grammar- or production rules ) like file_input, rule, rhs etc. are
usually written in lower case letters while terminals like NAME, NEWLINE etc.
are written in uppercase letters. Line comments are stripped. As an
excercise you can apply the grammar on itself.
2.3
Terminals and Terminals of Terminals
The rule descriptions
above can be found in the directory
EasyExtend/langlets/grammar_langlet/parsedef/Grammar
A complementary
file for terminal definitions in the dual branch
EasyExtend/langlets/grammar_langlet/lexdef/Token
This
highlights another property of EasyExtend which is new in version 3.0.
There is a symmetry between grammar rule definitions used for parsers
and grammar rule definitions used by lexers. EasyExtend 3.0 doesn't use
regular expressions to define lexical content. Instead EBNF grammars
are used to define the terminals of the parser as well. The logical
status of terminals / non-terminals becomes somewhat relative: the
non-terminals of the lexical grammar are terminals of the parser.
This
leads naturally to the question of a bottom line. In EasyExtend 3.0 the
terminals of the terminals are character-sets which are stated
explicitely. In future versions this might be extended
to characteristic functions of sets which leads to full UTF-8
support which is not yet available. So we have roughly following
hierarchy
TokenToken |
Character
Sets and functionally specified terminals. |
Token |
Terminals
of Grammar ( EBNF ) |
Grammar |
Production
rules of Grammar ( EBNF ) |
The
TokenToken objects have usually the form T_ANY, T_INDENT
... or A_WHITE,
A_LINE_END, A_DIGIT...
The T_Object terminals
are functionally specified and whereas A_Object token are
charactersets.
2.4 Grammar.ext
and Token.ext
A
fundamental Grammar file and a Token file can be found in the root
directory of EasyExtend. The Grammar file will always be the Grammar
file of the current Python distribution i.e. the Grammar file of Python
2.5. The Token file will characterize the lexical structure of Python
accordingly. You have to lookup them quite often but will never have to
change them.
In order to define new rules for
parsers and lexers the parsedef
and lexdef
directories of your langlet provide files Grammar.ext
and Token.ext.
Adding a new rule or modifying an existing rule always takes place in
one of these files. If you remove e.g. Grammar.ext and place a Grammar
file without the extension suffix in your parsedef directory EasyExtend
won't create a parser for an extension language but one that is exactly
suitable for the specified Grammar.
In the CST display each
nonterminal is formatted like:
|
just
renaming a Grammar.ext file to Grammar might not have an immediate
effect. That's because EasyExtend will only create a new parser when
the Grammar file is more
recent
than the file parse_nfa.py that contains the parser rule definitions.
Otherwise a new parser was created on each startup. You have to open
the Grammar file apply some irrelevant changes and save it. This will
modify the timestamp of Grammar and EasyExtend will respond. |
2.4.1
Token.ext for hex_langlet
We open Token.ext in the
hex_langlet/lexdef directory and add the following rule:
#
Token.ext of hex_langlet
PLUSPLUS: '++'
|
This is almost everything we have to do. However the PLUSPLUS
nonterminal must be hooked somewhow into the existing token grammar.
This is done by extending an existing rule.
#
Token.ext of hex_langlet
OPERATOR: OPERATOR_DEF |
PLUSPLUS
PLUSPLUS: '++'
|
The OPERATOR definition overwrites
the definition given in the fundamental Token file.
2.4.2
Grammar.ext for hex_langlet
Next we change Grammar.ext of our
langlet. First we define a new numbers non-terminal.
#
Grammar.ext of hex_langlet
numbers: NUMBER+
|
and then we hook it into an existing rule.
#
Grammar.ext of hex_langlet
atom:
'('
[testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`'
testlist1 '`' |
NAME | numbers | STRING+
numbers: NUMBER+
|
The new numbers non-terminal replaces the former NUMBER
terminal.
Otherwise the atom rule is just copied from the fundamental Grammar.
2.4.3
Operator precedences
Next
we want to place the '++' operator into Grammar.ext. Note that
you
don't have to use the notion PLUSPLUS but can use the content string
'++'
directly. EasyExtend will convert it into an appropriate numerical
encoding automatically.
|
Currently
EasyExtend doesn't produce a Warning or an Error when it finds a token
string which it can't assign to a non-terminal definition. This might
change in future.
|
For insertion of '++' we have to deal
with operator
precedences i.e. binding strengths of operators. In EasyExtend those
are directly expressed in the structure of the grammar.
Example: a
+ b * c
One can start with a rule that does not
respect binding strength.
expr:
term (( '+' | '*') term)*
term: NAME | NUMBER
|
and refactor it such that '*' binds stronger than '+'
expr:
term ('+' term)*
term: factor ('*' factor)*
factor:
NAME | NUMBER
|
A top-down parsers starts with expanding the term
non-terminal. It has the internal product structure. Only when the
expansion of the term is finished the parser proceeds by either
terminating the parse ( zero or more repetitions allow termination ) or
looking for a '+' operator and a subsequent term.
Following
general rule applies: the
lower the nonterminal in the rule hierarchy the stronger the bindings.
Now
lets refocus on '++'. It shall have a lower precedence than any
arithmetic operations and a higher precedence than logical operations
like and / or / not.
In the concrete Grammar file we
can hook it into the comparison rule:
...
test: and_test ('or' and_test)* | lambdef
and_test: not_test ('and' not_test)*
not_test: 'not' not_test | comparison
comparison: expr (comp_op expr)* | concat_expr
comp_op:
'<'|'>'|'=='|'>='|'<='|'<>'|'!='|'in'|'not'
'in'|'is'|'is' 'not'
expr: xor_expr ('|' xor_expr)*
...
|
This looks somewhat hackish but comparison is an appropriate
location for a hook of our new concat_expr with respect to operator
precedence. The complete Grammar.ext file will look like
#
Grammar.ext of hex_langlet
comparison: expr
(comp_op expr)* | concat_expr
concat_expr:
expr ('++' expr)+
atom: '('
[testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`'
testlist1 '`' |
NAME | numbers | STRING+
numbers: NUMBER+
|
Note that we used the rule form
concat_expr:
expr ('++' expr)+
|
instead
of
concat_expr:
expr ('++' expr)*
|
Using
the latter though wouldn't harm us either although there would be an
ambiguity now. The concat_expr rule has to be transformed anyway and if
there was a single expr
only we could wrap it into a comparison
node. By using the + repetition advice we can omit handling this
additional case. If you are otherwise worried about left factoring
issues, lookaheads, irresponsible backtracking costs etc. I recommend
reading the Trail documentation.
2.4.4
A second run
After defining the new grammar rules
we repeat our initial session. The trace has slighly changed:
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python
run_hex_langlet.py
*** Modify
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet\lexdef\lex_nfa.py
***
*** Modify
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet\parsedef\parse_nfa.py
***
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00
0
hx> 0x00 0x00
Traceback (most
recent call last):
File
"C:\lang\Python25\lib\site-packages\EasyExtend\eeconsole.py", line 256,
in compile_cst
_code =
compile(src,"<input>","single", COMPILER_FLAGS)
File "<input>", line 1
0x00 0x00
^
SyntaxError: invalid syntax
hx>
|
The SyntaxError will now be
reported by the
compiler not the parser. This is consistent with our
expectations
because the parser knows the new syntax rules now and creates a parse
tree. The compiler however doesn't know how to deal with the new nodes.
This problem shall be addressed in the next section.
3.
Defining Behaviour
So
far we have just
added new syntax to Python. In the next step we have to transform
the parse tree to eliminate all nodes again we have
just
added. Sounds silly but the Python compiler doesn't understand
our extended parse trees for sure. Given a CSTExt
(
Concrete Syntax Tree = parse tree of
our langlet ) we have to produce a valid CSTPy
i.e. a Python parse tree. The complexity of the
transformation T: CSTExt -> CSTPy
depends entirely on the behaviour we want to express. The
transformation isn't always simple but fortunately there is already a
rich function library to settle upon for building transformations you
can use.
EasyExtend
2.0 had also a high level macro like transformer API. The
implementation was very complex and I dropped it in EasyExtend 3.0. A
hopefully much simpler implementation will likely be re-enter EE in a
later release. But even then it is worthwhile to learn transforming
CSTs
using just CST builder functions. This approach is lower level but more
flexible and more general. Note also that transforming CSTs is
quite comparable with transforming ASTs ( Abstract Syntax Trees ) given
the right API i.e. EasyExtends API :)
Like
adding new grammar
rules, transforming CSTs requires a good understanding of Pythons
grammar. The Grammar file is our central script and data sheet. Almost
everything
is derived from it: the parser, the CST builder functions and the CST
validators. We are done very soon with token definitions but we
always have to deal with the Grammar. Using EasyExtend for a while you
will
become very fluent in thinking in those grammar rules. It's a bit like
learning an own little programming language. Well, the grammar is a declarative
programming language.
3.1
Inspecting CSTs
You
won't get anything done in EasyExtend without CST inspections.
Unfortunately CSTs are huge and verbose even for quite small
expressions. You can always unparse
CSTs i.e. convert a CST back from tree structure into source code. You
can
mark certain nodes in CSTs for recovery purposes, you can
display
CSTs before and after transformations.You can check your
final CSTPy
against the Python Grammar before compilation. Despite all
this
debugging CSTs is not fun. Just don't say I didn't warn you.
3.1.1
Token Stream inspections
When
you keep an error message from the parser which obtains that a
certain parse rule can't be applied a good way to
start is
examining the tokenstream. For displaying token streams start your
session with the -t
option:
hex_langlet>python
run_hex_langlet.py -t
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00 ++ 0x01
[token>
----------------------------------------------------------------------.
Line
| Columns | Token
Value
| Token Name | Token
Id |
-------+---------+---------------------+---------------+--------------+
1
| 0-4 |
'0x00'
|
NUMBER
| 36354 --
2 |
1
|
5-7 |
'++'
| PLUSPLUS | 36437 --
85 |
1
| 8-12 |
'0x01'
|
NUMBER
| 36354 --
2 |
1
| 12-12
|
'\n'
| NEWLINE |
36356 -- 4 |
2
| 0-0 |
''
| ENDMARKER | 36352 --
0 |
----------------------------------------------------------------------'
<token]
1
hx>
|
The tokenstream is perfect.
Just note a curiosity
here. The Python interpreter will execute 0x00 ++ 0x01 as 0x00 +
(+0x01) which evaluates to 1. This happened because EasyExtend can't
compile CSTs in interactive sessions. This isn't so much a limitation
of EasyExtend but Python ( unless you prove the opposite ). So
EasyExtend has to unparse
the
transformed expression / statement first and sends the unparsed source
to the compiler. The unparsed source is a valid Python expression and
will therefore be executed.
This is the reason why
you can't rely on the console only. When you pass a file test.phx to run_hex_langlet.py
EasyExtend will pass the CST directly to the bytecode compiler which
will raise an error. But you can't rely on this positive behaviour on
the console as well. Just checkout expressions 1 ++ 1, 1+++1, 1++++1,
etc. on an ordinary Python prompt and on the hex_langlet prompt with
activated -t option.
Another important remark
about the
displayed table. In the Token Id column you see two numbers separetated
by two hyphens. These are node
ids.
A node id is a numerical value characterizing a terminal or
non-terminal uniquely within a langlet. The node id for NUMBER is 36354
in my
hex_langlet. The node id of NUMBER in your
hex_langlet might be entirely different. It doesn't matter. The
absolute value has no meaning at all. What matters are two things.
First it is the range
of node
id's. The range has always a width of 512. 256 node ids are reserved
for terminals and 256 are reserved for non-terminals. The range is
characterized by an offset. You can determine the offset by typing
LANGLET_OFFSET in your hx> console. In my case I get
This
number is always dividable by 512. The other number that matters is the
rest of division
by 512. This is the second number in the Token Id column: 36354 % 512 = 2, 36437 % 512
= 85 etc. These
numbers are so important because they are those the Python compiler
actually understands. Suppose you have a parse tree for a
normal
Python statement e.g. a
= 2+5
in the hex_langlet. The Python compiler can't do anything with the
parse tree because it can't dispatch the node ids to actions in the
compilation loop. What EasyExtend does is to project
the CSTHex onto a CSTPy
by taking the rest of division by 512 for each node id in CSTHex.
That's like shifting the CSTHex
by a constant amount. It also explains the constraints on the range:
256 + 256 is the partition the Python compiler uses for terminals +
non-terminals.
3.1.2
CSTs before
transformation
Inspecting
token streams is a very first step but usually not the only one. The
second step is to examine the parse tree just like it is produced by
the Trail parser. The option being defined is -b. b stands
for before
transformation.
hex_langlet>python
run_hex_langlet.py -b
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00 ++ 0x01
[cst-before>
file_input
-- S`36609 -- 257
stmt -- S`36618 -- 266
simple_stmt -- S`36619 -- 267
small_stmt -- S`36620 -- 268
expr_stmt -- S`36621 -- 269
testlist -- S`36678 -- 326
test -- S`36655 -- 303
...
concat_expr -- S`36693 -- 341
expr -- S`36661 -- 309
...
numbers -- S`36692 -- 340
NUMBER -- T`36354 --
2 L`1
0x00
PLUSPLUS -- T`36437 --
85 L`1
++
expr -- S`36661 -- 309
...
numbers -- S`36692 -- 340
NUMBER -- T`36354 --
2 L`1
0x01
NEWLINE -- T`36356 --
4 L`1
'\n'
ENDMARKER -- T`36352 --
0 L`2
''
<cst-before]
1
hx>
|
The tree is actually a little
bigger but some
nodes are omitted in the display because they somewhat obscure the
relevant information. We see the nodes we have defined : numbers, PLUSPLUS and concat_expr.
Those
nodes can be highlighted using the -m
option. The -m option has an argument that is a comma separated list of
node names.
hex_langlet>python
run_hex_langlet.py -b -m PLUSPLUS,numbers
_________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (
_________________________________________________________________________
hx>
0x00 ++ 0x01
[cst-before>
file_input
-- S`36609 -- 257
stmt -- S`36618 -- 266
simple_stmt -- S`36619 -- 267
small_stmt -- S`36620 -- 268
expr_stmt -- S`36621 -- 269
testlist -- S`36678 -- 326
test -- S`36655 -- 303
...
concat_expr -- S`36693 -- 341
expr -- S`36661 -- 309
...
numbers -- S`36692 --
340
<-------
NUMBER -- T`36354 --
2 L`1
0x00
PLUSPLUS -- T`36437 --
85
<-------
++
expr -- S`36661 -- 309
...
numbers -- S`36692 --
340
<-------
NUMBER -- T`36354 --
2 L`1
0x01
NEWLINE -- T`36356 --
4 L`1
'\n'
ENDMARKER -- T`36352 --
0 L`2
''
<cst-before]
1
hx> |
If
you want to display the elapsed node symbols use the
--full-cst
option additionally to the -b option.
3.1.3
CSTs after transformation
The next option we
demonstrate is the option -a. a
stands for after
transformation.
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python
run_hex_langlet.py -a
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00 ++ 0x01
[cst-after>
file_input
-- S`257 -- 257
stmt -- S`266 -- 266
simple_stmt -- S`267 -- 267
small_stmt -- S`268 -- 268
expr_stmt -- S`269 -- 269
testlist -- S`326 -- 326
test -- S`303 -- 303
...
concat_expr -- S`341 --
341
<-------
expr -- S`309 -- 309
...
numbers -- S`340 --
340
<-------
NUMBER -- T`2 -- 2 L`1
0x00
CONCAT -- T`85 --
85
<-------
++
expr -- S`309 -- 309
...
numbers -- S`340 --
340
<-------
NUMBER -- T`2 -- 2 L`1
0x01
NEWLINE -- T`4 -- 4
L`1
'\n'
ENDMARKER -- T`0 --
0 L`2
''
<cst-after]
1
hx> |
All
symbols that were not transformed are highlighted. This is an
indication that this CST won't generally compile and you shall read the
arrow markers as warnings.
3.1.4
CST validation
The next option is the -v
option. Activate
the -v
option when you want to validate the final CSTPy
against the Python grammar. Using option -a is useful when you want to
show all nodes that are yet untransformed. Using -v does not display
all nodes that go wrong but it can detect more subtle errors in the
target CST like wrong tree structures.
C:\lang\Python25\Lib\site-packages\EasyExtend\langlets\hex_langlet>python
run_hex_langlet.py -v
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00 ++ 0x01
[cst-validation-test>
file_input
-- S`257 -- 257
stmt -- S`266 -- 266
simple_stmt -- S`267 -- 267
small_stmt -- S`268 -- 268
expr_stmt -- S`269 -- 269
testlist -- S`326 -- 326
test -- S`303 -- 303
or_test -- S`304 -- 304
and_test -- S`305 -- 305
not_test -- S`306 -- 306
comparison -- S`307 -- 307
concat_expr -- S`341 -- 341
<==== No grammar
conformance!
---->
Expected node(s): { expr -- S'309 }
expr -- S`309 -- 309
xor_expr -- S`310 -- 310
and_expr -- S`311 -- 311
shift_expr -- S`312 -- 312
arith_expr -- S`313 -- 313
term -- S`314 -- 314
factor -- S`315 -- 315
power -- S`316 -- 316
atom -- S`317 -- 317
numbers -- S`340 -- 340
NUMBER -- T`2 -- 2 L`1
0x00
CONCAT -- T`85 -- 85
L`1
++
expr -- S`309 -- 309
xor_expr -- S`310 -- 310
and_expr -- S`311 -- 311
shift_expr -- S`312 -- 312
arith_expr -- S`313 -- 313
term -- S`314 -- 314
factor -- S`315 -- 315
power -- S`316 -- 316
atom -- S`317 -- 317
numbers -- S`340 -- 340
NUMBER -- T`2 -- 2 L`1
0x01
NEWLINE -- T`4 -- 4
L`1
'\n'
ENDMARKER -- T`0 --
0 L`2
''
csttools.py:320:
GrammarConformanceWarning: CST doesn't match target langlet grammar.
<cst-validation-test]
1
hx> |
These
aren't news for us. We haven't yet transformed the CST and don't expect
the projected CSTPy
being conform with the grammar. Note that the validation process is
based on Trail. Using the Python grammar and the input node comparison the
follow set of comparison
contains only the node { expr -- S'309 }.
Checkout the CST validator
when typing 1, '1' etc.
3.1.5
Python source code after transformation
Each CST can be unparsed. You might display the unparsed CST using the
option -p
for python.
Restart the console and take
a look:
EasyExtend\langlets\hex_langlet>python
run_hex_langlet.py -p
__________________________________________________________________________________
hex_langlet
On
Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit
(Intel)]
__________________________________________________________________________________
hx>
0x00 ++ 0x01
[python-source>
0x00 ++ 0x01
<python-source]
1
hx> |
3.1.6
Summary
We summarize shortly the options we have
used. Of course you can combine any of them arbitrarily.
-a |
show
CST after transformation |
-b
|
show
CST before transformation
|
-p |
show unparsed
python code after transformation |
-t
|
show
token sequence
|
-v |
CST validation |
-m
node_name[,node_name]* |
mark
nodes in CST with node(s) name given by the parameters of the -m
option. If multiple node names are used they have to be comma
seperated.
|
Exercise:
Replace
the numbers non-terminal by a more specific hexnumbers non-terminal
#
Grammar.ext of hex_langlet
...
atom:
'('
[testlist_gexp] ')' | '[' [listmaker] ']' | '{' [dictmaker] '}' | '`'
testlist1 '`' |
NAME | hexnumbers | NUMBER
| STRING+
hexnumbers: Hexnumber+
|
and checkout what
happens. Is hexnumbers
always selected when building expressions 0x00, 0x00 ++
0x01, 0x00 0x01
... ?
4.
The LangletTransformer
4.1
Specify
the numbers
transformation
A numbers
nonterminal has always the structure
[symbol.numbers, [token.NUMBER, "0x00", line-no], [token.NUMBER,
"0x01", iine-no],...].
The
transformation of numbers
is specified by those two rules:
- When
a single NUMBER
terminal is found and
the value starts with
"0x"
it shall be replaced by a Hex object. Otherwise it shall not be
translated ( e.g. "1" shall not translated into a Hex ).
- When a blank separated sequence of NUMBER terminals of at
least length = 2 is found each terminal value
shall be translated into a Hex object independent of all prefix
considerations.
4.2
The langlet.py module
The canonical langlet.py
layout is rather sparse. The conf module contains fixed langlet
specific settings and all imports other than csttools. The
transformation of CSTHex is
encoded in the LangletTransformer
class of langlet.py.
We ignore the other transformers. Finally
there is the __publish__ attribute. Each name found in the __publish__
list is the name of a langlet attribute. Each of the corresponding
attributes is added to the builtin namespace. This makes them truly
visible globally.
from
conf import*
# langlet
specific modules and
settings
from EasyExtend.csttools import* # tools used to
manipulate
CSTs
class LangletImporter(eeimporter.Importer):
'''
Defines langlet specific
import behaviour.
'''
class
LangletTokenizer(eetokenizer.Tokenizer):
'''
Defines langlet specific
token stream processor.
'''
class LangletTransformer(Transformer):
'''
Defines langlet specific transformations
'''
__publish__ = []
|
In the first step we add a transformation
rule to our
LangletTransformer
which is a method named after the terminal or nonterminal that shall be
transformed. To mark it as a transformation rule it has to be decorated
with the @transform decorator. Uncommenting the @transform decorator
makes it invisible for node dispatch. We call methods decorated by
@transform also node
handlers. Each node handler keeps a single argument.
class
LangletTransformer(Transformer):
'''
Defines langlet specific transformations
'''
@transform
def numbers(self, node):
'''
numbers: NUMBER+
'''
|
As an additional aid we insert the grammar
rule according
to the node handler as a comment. This is just a
convention and
has no functional meaning.
4.3
Find
subnodes
Next we check the sequence length of NUMBER
terminals in numbers.
If
the length equals 1 the prefix of the
first NUMBER
will be inspected according to our spec.
class
LangletTransformer(Transformer):
'''
Defines langlet specific transformations
'''
@transform
def numbers(self, node):
'''
numbers: NUMBER+
'''
_numbers = find_all(node,
token.NUMBER) # find_all returns all NUMBER nodes
found as
# subnodes of node
if len(_numbers) == 1:
if
_numbers[0][1].startswith("0x"): #
take NUMBER and inspect the shape
# implement case
of
0xYYYY...
else:
# implement case of decimal or octal number
else:
#
implement case of sequence of numbers
|
Searching for subnodes of a node is done with the functions find_node
and find_all.
-
find_node( |
cst,
node_id, level=10000) |
- Seeks for a node
with id = node_id
in cst. It
returns the first
node found in a search is depth-first search or None.
-
The
maximal depth of the searched cst is constrained by level which has a
high value by
default. If level = 1 only
- the immediate subnodes
of cst will be inspected.
-
find_all( |
cst,
node_id, level=10000) |
- Seeks for all
nodes with id
= node_id
in cst.
Return a list of these nodes.
If no node is found the empty list is returned. The search mode is
depth-first. The
maximal depth of the searched CST is constrained by level which has a
high value by
default. If level = 1 only the immediate subnodes of cst will be
inspected.
Most of the time we need just a few search operations to keep all
important data to start or constrain a transformation.
4.4
Create new CSTs
Building new CST nodes using
elementary CST builder functions is possible. In fact all CST synthesis
actions relies on them. Using them isn't concise though. EasyExtend
contains another higher level API of functions starting with a big
"CST_". They are defined in cstgen.py and they provide simplifications
when creating new nodes.
if
len(_numbers) == 1:
if
_numbers[0][1].startswith("0x"):
return CST_CallFunc("Hex",[_numbers[0]]) # returns
a node
of type symbol.power.
else:
return
atom(_numbers[0])
#
do not
wrap number by Hex and substitute
else:
#
the
numbers nonterminal by an atom
# implement case
of sequence of
numbers
|
The
CST_CallFunction can also be applied to argument lists.
if
len(_numbers) == 1:
if
_numbers[0][1].startswith("0x"):
# keep a NUMBER and inspect its form
return CST_CallFunc("Hex",[_numbers[0]]) # returns
a node
of type symbol.power.
else:
return
atom(_numbers[0])
# do not wrap number by Hex and substitute
else:
return
CST_CallFunc("Hex",_numbers)
|
The _numbers are passed as arguments to the Hex constructor.
Exercise:
use option -p
to inspect the translated code. Also checkout options -b, and -a to get
a feeling for raw and transformed CSTs. Python will obviously complain
about not finding Hex. That's o.k. at this stage since we do not have
added yet functionality to our transformation. This will be done in the
next step.
4.5 The
Hex class
We implement a rudimentary Hex class.
Besides __init__ and __repr__ only the methods __concat__ and
__rconcat__ which implement functinality of the PLUSPLUS
terminal ( the
"++" operator ) are defined. Additional
methods can be added on your own behalf.
class
Hex(object):
def __init__(self, *args):
self.bytes = []
for arg in args:
if
isinstance(arg, (int,long)):
self.bytes.append(self.create_from_int(arg).bytes)
elif
isinstance(arg, Hex):
self.bytes.append(arg.bytes)
else:
raise TypeError,"cannot convert to Hex object: '%s'"%arg
self.bytes =
sum(self.bytes,[])
@classmethod
def create_from_int(cls, i):
assert
isinstance(i,(int,long)), "type of i is %s"%type(i)
h = Hex()
h.bytes = []
v = i
if v == 0:
h.bytes = [0]
else:
while v:
v,r = divmod(v,0x100)
h.bytes.insert(0,r)
return h
def __len__(self):
return len(self.bytes)
def __concat__(self, other):
if isinstance(other, Hex):
h =
Hex()
h.bytes = self.bytes+other.bytes
return h
else:
h =
Hex()
h.bytes = self.bytes+Hex(other).bytes
return h
def __rconcat__(self, other):
return
Hex(other).__concat__(self)
def
__repr__(self):
return "
".join(["0x%02X"%item for item in self.bytes])
|
4.5.1
Define
operator_concat
The
Hex object implements only the protocol methods __concat__ and
__rconcat__. Now we have to define the transformation that is called
whenever a '++' operator is detected.
def
operator_concat(*args):
'''
Defines behaviour of
'++'.
'''
a,b = args[:2]
if a in (basestring, list, tuple):
res = a+b
else:
if hasattr(a,"__concat__"):
res
= a.__concat__(b)
elif
hasattr(b,"__rconcat__"):
res
= b.__rconcat__(a)
else:
raise TypeError, ":unsupported operand type(s) for ++: '%s' and
'%s'"%(type(a), type(b))
if args[2:]:
return
operator_concat(*(res,)+args[2:]) # recursion on
sequences
return res
|
In the final
transformation step we have to implement the transformer of the
concat_expr
node.
class
LangletTransformer(Transformer):
'''
Defines langlet specific transformations
'''
@transform
def concat_expr(self, node):
'''
comparison: expr
(comp_op expr)* | concat_expr
concat_expr: expr ('++' expr)+
'''
exprs = find_all(node, symbol.expr, level = 1)
# extract expressions
concat_call =
CST_CallFunc("operator_concat", exprs) # operator_concat(exp1, ...,
expk)
return
any_node(concat_call, symbol.comparison)
# wrap concat_call back into a
# comparison
node
|
Exercise: we
used any_node that wrapped a node of type symbol.power into
symbol.comparison. Checkout other wrapper functions like any_test and
any_expr and observe what happens.
4.6
The
__publish__
attribute
Whenever a list of numbers or a hexadecimal number
is typed into a hex_langlet script it will be immediately converted
into
a Hex object and whenever a "++" operator is applied it will be
converted
into an operator_concat
function call. But neither Hex
nor the
operator_concat
function are globally visible unless imported
explicitely. To ensure global visibility without module import of the
language we intend to use we can put Hex and
operator_concat
into root namespace
__builtins__. To simplify this aspect even more each langlet has a
__publish__
attribute that keeps a list of names. In case of the hex_langlet we
define:
__publish__ = ["Hex", "operator_concat"]
This
is not unlike the optional __all__
module attribute that regulates publically visible names on module
imports.
5.
hex_langlet
modules
We have completed our langlet
definition.
Now
we want to write scripts using the specific hex_fiber syntax and
execute them as fiber modules.
#
module demo.phx
# ----------------
def add_padding(arg, pad = 0x00, size = 8, right_pad = True):
res = arg
for i in range(size-len(arg)%size):
if right_pad:
res = res
++ pad # right padding
else:
res = pad
++ res # left padding
return res
if __name__ == '__main__':
print add_padding( 0x01 0x02 0x03 )
print add_padding( 0x01 0x02 0x03 , pad
= 0x80)
print add_padding( 0x01 0x02 0x03 , pad
= 0x80,
right_pad = False)
|
The
function add_padding adds leading or trailing bytes to a Hex
object. Now we call demo.py from fiber.py.
c:\lang\Python24\Lib\site-packages\EasyExtend\fibers\hex_fiber>python
fiber.py demo.phx
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03
|
There
is something remarkable about the result, which might not be too
obvious on the first sight.
The if
__name__ == '__main__' :... section of demo.py
is called although demo is not __main__
in this case but langlet!
The
answer can be found when we look at the translated code:
hex_langlet>python
run_hex_langlet.py -p --re-compile demo.phx
[python-source>
import EasyExtend.langlets.hex_langlet as __langlet__
import
EasyExtend.eecommon ; EasyExtend.eecommon.init_langlet(__langlet__)
def add_padding(arg, pad = Hex( 0x00 ) , size = 8, right_pad = True):
res = arg
for i in range(size-len(arg)%size):
if right_pad:
res
=operator_concat(( res) ,( pad ) ,)
else:
res
=operator_concat(( pad) ,( res ) ,)
return res
def __like_main__ ( ):
print add_padding( Hex( 0x01, 0x02, 0x03
))
print add_padding( Hex( 0x01, 0x02, 0x03
) , size =
11, pad = Hex( 0x80 ))
print add_padding( Hex( 0x01, 0x02, 0x03
) , pad =
Hex( 0x80 ) , right_pad = False )
if __name__ == '__main__':
__like_main__ ( ) ;
<python-source]
0x01 0x02 0x03 0x00 0x00 0x00 0x00 0x00
0x01 0x02 0x03 0x80 0x80 0x80 0x80 0x80 0x80 0x80 0x80
0x80 0x80 0x80 0x80 0x80 0x01 0x02 0x03
|
During
code transformation the __like_main__()
function has been created. The idea of __like_main__ is that it can
be called from the langlet and executes all statements in the function
body that are otherwise defined in the if
__name__ ==
'__main__':
block.
Two additional remarks.
1.
We introduced a new command line option --re-compile.
EasyExtend doesn't translate the phx module again if the compiled pyc
version is more recent. So we have to enforce re-compilation to
highlight aspects of the transformation process.
2.
The top level import statements are generated as well. Only with them
it was possible to run demo.pyc directly from the command line:
hex_langlet>
python
demo.pyc
Since demo.pyc can be executed directly from Python you can
also import demo from arbitary Python modules.
6.
Epilogue
Walking through the hex_langlet example
was
rather straightforward if one knows the system. Struggling with syntax
definitions or wrong transformations is the norm though. I want to
mention just briefly a few more commmand line options that aid the
development process and an alternative to the @transform
decorator we used in the LangletTransformer.
Two
additional command line options are
--dbg-lexer |
show
lexer trace |
--dbg-parser
|
show
parser trace
|
My recommendation is to use them as a last
resort in a normal development session. They show some aspects of the
interanl working of lexer and parser and they helped me identifying
errors in Trails NFA expansion mechanism. So they are somewhat
equivalent to core dumps or assembly inspection.
More
useful in casual work is the replacement of the @transform
decorator by the @transform_dbg
decorator. The @transform_dbg takes
a string argument as a parameter. This argument contains two letter
commands describing various display options.
@transform_dbg("ci
co si") # ci = input cst
# co = output cst
# si = input source
The
complete overview of transform_dbg can
be either found in the API docs, in the function definition
(
EasyExtend/eetransformer.py
) or here.
|