2009-03-01
Trails Of EasyExtend
This News Site has turned into an almost regular but somewhat
impoverished blog.
I've taken advantage now from wordpress and started a blog called Trails Of EasyExtend. So
nothing is lost and the fun can continue.
2009-02-04
The postlexer
Once again C++ parsers are in
the news. Well, these are old news and the quoted article might be
hopelessly dated. However redditors provide useful links to projects
like Pork.
My impression of this article is that the author is somewhat confused.
He states that parsing C++ is hard because it isn't LALR and later
withdraws his jugdement and says syntactical analysis is easy though
not trivial. Semantic analysis is the hard stuff. But the parsing
process depends on semantic analysis. One cannot parse C++ without
knowing whether A * B denotates a product of values or a type
declaration. There might be other such cases and to know about them
would have been invaluable information.
I do not want to consider the C preprocessor here and just expect that
the C++ source being parsed is the output of the
preprocessor. EasyExtends approach to deal with context sensitivity is
called parser sandwiches. A
parser sandwich has three layers.
- A lexer. Reads a source string and outputs a token stream.
A lexer can use a context free grammar or some regular expression
engine. The basic assumption is that it is context free.
- A postlexer. Reads a token stream and returns a possibly
modified token stream.
- A parser. Reads a token stream and creates a parse tree.
The parser shall be context free.
Unlike lexer and parser the postlexer is context and language
dependent. It is the single place where EE processes significant
whitespace for example. The INDENT / DEDENT / NEWLINE token are not the
product of the lexer but the postlexer. The postlexer is freely
programmable in imperative style using Python.

Parser sandwich
2008-11-26
Here is a PDF
presentation in which computing scientists Bernard Stepien and Liam
Peyton compared Python with TTCN-3. TTCN-3 stands for Testing and Test Control Notation
and it is domain specific language specifically designed for writing
tests in the domain of telecommunication. In almost any category poor
Python
just loses. There are some annoying aspects in the presentation like
containing not even conistent Python code at several places ( examples:
on slide 42 an "else if" construct is used instead of the
correct "elif". On slide 13 fields are not assigned to self which leads to code that will
just fail to run ). But leaving this
carelessnes aside there is something that estranges an experienced
Python programmer. The way how Stepien and Peyton compare TTCN-3
language constructs to those of Python is odd. Take slide 12 for
example:
TTCN-3
templates to Python
• TTCN-3 templates could be mapped to
Python object instances.
• However, there are serious limitations using the above technique with
Python.
• Python objects are practically typeless
|
Then the presentation goes over 18 slides just to explain how bad plain
Python classes are for the same purposes as TTCN-3 templates. They are
correct about it but who cares? If one wants to reproduce the behaviour
of
templates one creates a particular type in Python.
In the rest of the article I demonstrate how to create TTCN-3 like
templates in Python
class
Template(object):
def __init__(self):
self._frozen = False
self._fields =
[]
def __setattr__(self, name, value):
if name in ('_fields',
'_frozen'):
object.__setattr__(self, name, value)
return
if self._frozen:
raise RuntimeError("Cannot set attribute value on a frozen template")
if name not in self._fields:
self._fields.append(name)
object.__setattr__(self,
name, value)
def __eq__(self, other):
if
len(self._fields)!=len(other._fields):
return False
else:
for
f_self, f_other in zip(self._fields, other._fields):
val_self = getattr(self, f_self)
val_other = getattr(self, f_other)
if val_self != val_other:
return False
return True
def __ne__(self, other):
return not self.__eq__(other)
def freeze(self):
self._frozen = True
def clone(self):
T = Template()
T._fields = self._fields[:]
for field in T._fields:
setattr(T, field, getattr(self, field))
return T
|
The basic idea of this class called Template
is to manage an internal _fields attribute that contains the attribute
names created by __setattr__ dynamically. This is done to preserve the
order of the fields on comparison. In the current implementation there
aren't any fields marked as optional
and it will take additional effort to introduce them and modify __eq__.
But it is easy to demonstrate the behaviour of wildcards. First we have
to introduce another class for this purpose:
class
Wildcard(object):
def __eq__(self, other):
return True
def __ne__(self, other):
return False
wildcard = Wildcard()
|
Whatever object X
is compared to wildcard
it is always X == wildcard ==
True. So we can create
templ_1
= Template()
templ_1.field_1 = wildcard
templ_1.field_2 = "abc"
templ_2 = Template()
templ_2.field_1 = "xyz"
templ_2.field_2 = wildcard
print templ_1 == templ_2 # -> True
|
A Pattern class can be created the same way as the Wildcard class
class
Pattern(object):
def __init__(self, regexp):
self.pattern =
re.compile(regexp)
def __eq__(self, other):
try:
return bool(self.pattern.match(other))
except TypeError:
return True
def __ne__(self, other):
return not self.__eq__(other)
|
An '==' comparison of a Pattern instance with a string will match the
string against the pattern and yield True if the string could be
matched and False otherwise.
So it's perfectly possible to reproduce most aspects of TTCN-3
templates by just a few lines of Python code. On the downside of TTCN-3
this isn't possible the other way round. No matter how hard you work in
TTCN-3 it is impossible to create new types showing interesting
behaviour.
2008-11-24
EasyExtend and Jython
I checked out EasyExtend with Jython. I thought the only relevant
adaption that has to be made is consequently emitting Python source
after transformation since I'm not inclined to shoehorn CSTs into
Jythons ( or ANTLRs ) internal AST format. But then I got:
java.lang.ClassFormatError:
Invalid method Code length 124516 in class file parse_nfa$py
This stems from an internal limitation of the JVM that permits method
sizes of at most 64K. Welcome to 1980! Of course EasyExtend doesn't
define methods of this size. It's just that the parse-tables ( nfas )
exceed those limits and Jython maps those tables onto methods. I dare
to say that even breaking the whole table into parts doesn't fix the
problem because expansion leads to huge single NFAs occasionally. I'm
not going to fix this and therefore this is the end of the story of EE
on the JVM.
2008-11-17
Recursive embeddings
Recursive embeddings are the single most challenging aspect of
EasyExtend and Trail. It's also one of the topics that advances top
down parsers in general. So what comes might be an original
contribution to the practice of building parser generators.
A recursive rule contains its own rule-symbol also in the rule defining
part on the RHS. This isn't always a problem. It is one only in case of
a conflicting decision. Take a look at the following rule:
There is no conflict that has to be resolved. Same goes with
of course which has the exact same structure. But now combine them into
C:
A | B
A: '[' A ']' | 'a'
B: '[' B ']' | 'b'
|
and we notice an ambiguity. Finding a left bracket "[" in the token stream
gives no indication whether A or B is the correct rule that has to be
selected to build the parse tree. The only strategy used by Trail to
resolve ambiguities is embedding/expansion. But this won't help here
because expansion was cyclic and we'd run into an infinite expansion.
All of this happens on the level of NFAs and we need to examine one
before we can proceed:
(A,
0, A): [('[', 1, A), ('a', 4, A)],
(A, 2, A): [(']', 3, A)],
('[', 1, A): [(A, 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],
|
The start symbol is (A, 0, A). After start the NFA branches into a
bracket '[' and a letter 'a'. The bracket trace leads to (A, 2, A).
Since A is found in the first entry of (A, 2, A) the NFA is called
recursively. The NFA only terminated when an EXIT symbol (None, '-', A)
is entered.
When we try to embedd this NFA in C together with the NFA of B we run
into a clash because we have [(A, 2, A) , (B, 2, B)] among the
transitions. Clashes can be resolved by means of expansion but only if
we don't run into cycles as mentioned above. But in this case we have a
cycle. So what can we do? I'll describe no an NFA transformation and
explian why it works in the context of Trail:
First we split (A,2,A) into a pair (A,'(', 2, A) , (A, ')', 2, A) which
affects the NFA in the following way:
(A,
0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A):
[(']', 3, A)],
(A, '(', 2, A):
[(A,
')', 2, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],
|
Now we give an interpretation of (A,'(', 2, A). It shall act like a
goto to the start of the NFA. This is a bit like a tail recursion
elimination. Now it's not the start symbol but the first set it shall
link to.
(A,
0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A):
[(']', 3, A)],
(A, '(', 2, A):
[('[',
1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],
|
The complementary label replaces now the EXIT symbol.
(A,
0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A):
[(']', 3, A)],
(A, '(', 2, A):
[('[',
1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(A, ')', 2, A)],
('a', 4, A): [(A, ')', 2, A)]}],
|
Now we have lost the EXIT symbol but it's clear that it can only be the
successor of the clsoing label:
(A,
0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A):
[(']', 3, A), (None, '-', A)],
(A, '(', 2, A):
[('[',
1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(A, ')', 2, A)],
('a', 4, A): [(A, ')', 2, A)]}],
|
This is already the final NFA. But there is something wrong here. When
you look at the transition [(']', 3, A),
(None, '-', A)] it
cannot be a real alternative because whether you choose ] or None
depends how often you entered (A,'(', 2, A) compared to (A, ')',
2, A). So we need a counter that matches opening against closing
braces. This isn't of course hard in itself but notice that at each
point in the automaton we can branch into different labels so we need a
counter for each trace. Fortunately it's easy in Trail because it uses
a spaghetti- or cactus
stack to store labels anyway. So all we need to do when we enter
the (A, ')', 2, A) is whether it can be balanced or not. If
count((A, ')', 2, A)) >
count((A, '(', 2, A))
then (None, '-', A ) is selected otherwise the parsing process
continues selecting (']', 3, A).
Grammars that are accepted by Trail
Here are some examples:
G:
A | B
A: "(" A ")" | a
B: "(" B ")" | b
|
G:
A | B | x
A: a b G* a
B: a c G* a
|
The latter is particular interesting because it describes mutual
recursion. Not only are A and B expanded in G but also G is expanded in
A and B.
Limitations
There is not yet a clear scope of applicability nor an analytical
treatment / classification. Some grammars are clearly rejected while
others might just not work right now. They can also be classified
according to the exception they raise.
Direct left recusion
A classical example for direct left recusion looks like
Those grammars are most easily detected and transformed. In this case
one better writes:
Many recursion locations
With "many recursion locations" I mean rules like this:
X
: "{" X "}" X | "{" "}" | "a" | "a" X
|
The rule itself isn't problematic at all but when you insert it into
G:
A | B
A: "(" A ")" | X
B: "(" B ")" | b
|
It produces wrong behaviour. Labels like (A, '.', IDX, A) are created
which are an abomination.
Unstoppable expansion cycles
In this case it is impossible to determine the next nid selection. I
implemented just a dumb overflow criterion that cancels the expansion
process. This criterion is related to the depth of the grammar. The depth of
the grammar is measured like this:
- depth(N)
= max(depth(K)) K in FirstSet(N)
- depth(T)
= 0 for T is terminal
- depth(Grammar) =
max(depth(N)) N is a rule nid
Notice that by computing the depth of the node we exclude immediate
left recursion in practice: an immediate left recursive rule is in its
own first set and we don't get a result. In case of recursive
embeddings the depth of the Grammar might be increased. If this
increase exceeds 10 times the depth of the original grammar an
exception is raised. The following grammar is not admissable according
to this rule:
GA:
GA1 | GA2
GA1: 'a' GA2
GA2: 'a' GA1 | 'b'
|
Summary
I'd like to have a better analytical insight into the recusive
embedding process. So far it works for some simple cases. Some of them
are even convenient. But we can give other simple examples of recusive
grammars that fail. I'm not really able to see a universal pattern
right now and I'm not even sure this has to be deeply investigated?
It's a good leap forward after all.
2008-10-08
About some grammars Trail can now deal
with in a sane way
Here is one:
R:
A | B
A: C A | D
B: C B | E
|
Expansion of A in R causes now a LeftRecursionWarning and
we get a final RuntimeError exception
RuntimeError:
More than ten expansion warnings issued. Expansion is terminated!
Solving the problem was finally easy. When a label L = (A, i, A) is
expanded in R one among the labels that are embedded has the form
(A, shift(j), A). In a next step this we expand this label and this
yields a new one (A, shift(shift(j)), A) . So all we need to cancel the
infinite expansion is to set a threshold for the index value i in (A,
i, A). Since a rule can be quite big the value shall not be too small,
e.g. 100. This threshold will be passed very soon.
2008-10-07
We also know there are known
unknowns; that is to say we know there are
some things we do not know. But there are also unknown unknowns - the
ones we don't know we don't know.
Donald Rumsfeld, 2002-02-12
About some grammars Trail cannot deal with in a sane way
Here is one:
R:
A | B
A: C A | D
B: C B | E
|
Trail will try to expand A or B in R and will expand forever because it
will never run into a left recursion of R which would lead to immediate
termination of expansion. Having rules that are right-recursive ( both
A and B ) alone isn't sufficient to avoid trouble. We could transform
rules into FunNets individually but then there is no hint on how to
expand them which is still necessary.
2008-10-03
With the use of cyTrail + psyco and caching techniques I took a little
pressure from Trails performance. I'm still dissatisfied with lexer
performance and I've talked about ideas on
how to resolve performance issues but I'm not anyhow close to an
implementation.
Left recursion again...
It buggs me that Trail has obvious limitations in parsing expressions
according to very simple rules. For example the following rule
leads to warning messages when Trail produces NFAs. This is because S
has to be expanded within itself and left recursion is prohibited.
A practical example for what can't be parsed using Trail can be given
as well:
xml_element:
'<' qname '>' ( ANY | xml_element )* '</' qname '>'
|
Advancing Trail
Suppose we eliminate S on the RHS of the rule in a state transition
diagram. A first attempt to draw the diagram might look like this:

It is insufficient though because sequences like
a b a b a c None
a b a c a c None
are valid while they are clearly not recognized by the grammar of rule
S.
In a second attempt we draw a two dimensional diagram by cloning the
basic sequence and linking the sequences

This spatial diagram precisely expresses what we were looking for. But
we can't realize the infinite diagram by a fixed set of nodes of the
kind we've used in a Trail NFA description. So we have to generalize
Trail NFAs by a network of connected functions. The up/down shift in
levels of the diagram is expressed using a single parameter.
Ansatz:
f0
= lambda k: (('a',0,'S'), [(f1, k)])
f1 = lambda k: (('b',1,'S'), [(f2, k),(f0,k+1)])
f2 = lambda k: (('a',2,'S'), [(f3, k)])
f3 = lambda k: (('c',3,'S'), [(f4, k)])
f4 = lambda k: (((None, '-', 'S'),[]) if k == 0 else
[('a',2,'S'),(f3,k-1)])
|
Each lambda expression takes a single parameter and returns a Trail NFA
label and a list of lambda expressions and parameters that have to be
passed to continue. The list of successor functions can be empty and
cause the moves coming to halt.
| a b a b a c
a c None |
fo(0)
|
('a',
0,'S')
|
(f1,
0)
|
a
|
f1(0)
|
('b',1,'S')
|
(f2,0),(f0,1)
|
b
|
f2(0),f0(1)
|
('a',0,'S'),('a',2,'S') |
(f1,1),(f3,0)
|
a
|
f1(1),f3(0)
|
('b',1,'S'),('c',3,'S') |
(f2,1),(f0,2),(f4,0)
|
b
|
f2(1),f0(2)
|
('a',0,'S'),('a',2,'S') |
(f1,2),(f3,1) |
a
|
f1(2),f3(1)
|
('b',1,'S'),('c',3,'S')
|
(f2,2),(f0,3),(f4,1)
|
c
|
f4(1)
|
('a',2,'S') |
(f3,0)
|
a
|
f3(0)
|
('c',3,'S') |
(f4,0)
|
c
|
f4(0)
|
(None,'-','S')
|
|
None
|
Open problems
Let be
with FirstSet(A) /\ FirstSet(a) != {}. How are expansion and the NFA
-> FunNet transformation related to each other? The whole point of
the FunNet transformation is that we can't apply expansion on S! A
possible appraoch is to solve a specialized problem first, perform the
necessary substitutions and finally transform the NFA into a FunNet.
Two questions
- Is there are universally applicable specialization?
- How has the subsitution to be performed with the expanded
NFA s.t. the original general form can be restored?
2008-09-19
Langscape
EasyExtend 3.1 will contain a new package called langscape and a new module called
eelangscape.py placed in the root directory. With langscape it will be
possible to create new language features on the fly without prior
definition of Grammar.ext and Token.ext files. This departure from the
classical EE notwithstanding the langscape operators and API rely
entirely upon the established infrastructure. The most simple part has
been finished already and it is possible to roughly the following
from
EasyExtend.eelangscape import SyntaxExtension, binding, binding_strength
def concat_handler(a, b):
'''
Define concatenation between two objects
'''
if isinstance(a, int) and
isinstance(b, int):
return int(str(a)+str(b))
elif isinstance(a, list) and
isinstance(b, list):
return a+b
elif isinstance(a, tuple)
and isinstance(b, tuple):
return a+b
raise TypeError("Arguments
cannot be concatenated")
syntax = SyntaxExtension()
syntax.add_binary_operator("++", "PLUSPLUS", binding( binding_strength
== '_ + _'), concat_handler)
|
That's it. It doesn't
need more to define a binary ( infix ) operator anymore.
It's not yet clear to me yet how to advance this simplicity beyond the
most simple case ( shown here ). This one works well because we just
have to replace expressions of the form "a ++ b" by
"concat_handler(a, b)" expressions. This can be done uniformly for all
binary operators. But when facing more complex expressions like
statements I wonder if we end up defining grammar rules and performing
CST transformations again just in a less general setting? Maybe one way
around this is to get into a PHP-ish API: a large flat function list
that covers most of the functionality.
2008-09-18
Michal Wallace mentioned EasyExtend in his blog
post among a few other quite complex and more famous Python
projects with a particulary sucking documentation. Well, I just want to
say that documentation evolves as projects mature. I don't spend too
much time into refining docs when I'm not confident with the system and
also do major re-workings and perspective changes.
2008-09-13
P4D Langlet 1.2.4 released
P4D 1.2.4 contains some cleanups regarding subelement rules. Checkout
the rewitten section 2.7 in the P4D docs
that deals with those issues.
Flexutils
In the meantime I'm busy learning Adobe Flex and work through it using
P4D. Not sure who else refuses to use Adobe FlexBuilder when working
Flex? Flex and particular MXML invites to use an IDE and the single one
that is capable to deal with MXML properly stems from Adobe. Clever. I
want to use Flex like a scripting language containing only P4D elements
and ActionScript code.
Example 1: consider the HelloWorld
MXML snippet. Store it as HelloWorld.mxml.
Now point to the directory containing HelloWorld.mxml and
type
p4d
--xml2p4d --flex HelloWorld.mxml
the result shall be a file named HelloWorld.p4d.
Open the file and look at its content:
from
EasyExtend.util.path import path
from EasyExtend.langlets.p4d.flexutils import mxcontrol
import EasyExtend.langlets.p4d.flexutils as flexutils
p4ddoc = mx:Application( verticalAlign="middle"
viewSourceURL="src/HelloWorld/index.html"
height="160"
horizontalAlign="center"
width="300"
xmlns:mx="http://www.adobe.com/2006/mxml" ):
mx:Panel( paddingRight="10"
title="My Application"
paddingBottom="10"
paddingTop="10"
paddingLeft="10" ):
mx:Label( text="Hello
World!", fontSize="24", fontWeight="bold" )
if __name__ == '__main__':
flexutils.exec_mxml(p4ddoc, "helloworld",
"HelloWorld.swf", path(__file__).dirname())
|
Now run the file as an ordinary P4D script:
p4d HelloWorld.p4d
P4D flexutils converts the p4ddoc element back to MXML. Then it runs
the Adobe AIR amxmlc compiler, creates a file called
helloworld-AIR-app.xml and finally executes this file using the adl
tool. The result shall be the following FlashPlayer box:

This procedure requires Flex 3 and AIR being installed on your system.
The command line tools have to be found. You don't configure any paths
in flexutils.p4d ( which is in the p4d langlet directory ) or elsewhere
e.g. in conf.py of the P4D langlet.
Example 2:
This time we embedd an AS3 handler in the MXML file.
<?xml
version="1.0"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml">
<mx:Script>
<![CDATA[
private function hello():void {
textarea1.text="Hello World";
}
]]>
</mx:Script>
<mx:Panel title="My Application"
paddingTop="10"
paddingBottom="10"
paddingLeft="10"
paddingRight="10"
>
<mx:TextArea
id="textarea1"/>
<mx:Button label="Submit"
click="hello();"/>
</mx:Panel>
|
The resulting p4d file looks like
from
EasyExtend.util.path import path
from EasyExtend.langlets.p4d.flexutils import mxcontrol
import EasyExtend.langlets.p4d.flexutils as flexutils
p4ddoc = mx:Application( xmlns:mx="http://www.adobe.com/2006/mxml" ):
mx:Script:
{**
private function hello():void {
textarea1.text="Hello World";
}
**}
mx:Panel( paddingLeft="10"
paddingRight="10"
paddingBottom="10"
paddingTop="10"
title="My Application" ):
mx:TextArea( id="textarea1" )
mx:Button( click="hello();",
label="Submit" )
if __name__ == '__main__':
flexutils.exec_mxml(p4ddoc, "helloworld",
"HelloWorld.swf", path(__file__).dirname()) |
Running the script we finally get:

2008-08-16
P4D Langlet 1.2.3 released
For more information see the P4D docs.
About the STOP symbol I talked about in my last news entry. Unlike most
other Trail features it was easy to implement.
2008-08-14
Messing around with ambiguities in Trail
Trail can be a quite powerful parser generator but sometimes it doesn't
let you express the obvious. Trail always does longest match and this
has implications sometimes. I realized this when defining lexical
analysis for Umlaute in Teuton. When I defined 'für' as a keyword I had
to be aware that 'für' can also be an arbitrary NAME from the
perspective of the lexer. So the lexer doesn't match 'für' as an own
entity but as a NAME instead. I resolved this by adding a blank: 'für' -> 'f'+'ü'+'r'+' '.
The string 'für ' is not a NAME anymore. This worked because the blank
is the single allowed character after 'für' but this depends on more
syntax.
What if we use a second exit symbol? When None is in
the current selection the parser is allowed to terminate and leave the
used NFA. Suppose now there is an explicit symbol STOP. Now we can have
a selection like ['x', 2345, None, STOP]. Here None can
be erased from the selection in case STOP was found. They have
precisely the same effect except that STOP was used explicitely in the
Token.ext specification. With STOP we can define
FUER: ('fuer' | 'für' ) STOP
Now NAME shall never match 'fuer' or 'für' anymore but it shall be
matched by FUER.
2008-08-13
EasyExtend 3.0.1 and P4D Langlet 1.2.2 released. Notice
that the distributed P4D package has been renamed because some people
complained that there is already an equally named project where P4D
stands for Python For
Delphi. IMO with P4D Langlet
there is enough context present for avoiding confusion.
Many thanks to Gerhard Häring who contributed patches for Teuton and EE
setup.py.
2008-08-10
The worst bugs that can be made in EasyExtend langlets are grammar bugs. I've found one in P4D
1.2. Here is the reason why
In P4D 1.2 Token.ext specifies following comment rule:
P4D_Comment:
'(' '*' ('*'|ANY)* '*' ')'
|
A P4D_Comment can't freely be used within Python code but only within
P4D elements. But this won't make the definition safe because the lexer
acts globally and this leads to a conflict when e.g. foo(*arg) is
called because the the syntactical context is not available for the
lexer and so it expects a closing "*)" like this foo(*arg*).
These bugs are terrible because they lead to incompatible changes of
the language. The only consolation I have is that P4D is out now for 6
weeks only.
In P4D 1.2.1 i changed this rule from parens to braces:
P4D_Comment:
'{' '*' ('*'|ANY)* '*' '}'
|
This is both compliant with Python 2.X and Python 3.0.
2008-08-09
Last week I released generator_tools 0.3.4. It is clear now that there
are bugs in generator_tools that won't ever be resolved ( see also the Caveats and
Limitations section of the docs ). However I've ideas how to
workaround some of the most severe limitations using EasyExtend
techniques. I've added a new module called enhanced_generators.py
that defines code transformations in an EE style. This module is not
ready to use right now and it will be introduced not before
generator_tools 0.4.
A new P4D release is out now. P4D 1.2 contains parts of my reified
wisdom from previous corporate projects, namely Bytelets. Unlike XML, Bytelets are
used to express binary data. The role of P4D data-structures is now
twofold: they simplify writing textual data that can be serialized as
XML as well as writing binary data that can be serialized as hexcode. I
found Bytelets ideal for testing applications that communicate on the
level of binary protocols. Moreover P4D 1.2 repurposes and extends
hexadecimal numbers which do not have much life on their own in Python.
Hex objects are now used to repesent hexadecimal literals and Hex
objects together with P4Ds tree shaped data structures constitute the
fundaments of Bytelets.
2008-07-28
Right now jut some intermediary notes about the three main projects.
1. generator_tools
The generator_tools package can't deal with Python 2.5 style yield
expressions. Once again jump generation heuristics have to be extended.
The current heuristics cause interpreter crashes due to stack
corruptions. Since I've seen no easy way to track them using CPythons
source and VS 6 I've written a toy Python interpreter in Python. This
way I could at least grasp the cause of the bug.
2. P4D
P4D 1.2 is in preparation. While P4D 1.1 was all about XML / P4D
conversions, P4D 1.2 is about Bytelets.
Bytelets are dataflow networks used to deal with binary data whose
canonical representations are BER encoded T(ag)L(ength)V(alue)s.
Bytelets are an attractive and flexible way to encode binary data. This
liberates P4D from being just another XML representation or
particularly bound to textual data formats.
3. Trail
Trail again. I did some perfomance optimizations by porting even more
parts of Trails algorithms to Cython. Cython is used a little more
properly now by exploiting Cythons C type declarations. The results are
o.k. but I'm not amazed by those additional speedups. The most
advantages I made were due to caching techniques rather than Python to
Cython transitions.
The speed of tokenization still depresses me. In fact I feel
tokenization is a bit underengineered i.e. not well adapted to the
target of producing a token stream. Generating parse trees instead
using a standard grammar ( i.e. Token.ext ) and merging streams of
terminals in order to get NAME, STRING or NUMBER token seems to be a
bit odd. It works fine for handling whitespace and other INTRON stuff
but creating lengthy sequences of token for each character makes the
algorithm inherently slow.
3.1 FastMatcher
I did some initial steps to move out of this trap. I've written a
_fastmatcher.dll in C++ and a
FastMatcher class in Python and a tiny amount of C glue code that lets
me use ctypes. Fast matcher takes an NFA description and information
about PseudoToken ( i.e. plain characters ) and builds an internal
table. This table is then used to match strings.
>>>
import EasyExtend.langlets.zero.lexdef.lex_nfa as lex_nfa
>>> fm = FastMatcher(lex_nfa.nfas[322],
lex_nfa.pseudo_token) # Rule 322 describes hexadecimal numerals
>>> fm.matchstr("0x"+"A8"*1000000, 0)
2000002
|
The method matchstr of FastNumber takes two arguments: a string and a
reading position. It returns the index of the first character after the
match. Here we passed a string of length 106+2
and FastMatcher recognizes it in 0.35s on my 1.5 GHz notebook. This is
slower that regular expression matching with "0x[0-9a-fA-F]". Actually
the regexp matches the expression in 0.03s and is therefore 10 times
faster. However a slight reorganization of the regexp into
"0x(?:[0-9]|[a-fA-F])" leads to the opposite result. Now the regexp
matcher needs 1.25s and is around 3 times slower than FastMatcher.
Since the majority of expressions are described using such branches I
expect that FastMatcher to be more appropriate for matching expressions
described by EBNF grammars. This still needs more investigation but it
is encouraging.
However it is yet not clear how to use FastMatcher besides applying it
to the most simple automata. More preparation is required:
- We need to know which token are really used by the parser.
If there are for example just NAME, STRING and NUMBER used in the
Grammar we don't need the luxury of 80 different token being described
in Token and Token.ext. This can be determined statically under the
premise that when Token/Grammar or Token.ext/Grammar.ext both change
the computation of the new parse_nfa precedes that of lex_nfa. This
isn't a big deal because changes apply automatically.
- Basins of attraction ( attractors ). Suppose we have to
distinct traces for two token types T1 and T2. A trace based parser
like Trail might follow the traces for T1 and T2 simultanously in an
expanded NFA for an initial section of a full trace. At some point they
become finally separated and all labels for the final sections of those
traces are in attractive domains
of T1 or T2. In that case it makes sense to replace the trace sections
in the attractive domains by labels that link to an NFA that can be
treated by FastMatcher.
- Do we need a second expansion when a trace A -> B ->
C contains a label B which refers to an NFA that can be treated by a
FastMatcher and B is not among the labels the parser has to care for (
according to 1. ) ?
- We need some policies which care for NFAs which shall not be fast traced. This is
basically a demand for INTRON ( what if INTRON is the only special case
where we wish to have the full parse tree structure for postprocessing?
)
2008-07-01
EasyExtend 3.0.1 and P4D 1.1 released!
Lots of bugfixes and new features in P4D
langlet where the active development takes place righ now.
2008-06-26
EasyExtend 3.0 - fin and P4D 1.0 released!
After almost two months and lots of bug-fixes I've finally set EE 3.0
on final. Major changes are
- For the first time I included a C extension module called
cyTrail with the distribution. cyTrail was built using cython and it is
optional. So EE works also in the absence of it. cyTrail enables some
optimizations on Trails lexing/parsing.
- The P4D langlet was included.
The P4D langlet
P4D means Python for Data and
it is roughly an XML dialect embedded in Python. P4D was inspired by
the E4X
( ECMAScript for XML ) standard but instead of embedded XML literals I
selected a representation that is more akin to SLiP. I tend to characterize
P4D by using following equation
P4D = E4X - E - X
P4D is distributed within EE but there is also a standalone
distribution. Roles are changed: in the one case P4D becomes yet
another example of EEs capabilities while in the other a minimal EE is
just a supportive framework.
2008-04-04
Some applications of left-recursion elimination in Trail
The classical example of an immediate
left recursive rule has the following form.
A textbook method to
avoid immediate left-recursion is to transform S and expand the rule
set using a new non-terminal S':
Here ()
is the empty production.
We translate this into a Trail NFA
(S,
0, S): [(b, 1, S)]
(b, 1, S): [(S', '.', 2, S)]
(S', '.', 2, S): [(a, 3, S), (None,
'-', S)]
(a, 3, S): [(S', '.', 2, S)]
|
and finally eliminate S' again
(S,
0, S): [(b, 1, S)]
(b, 1, S): [(a, 3, S),
(None, '-', S)]
(a, 3, S): [(a, 3, S), (None, '-',
S)]
|
This is of course the same result as we got when we had started with
Another example is
Applying the same sequence of steps as above we have
from which we derive
(S,
0, S): [(b, 1, S)]
(b, 1, S): [(S', '.', 2, S)]
(S', '.', 2, S): [(a, 3, S), (None,
'-', S)]
(a, 3, S): [(S, 4, S)]
(S, 4, S): [(S', '.', 2, S)]
|
and finally get
(S,
0, S): [(b, 1, S)]
(b, 1, S): [(a, 3, S),
(None, '-', S)]
(a, 3, S): [(S, 4,
S)]
(S, 4, S): [(a,
3, S), (None, '-', S)] |
The NFA is equivalent with
Discussion
Elimination of immediate left recursion leads to right recursion as we
have seen in the second example. Although the transformation can be
perfomed succinctly using Trails NFA representation of the grammar it
still doesn't seem to be quite right. First of all a grammar rule is
also always an interface description or a pattern in EasyExtend.
Secondly we can use left recursion to express a different operator
precedence as we do with right recursion. So the expressions are
inequivalent. This might not be much of a problem in EasyExtend because
it relies on Pythons Grammar which is LL(1) and expresses operator
precedence just indirectly within a chain of rules. It is a problem for
satellite langlets i.e.
small associated langlets though.
From a design perspective I'm not entirely sure what to make out of it?
Banning immediate left recursion? Issuing a warning that gets ignored
by the user or just transform aggressively?
2008-04-02
Convergence and advancements
Fixing through advancing
The last two weeks I spent much time with bugfixes on the Trail
implementation. Unfortunately I often see bugs only by advancing the
whole system. So instead of moving slowly towards a stable release in a
convergent process I seek new problem domains that challenge EE and
introduce complexity on their own. The last of the many splinter
projects that build a cloud around EE is a variant of E4X.
P4X
Among other things E4X deals with embedding of XML literals in
ECMAScript languages. Creating XML literals in Python using EasyExtend
is not yet feasible. Trail is simply not powerful enough to express the
context sensitivity that arises from combining XML and Python in a
seamless fashion. Instead a new concept starts to emerge which I call ParserHooks of satellite langlets. A P4X langlet
delegates parsing of XML to another langlet which I called minxml.
Again minxml can delegate parsing Python expressions ( when XML is used
as a templating language ) to P4X. But this requires some adaption of
the infrastructure. ParserHooks require a pattern, some registration
mechanism and so on.
Prior to this design effort I struggled with moving definitions from
the lexer to the parser and back. It didn't work too well but at least
I found a serious bug in Trails algorithm that builds parse tree from
NFAMultiTraces + subtrees. This algorithm was ugly and incomprehensible
so I ripped it off and replaced it by a cleaner stack machine based
approach.
The perils of expansion
These simple correctures are much unlike the elefant in the boat that
shakes the whole system around. Trails expansion mechanism is the part
of EasyExtend I find the hardest to understand. Not so much in the
basic outline but in its implications. Expansion must somehow capture
everything that deviates from
a simple LL(1) grammar - and it fails. Two of the failures are well
known and lead to warnings. These are LeftRecursion and RightExpansion.
RightExpansion is always only detected after the expansion process and
has zero influence on the expansion process. Issuing RightExpansion
warnings is morally correct IMO. But although Trail terminates rule
expansion when left recursion is detected it doesn't suffice to prevent
hang ups. So I defined some artificial bounds similar to recursion
limits that terminate the system with an exception. Unlike
RightExpansion I don't think LeftRecursion is immoral and it shall be
made possible. I made some drafts on modifying Trail to enable
LeftRecursion but I'm scared about the additional complexity brought to
some already badly understood algorithm.
2008-03-16
EasyExtend 3.0 - beta2 released!
Two weeks ago I released the first beta of EasyExtend 3.0. Now
the changelog has been filled with
bugfixes and improvements. I also
re-integrated the fun langlet Teuton
and the coverage langlet which
finally contains a long promised functionality: it covers expressions
in
boolean operations s.t. it can detect that an expression 1/a
in the boolean operation
True or 1/a
is never executed.
Tokenization is still rather slow ( 9.000 - 10.000 chars/s on my 1.5
GHz notebook or ~220 LOC / s ) but I removed a serious performance
killer in Trail that accidentally lead to exponential slow down of
parsing in some cases. The transformer debug API is more
complete. It's also simpler to pass extra arguments into node
transformers.
What's left ?
doctest is not running properly. This indicates some signifcant
variances in
the behaviour of Trail and Pythons own parser. The coverage langlet
needs a bit more polishing. However further Trail performance
optimizations are beyond the scope of the current 3.0 release.
Stabilization and testing have higher priority at the moment.
2008-03-04
EasyExtend 3.0 -beta1 released!
After more than half a year of work I released the first beta of
EasyExtend 3.0 today. EasyExtend 3.0 is the second major redesign of
EasyExtend. To gain more power and simplicity I implemented a new
parser generator from the scratch called Trail. Trail unifies several
aspects of EasyExtend. It is the base of both the EE 3.0 tokenizer and
the parser. It is used to validate parse trees against grammars and it
is used for parse tree synthesis.
Besides Trail other new aspects of EE 3.0 are
- User defined file suffixes are recognized by the import
machinery
- Framework extension that supports facilitation of writing
user defined tokenizers
- Simplification of access to tokenizer and parser from
individual extension languages ( langlets ).
- Improved stability of the CST transformation process and
improved debugging facilities
Currently EasyExtend 3.0 does not support several extension languages
being described on the main page. Some of them like the macro langlet
will be reimplemented in EasyExtend 3.1 using Trail techniques and a
new persistence layer called exo.space. Others will be upgraded until
the final release of EasyExtend 3.0.
|