Easy Extend


                                                                


2009-03-01

Trails Of EasyExtend

This News Site has turned into an almost regular but somewhat impoverished blog.

I've taken advantage now from wordpress and started a blog called Trails Of EasyExtend. So nothing is lost and the fun can continue.

2009-02-04

The postlexer

Once again C++ parsers are in the news. Well, these are old news and the quoted article might be hopelessly dated. However redditors provide useful links to projects like Pork.

My impression of this article is that the author is somewhat confused. He states that parsing C++ is hard because it isn't LALR and later withdraws his jugdement and says syntactical analysis is easy though not trivial. Semantic analysis is the hard stuff. But the parsing process depends on semantic analysis. One cannot parse C++ without knowing whether A * B denotates a product of values or a type declaration. There might be other such cases and to know about them would have been invaluable information.

I do not want to consider the C preprocessor here and just expect that the C++ source being parsed is the output of the preprocessor. EasyExtends approach to deal with context sensitivity is called parser sandwiches. A parser sandwich has three layers.
  • A lexer. Reads a source string and outputs a token stream. A lexer can use a context free grammar or some regular expression engine. The basic assumption is that it is context free.
  • A postlexer. Reads a token stream and returns a possibly modified token stream.
  • A parser. Reads a token stream and creates a parse tree. The parser shall be context free.
Unlike lexer and parser the postlexer is context and language dependent. It is the single place where EE processes significant whitespace for example. The INDENT / DEDENT / NEWLINE token are not the product of the lexer but the postlexer. The postlexer is freely programmable in imperative style using Python.



                 
                     Parser sandwich


2008-11-26

Here is a PDF presentation in which computing scientists Bernard Stepien and Liam Peyton compared Python with TTCN-3. TTCN-3 stands for Testing and Test Control Notation and it is domain specific language specifically designed for writing tests in the domain of telecommunication. In almost any category poor Python just loses. There are some annoying aspects in the presentation like containing not even conistent Python code at several places ( examples: on slide 42 an "else if" construct is used instead of the correct "elif". On slide 13 fields are not assigned to self which leads to code that will just fail to run ). But leaving this carelessnes aside there is something that estranges an experienced Python programmer. The way how Stepien and Peyton compare TTCN-3 language constructs to those of Python is odd. Take slide 12 for example:

TTCN-3 templates to Python
TTCN-3 templates could be mapped to Python object instances.
• However, there are serious limitations using the above technique with Python.
• Python objects are practically typeless


Then the presentation goes over 18 slides just to explain how bad plain Python classes are for the same purposes as TTCN-3 templates. They are correct about it but who cares? If one wants to reproduce the behaviour of templates one creates a particular type in Python.

In the rest of the article I demonstrate how to create TTCN-3 like templates in Python

class Template(object):
    def __init__(self):
        self._frozen = False
        self._fields = []        
        
    def __setattr__(self, name, value):
        if name in ('_fields', '_frozen'):
            object.__setattr__(self, name, value)
            return
        if self._frozen:
            raise RuntimeError("Cannot set attribute value on a frozen template")
        if name not in self._fields:
            self._fields.append(name)
        object.__setattr__(self, name, value)

    def __eq__(self, other):
        if len(self._fields)!=len(other._fields):
            return False
        else:
            for f_self, f_other in zip(self._fields, other._fields):
                val_self  = getattr(self, f_self)
                val_other = getattr(self, f_other)
                if val_self != val_other:
                    return False
            return True
        
    def __ne__(self, other):
        return not self.__eq__(other)

    def freeze(self):
        self._frozen = True
       
    def clone(self):
        T = Template()
        T._fields = self._fields[:]
        for field in T._fields:
            setattr(T, field, getattr(self, field))
        return T

The basic idea of this class called Template is to manage an internal _fields attribute that contains the attribute names created by __setattr__ dynamically. This is done to preserve the order of the fields on comparison. In the current implementation there aren't any fields marked as optional and it will take additional effort to introduce them and modify __eq__. But it is easy to demonstrate the behaviour of wildcards. First we have to introduce another class for this purpose:

class Wildcard(object):
    def __eq__(self, other):
        return True
   
    def __ne__(self, other):
        return False

wildcard = Wildcard()      

Whatever object X is compared to wildcard it is always X == wildcard == True. So we can create

templ_1 = Template()
templ_1.field_1 = wildcard
templ_1.field_2 = "abc"

templ_2 = Template()
templ_2.field_1 = "xyz"
templ_2.field_2 = wildcard

print templ_1 == templ_2  # -> True

A Pattern class can be created the same way as the Wildcard class

class Pattern(object):
    def __init__(self, regexp):
        self.pattern = re.compile(regexp)

    def __eq__(self, other):
        try:
            return bool(self.pattern.match(other))
        except TypeError:
            return True

    def __ne__(self, other):
        return not self.__eq__(other)

An '==' comparison of a Pattern instance with a string will match the string against the pattern and yield True if the string could be matched and False otherwise.

So it's perfectly possible to reproduce most aspects of TTCN-3 templates by just a few lines of Python code. On the downside of TTCN-3 this isn't possible the other way round. No matter how hard you work in TTCN-3 it is impossible to create new types showing interesting behaviour.                                                                       

2008-11-24

EasyExtend and Jython

I checked out EasyExtend with Jython. I thought the only relevant adaption that has to be made is consequently emitting Python source after transformation since I'm not inclined to shoehorn CSTs into Jythons ( or ANTLRs ) internal AST format. But then I got:

java.lang.ClassFormatError: Invalid method Code length 124516 in class file parse_nfa$py

This stems from an internal limitation of the JVM that permits method sizes of at most 64K. Welcome to 1980! Of course EasyExtend doesn't define methods of this size. It's just that the parse-tables ( nfas ) exceed those limits and Jython maps those tables onto methods. I dare to say that even breaking the whole table into parts doesn't fix the problem because expansion leads to huge single NFAs occasionally. I'm not going to fix this and therefore this is the end of the story of EE on the JVM.

2008-11-17

Recursive embeddings

Recursive embeddings are the single most challenging aspect of EasyExtend and Trail. It's also one of the topics that advances top down parsers in general. So what comes might be an original contribution to the practice of building parser generators.

A recursive rule contains its own rule-symbol also in the rule defining part on the RHS. This isn't always a problem. It is one only in case of a conflicting decision. Take a look at the following rule:

A: '[' A ']' | 'a'

There is no conflict that has to be resolved. Same goes with

B: '[' B ']' | 'b'

of course which has the exact same structure. But now combine them into

C: A | B
A: '[' A ']' | 'a'

B: '[' B ']' | 'b'

and we notice an ambiguity. Finding a left bracket "[" in the token stream gives no indication whether A or B is the correct rule that has to be selected to build the parse tree. The only strategy used by Trail to resolve ambiguities is embedding/expansion. But this won't help here because expansion was cyclic and we'd run into an infinite expansion. All of this happens on the level of NFAs and we need to examine one before we can proceed:

(A, 0, A): [('[', 1, A), ('a', 4, A)],
(A, 2, A): [(']', 3, A)],
('[', 1, A): [(A, 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],

The start symbol is (A, 0, A). After start the NFA branches into a bracket '[' and a letter 'a'. The bracket trace leads to (A, 2, A). Since A is found in the first entry of (A, 2, A) the NFA is called recursively. The NFA only terminated when an EXIT symbol (None, '-', A) is entered.

When we try to embedd this NFA in C together with the NFA of B we run into a clash because we have [(A, 2, A) , (B, 2, B)] among the transitions. Clashes can be resolved by means of expansion but only if we don't run into cycles as mentioned above. But in this case we have a cycle. So what can we do? I'll describe no an NFA transformation and explian why it works in the context of Trail:

First we split (A,2,A) into a pair (A,'(', 2, A) , (A, ')', 2, A) which affects the NFA in the following way:

(A, 0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A): [(']', 3, A)],
(A, '(', 2, A): [(A, ')', 2, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],

Now we give an interpretation of (A,'(', 2, A). It shall act like a goto to the start of the NFA. This is a bit like a tail recursion elimination. Now it's not the start symbol but the first set it shall link to.

(A, 0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A): [(']', 3, A)],
(A, '(', 2, A): [('[', 1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [(None, '-', A)],
('a', 4, A): [(None, '-', A)]}],

The complementary label replaces now the EXIT symbol.

(A, 0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A): [(']', 3, A)],
(A, '(', 2, A): [('[', 1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [
(A, ')', 2, A)],
('a', 4, A): [
(A, ')', 2, A)]}],

Now we have lost the EXIT symbol but it's clear that it can only be the successor of the clsoing label:

(A, 0, A): [('[', 1, A), ('a', 4, A)],
(A, ')', 2, A): [(']', 3, A), (None, '-', A)],
(A, '(', 2, A): [('[', 1, A), ('a', 4, A)],
('[', 1, A): [(A, '(', 2, A)],
(']', 3, A): [
(A, ')', 2, A)],
('a', 4, A): [
(A, ')', 2, A)]}],

This is already the final NFA. But there is something wrong here. When you look at the transition [(']', 3, A), (None, '-', A)] it cannot be a real alternative because whether you choose ] or None depends how often you entered (A,'(', 2, A)  compared to (A, ')', 2, A). So we need a counter that matches opening against closing braces. This isn't of course hard in itself but notice that at each point in the automaton we can branch into different labels so we need a counter for each trace. Fortunately it's easy in Trail because it uses a spaghetti- or cactus stack to store labels anyway. So all we need to do when we enter the (A, ')', 2, A) is whether it can be balanced or not. If

    count((A, ')', 2, A)) > count((A, '(', 2, A))

then (None, '-', A ) is selected otherwise the parsing process continues selecting (']', 3, A).

Grammars that are accepted by Trail

Here are some examples:

G: A B [G] A C

G: A | B
A: "(" A ")" | a
B: "(" B ")" | b

G: A | B | x
A: a b G* a
B: a c G* a

The latter is particular interesting because it describes mutual recursion. Not only are A and B expanded in G but also G is expanded in A and B.

Limitations

There is not yet a clear scope of applicability nor an analytical treatment / classification. Some grammars are clearly rejected while others might just not work right now. They can also be classified according to the exception they raise.

Direct left recusion

A classical example for direct left recusion looks like

E: E '+' E | e

Those grammars are most easily detected and transformed. In this case one better writes:

E: e ('+' e)*

Many recursion locations

With "many recursion locations" I mean rules like this:

X : "{" X "}" X | "{" "}" | "a" | "a" X

The rule itself isn't problematic at all but when you insert it into

G: A | B
A: "(" A ")" | X
B: "(" B ")" | b

It produces wrong behaviour. Labels like (A, '.', IDX, A) are created which are an abomination.

Unstoppable expansion cycles

In this case it is impossible to determine the next nid selection. I implemented just a dumb overflow criterion that cancels the expansion process. This criterion is related to the depth of the grammar. The depth of the grammar is measured like this:
  • depth(N) = max(depth(K)) K in FirstSet(N)
  • depth(T) = 0 for T is terminal
  • depth(Grammar) = max(depth(N)) N is a rule nid
Notice that by computing the depth of the node we exclude immediate left recursion in practice: an immediate left recursive rule is in its own first set and we don't get a result. In case of recursive embeddings the depth of the Grammar might be increased. If this increase exceeds 10 times the depth of the original grammar an exception is raised. The following grammar is not admissable according to this rule:

GA: GA1 | GA2
GA1: 'a' GA2
GA2: 'a' GA1 | 'b'

Summary

I'd like to have a better analytical insight into the recusive embedding process. So far it works for some simple cases. Some of them are even convenient. But we can give other simple examples of recusive grammars that fail. I'm not really able to see a universal pattern right now and I'm not even sure this has to be deeply investigated? It's a good leap forward after all.

2008-10-08

About some grammars Trail can now deal with in a sane way

Here is one:

R: A | B
A: C A | D
B: C B | E

Expansion of A in R causes now a LeftRecursionWarning and we get a final RuntimeError exception

                  RuntimeError: More than ten expansion warnings issued. Expansion is terminated!

Solving the problem was finally easy. When a label L = (A, i, A) is expanded in R one among the labels that are embedded has the form
(A, shift(j), A). In a next step this we expand this label and this yields a new one (A, shift(shift(j)), A) . So all we need to cancel the infinite expansion is to set a threshold for the index value i in (A, i, A). Since a rule can be quite big the value shall not be too small, e.g. 100. This threshold will be passed very soon.

2008-10-07

We also know there are known unknowns; that is to say we know there are some things we do not know. But there are also unknown unknowns - the ones we don't know we don't know.
                                                                                 Donald Rumsfeld, 2002-02-12

About some grammars Trail cannot deal with in a sane way

Here is one:

R: A | B
A: C A | D
B: C B | E

Trail will try to expand A or B in R and will expand forever because it will never run into a left recursion of R which would lead to immediate termination of expansion. Having rules that are right-recursive ( both A and B ) alone isn't sufficient to avoid trouble. We could transform rules into FunNets individually but then there is no hint on how to expand them which is still necessary.

2008-10-03

With the use of cyTrail + psyco and caching techniques I took a little pressure from Trails performance. I'm still dissatisfied with lexer performance and I've talked about ideas on how to resolve performance issues but I'm not anyhow close to an implementation.

Left recursion again...

It buggs me that Trail has obvious limitations in parsing expressions according to very simple rules. For example the following rule

S: a b [S] a c

leads to warning messages when Trail produces NFAs. This is because S has to be expanded within itself and left recursion is prohibited.

A practical example for what can't be parsed using Trail can be given as well:

xml_element: '<' qname '>' ( ANY | xml_element )* '</' qname '>'

Advancing Trail

Suppose we eliminate S on the RHS of the rule in a state transition diagram. A first attempt to draw the diagram might look like this:


It is insufficient though because sequences like

  a b a b a c None

  a b a c a c None

are valid while they are clearly not recognized by the grammar of rule S.

In a second attempt we draw a two dimensional diagram by cloning the basic sequence and linking the sequences



This spatial diagram precisely expresses what we were looking for. But we can't realize the infinite diagram by a fixed set of nodes of the kind we've used in a Trail NFA description. So we have to generalize Trail NFAs by a network of connected functions. The up/down shift in levels of the diagram is expressed using a single parameter.

Ansatz:

f0 = lambda k: (('a',0,'S'), [(f1, k)])
f1 = lambda k: (('b',1,'S'), [(f2, k),(f0,k+1)])
f2 = lambda k: (('a',2,'S'), [(f3, k)])
f3 = lambda k: (('c',3,'S'), [(f4, k)])
f4 = lambda k: (((None, '-', 'S'),[]) if k == 0 else [('a',2,'S'),(f3,k-1)])

Each lambda expression takes a single parameter and returns a Trail NFA label and a list of lambda expressions and parameters that have to be passed to continue. The list of successor functions can be empty and cause the moves coming to halt.


a b a b a c a c None
fo(0)
('a', 0,'S')
(f1, 0)
a
f1(0)
('b',1,'S')
(f2,0),(f0,1)
b
f2(0),f0(1)
('a',0,'S'),('a',2,'S') (f1,1),(f3,0)
a
f1(1),f3(0)
('b',1,'S'),('c',3,'S') (f2,1),(f0,2),(f4,0)
b
f2(1),f0(2)
('a',0,'S'),('a',2,'S') (f1,2),(f3,1) a
f1(2),f3(1)
('b',1,'S'),('c',3,'S')
(f2,2),(f0,3),(f4,1)
c
f4(1)
('a',2,'S') (f3,0)
a
f3(0)
('c',3,'S') (f4,0)
c
f4(0)
(None,'-','S')

None

Open problems

Let be

S: a b [S] A c

with FirstSet(A) /\ FirstSet(a) != {}. How are expansion and the NFA -> FunNet transformation related to each other? The whole point of the FunNet transformation is that we can't apply expansion on S! A possible appraoch is to solve a specialized problem first, perform the necessary substitutions and finally transform the NFA into a FunNet.

S: a b [a b A c] A c

Two questions
  1. Is there are universally applicable specialization?
  2. How has the subsitution to be performed with the expanded NFA s.t. the original general form can be restored?

2008-09-19

Langscape

EasyExtend 3.1 will contain a new package called langscape and a new module called eelangscape.py placed in the root directory. With langscape it will be possible to create new language features on the fly without prior definition of Grammar.ext and Token.ext files. This departure from the classical EE notwithstanding the langscape operators and API rely entirely upon the established infrastructure. The most simple part has been finished already and it is possible to roughly the following

from EasyExtend.eelangscape import SyntaxExtension, binding, binding_strength

def concat_handler(a, b):
    '''
    Define concatenation between two objects
    '''
        if isinstance(a, int) and isinstance(b, int):
            return int(str(a)+str(b))
        elif isinstance(a, list) and isinstance(b, list):
            return a+b
        elif isinstance(a, tuple) and isinstance(b, tuple):
            return a+b
        raise TypeError("Arguments cannot be concatenated")

syntax = SyntaxExtension()
syntax.add_binary_operator("++", "PLUSPLUS", binding( binding_strength == '_ + _'), concat_handler)

That's it. It doesn't need more to define a binary ( infix ) operator anymore.

It's not yet clear to me yet how to advance this simplicity beyond the most simple case ( shown here ). This one works well because we just have to replace expressions of the form "a ++ b" by  "concat_handler(a, b)" expressions. This can be done uniformly for all binary operators. But when facing more complex expressions like statements I wonder if we end up defining grammar rules and performing CST transformations again just in a less general setting? Maybe one way around this is to get into a PHP-ish API: a large flat function list that covers most of the functionality.



2008-09-18

Michal Wallace mentioned EasyExtend in his blog post among a few other quite complex and more famous Python projects with a particulary sucking documentation. Well, I just want to say that documentation evolves as projects mature. I don't spend too much time into refining docs when I'm not confident with the system and also do major re-workings and perspective changes.

2008-09-13

P4D Langlet 1.2.4 released

P4D 1.2.4 contains some cleanups regarding subelement rules. Checkout the rewitten section 2.7 in the P4D docs that deals with those issues.

Flexutils

In the meantime I'm busy learning Adobe Flex and work through it using P4D. Not sure who else refuses to use Adobe FlexBuilder when working Flex? Flex and particular MXML invites to use an IDE and the single one that is capable to deal with MXML properly stems from Adobe. Clever. I want to use Flex like a scripting language containing only P4D elements and ActionScript code.

Example 1: consider the HelloWorld MXML snippet. Store it as HelloWorld.mxml. Now point to the directory containing HelloWorld.mxml and type
             p4d --xml2p4d --flex HelloWorld.mxml

the result shall be a file named HelloWorld.p4d. Open the file and look at its content:

from EasyExtend.util.path import path
from EasyExtend.langlets.p4d.flexutils import mxcontrol
import EasyExtend.langlets.p4d.flexutils as flexutils

p4ddoc = mx:Application( verticalAlign="middle"
                viewSourceURL="src/HelloWorld/index.html"
                height="160"
                horizontalAlign="center"
                width="300"
                xmlns:mx="http://www.adobe.com/2006/mxml" ):
    mx:Panel( paddingRight="10"
              title="My Application"
              paddingBottom="10"
              paddingTop="10"
              paddingLeft="10" ):
        mx:Label( text="Hello World!", fontSize="24", fontWeight="bold" )

if __name__ == '__main__':
    flexutils.exec_mxml(p4ddoc, "helloworld", "HelloWorld.swf", path(__file__).dirname())

Now run the file as an ordinary P4D script:

       p4d HelloWorld.p4d

P4D flexutils converts the p4ddoc element back to MXML. Then it runs the Adobe AIR amxmlc compiler, creates a file called helloworld-AIR-app.xml and finally executes this file using the adl tool. The result shall be the following FlashPlayer box:



This procedure requires Flex 3 and AIR being installed on your system. The command line tools have to be found. You don't configure any paths in flexutils.p4d ( which is in the p4d langlet directory ) or elsewhere e.g. in conf.py of the P4D langlet.

Example 2:

This time we embedd an AS3 handler in the MXML file.

<?xml version="1.0"?>
<mx:Application xmlns:mx="http://www.adobe.com/2006/mxml">
    <mx:Script>
        <![CDATA[
            private function hello():void {
                textarea1.text="Hello World";
            }
        ]]>
    </mx:Script>
    <mx:Panel title="My Application"
        paddingTop="10"
        paddingBottom="10"
        paddingLeft="10"
        paddingRight="10"
    >
        <mx:TextArea id="textarea1"/>
        <mx:Button label="Submit" click="hello();"/>

    </mx:Panel>

The resulting p4d file looks like

from EasyExtend.util.path import path
from EasyExtend.langlets.p4d.flexutils import mxcontrol
import EasyExtend.langlets.p4d.flexutils as flexutils

p4ddoc = mx:Application( xmlns:mx="http://www.adobe.com/2006/mxml" ):
    mx:Script:
        {**            
            private function hello():void {
                textarea1.text="Hello World";
            }
        **}
    mx:Panel( paddingLeft="10"
              paddingRight="10"
              paddingBottom="10"
              paddingTop="10"
              title="My Application" ):
        mx:TextArea( id="textarea1" )
        mx:Button( click="hello();", label="Submit" )


if __name__ == '__main__':
    flexutils.exec_mxml(p4ddoc, "helloworld", "HelloWorld.swf", path(__file__).dirname())

Running the script we finally get:




2008-08-16

P4D Langlet 1.2.3 released

For more information see the P4D docs.

About the STOP symbol I talked about in my last news entry. Unlike most other Trail features it was easy to implement.

2008-08-14

Messing around with ambiguities in Trail

Trail can be a quite powerful parser generator but sometimes it doesn't let you express the obvious. Trail always does longest match and this has  implications sometimes. I realized this when defining lexical analysis for Umlaute in Teuton. When I defined 'für' as a keyword I had to be aware that 'für' can also be an arbitrary NAME from the perspective of the lexer. So the lexer doesn't match 'für' as an own entity but as a NAME instead. I resolved this by adding a blank: 'für' -> 'f'+'ü'+'r'+' '. The string 'für ' is not a NAME anymore. This worked because the blank is the single allowed character after 'für' but this depends on more syntax.

What if we use a second exit symbol? When None is in the current selection the parser is allowed to terminate and leave the used NFA. Suppose now there is an explicit symbol STOP. Now we can have a selection like ['x', 2345, None, STOP]. Here None can be erased from the selection in case STOP was found. They have precisely the same effect except that STOP was used explicitely in the Token.ext specification. With STOP we can define

               FUER: ('fuer' | 'für' ) STOP

Now NAME shall never match 'fuer' or 'für' anymore but it shall be matched by FUER.

2008-08-13

EasyExtend 3.0.1 and P4D Langlet 1.2.2 released. Notice that the distributed P4D package has been renamed because some people complained that there is already an equally named project where P4D stands for Python For Delphi. IMO with P4D Langlet there is enough context present for avoiding confusion.

Many thanks to Gerhard Häring who contributed patches for Teuton and EE setup.py.

2008-08-10

The worst bugs that can be made in EasyExtend langlets are grammar bugs. I've found one in P4D 1.2. Here is the reason why

In P4D 1.2 Token.ext specifies following comment rule:

P4D_Comment: '(' '*' ('*'|ANY)* '*' ')'

A P4D_Comment can't freely be used within Python code but only within P4D elements. But this won't make the definition safe because the lexer acts globally and this leads to a conflict when e.g. foo(*arg) is called because the the syntactical context is not available for the lexer and so it expects a closing "*)" like this foo(*arg*).

These bugs are terrible because they lead to incompatible changes of the language. The only consolation I have is that P4D is out now for 6 weeks only.

In P4D 1.2.1 i changed this rule from parens to braces:

P4D_Comment: '{' '*' ('*'|ANY)* '*' '}'

This is both compliant with Python 2.X and Python 3.0.

2008-08-09

Last week I released generator_tools 0.3.4. It is clear now that there are bugs in generator_tools that won't ever be resolved ( see also the Caveats and Limitations section of the docs ). However I've ideas how to workaround some of the most severe limitations using EasyExtend techniques. I've added a new module called enhanced_generators.py that defines code transformations in an EE style. This module is not ready to use right now and it will be introduced not before generator_tools 0.4.

A new P4D release is out now. P4D 1.2 contains parts of my reified wisdom from previous corporate projects, namely Bytelets. Unlike XML, Bytelets are used to express binary data. The role of P4D data-structures is now twofold: they simplify writing textual data that can be serialized as XML as well as writing binary data that can be serialized as hexcode. I found Bytelets ideal for testing applications that communicate on the level of binary protocols. Moreover P4D 1.2 repurposes and extends hexadecimal numbers which do not have much life on their own in Python. Hex objects are now used to repesent hexadecimal literals and Hex objects together with P4Ds tree shaped data structures constitute the fundaments of Bytelets.

2008-07-28

Right now jut some intermediary notes about the three main projects.

1. generator_tools

The generator_tools package can't deal with Python 2.5 style yield expressions. Once again jump generation heuristics have to be extended. The current heuristics cause interpreter crashes due to stack corruptions. Since I've seen no easy way to track them using CPythons source and VS 6 I've written a toy Python interpreter in Python. This way I could at least grasp the cause of the bug.

2. P4D

P4D 1.2 is in preparation. While P4D 1.1 was all about XML / P4D conversions, P4D 1.2 is about Bytelets. Bytelets are dataflow networks used to deal with binary data whose canonical representations are BER encoded T(ag)L(ength)V(alue)s. Bytelets are an attractive and flexible way to encode binary data. This liberates P4D from being just another XML representation or particularly bound to textual data formats.

3. Trail 

Trail again. I did some perfomance optimizations by porting even more parts of Trails algorithms to Cython. Cython is used a little more properly now by exploiting Cythons C type declarations. The results are o.k. but I'm not amazed by those additional speedups. The most advantages I made were due to caching techniques rather than Python to Cython transitions.

The speed of tokenization still depresses me. In fact I feel tokenization is a bit underengineered i.e. not well adapted to the target of producing a token stream. Generating parse trees instead using a standard grammar ( i.e. Token.ext ) and merging streams of terminals in order to get NAME, STRING or NUMBER token seems to be a bit odd. It works fine for handling whitespace and other INTRON stuff but creating lengthy sequences of token for each character makes the algorithm inherently slow.

3.1 FastMatcher

I did some initial steps to move out of this trap. I've written a _fastmatcher.dll in C++ and a FastMatcher class in Python and a tiny amount of C glue code that lets me use ctypes. Fast matcher takes an NFA description and information about PseudoToken ( i.e. plain characters ) and builds an internal table. This table is then used to match strings.

>>> import EasyExtend.langlets.zero.lexdef.lex_nfa as lex_nfa
>>> fm = FastMatcher(lex_nfa.nfas[322], lex_nfa.pseudo_token)  # Rule 322 describes hexadecimal numerals
>>> fm.matchstr("0x"+"A8"*1000000, 0)
2000002

The method matchstr of FastNumber takes two arguments: a string and a reading position. It returns the index of the first character after the match. Here we passed a string of length 106+2 and FastMatcher recognizes it in 0.35s on my 1.5 GHz notebook. This is slower that regular expression matching with "0x[0-9a-fA-F]". Actually the regexp matches the expression in 0.03s and is therefore 10 times faster. However a slight reorganization of the regexp into "0x(?:[0-9]|[a-fA-F])" leads to the opposite result. Now the regexp matcher needs 1.25s and is around 3 times slower than FastMatcher. Since the majority of expressions are described using such branches I expect that FastMatcher to be more appropriate for matching expressions described by EBNF grammars. This still needs more investigation but it is encouraging.

However it is yet not clear how to use FastMatcher besides applying it to the most simple automata. More preparation is required:
  1. We need to know which token are really used by the parser. If there are for example just NAME, STRING and NUMBER used in the Grammar we don't need the luxury of 80 different token being described in Token and Token.ext. This can be determined statically under the premise that when Token/Grammar or Token.ext/Grammar.ext both change the computation of the new parse_nfa precedes that of lex_nfa. This isn't a big deal because changes apply automatically.
  2. Basins of attraction ( attractors ). Suppose we have to distinct traces for two token types T1 and T2. A trace based parser like Trail might follow the traces for T1 and T2 simultanously in an expanded NFA for an initial section of a full trace. At some point they become finally separated and all labels for the final sections of those traces are in attractive domains of T1 or T2. In that case it makes sense to replace the trace sections in the attractive domains by labels that link to an NFA that can be treated by FastMatcher.
  3. Do we need a second expansion when a trace A -> B -> C contains a label B which refers to an NFA that can be treated by a FastMatcher and B is not among the labels the parser has to care for ( according to 1. ) ?
  4. We need some policies which care for NFAs which shall not be fast traced. This is basically a demand for INTRON ( what if INTRON is the only special case where we wish to have the full parse tree structure for postprocessing? )

2008-07-01

EasyExtend 3.0.1 and P4D 1.1 released!

Lots of bugfixes and new features in P4D langlet where the active development takes place righ now.

2008-06-26

EasyExtend 3.0 - fin  and P4D 1.0 released!

After almost two months and lots of bug-fixes I've finally set EE 3.0 on final. Major changes are
  • For the first time I included a C extension module called cyTrail with the distribution. cyTrail was built using cython and it is optional. So EE works also in the absence of it. cyTrail enables some optimizations on Trails lexing/parsing.
  • The P4D langlet was included.

The P4D langlet

P4D means Python for Data and it is roughly an XML dialect embedded in Python. P4D was inspired by the E4X ( ECMAScript for XML ) standard but instead of embedded XML literals I selected a representation that is more akin to SLiP. I tend to characterize P4D by using following equation
                        P4D = E4X - E - X

P4D is distributed within EE but there is also a standalone distribution. Roles are changed: in the one case P4D becomes yet another example of EEs capabilities while in the other a minimal EE is just a supportive framework.

2008-04-04

Some applications of left-recursion elimination in Trail

The classical example of an immediate left recursive rule has the following form.

S: S a | b

A textbook method to avoid immediate left-recursion is to transform S and expand the rule set using a new non-terminal S':

S: b S'
S':a S' | ()

Here () is the empty production.

We translate this into a Trail NFA

(S, 0, S): [(b, 1, S)]
(b, 1, S): [(S', '.', 2, S)]
(S', '.', 2, S): [(a, 3, S), (None, '-', S)]      
(a, 3, S): [
(S', '.', 2, S)]

and finally eliminate S' again

(S, 0, S): [(b, 1, S)]
(b, 1, S): [(a, 3, S), (None, '-', S)]
(a, 3, S): [(a, 3, S), (None, '-', S)]      

This is of course the same result as we got when we had started with

S: b a*

Another example is

S: S a S | b

Applying the same sequence of steps as above we have

S: b S'
S': a S S' | ()

from which we derive

(S, 0, S): [(b, 1, S)]
(b, 1, S): [(S', '.', 2, S)]
(S', '.', 2, S): [(a, 3, S), (None, '-', S)]      
(a, 3, S): [
(S, 4, S)]
(S, 4, S): [(S', '.', 2, S)]

and finally get

(S, 0, S): [(b, 1, S)]
(b, 1, S): [(a, 3, S), (None, '-', S)]
(a, 3, S): [(S, 4, S)]      
(S, 4, S): [(a, 3, S), (None, '-', S)]

The NFA is equivalent with

S: b (a S)*

Discussion

Elimination of immediate left recursion leads to right recursion as we have seen in the second example. Although the transformation can be perfomed succinctly using Trails NFA representation of the grammar it still doesn't seem to be quite right. First of all a grammar rule is also always an interface description or a pattern in EasyExtend. Secondly we can use left recursion to express a different operator precedence as we do with right recursion. So the expressions are inequivalent. This might not be much of a problem in EasyExtend because it relies on Pythons Grammar which is LL(1) and expresses operator precedence just indirectly within a chain of rules. It is a problem for satellite langlets i.e. small associated langlets though.

From a design perspective I'm not entirely sure what to make out of it? Banning immediate left recursion? Issuing a warning that gets ignored by the user or just transform aggressively?

2008-04-02

Convergence and advancements

Fixing through advancing

The last two weeks I spent much time with bugfixes on the Trail implementation. Unfortunately I often see bugs only by advancing the whole system. So instead of moving slowly towards a stable release in a convergent process I seek new problem domains that challenge EE and introduce complexity on their own. The last of the many splinter projects that build a cloud around EE is a variant of E4X.

P4X

Among other things E4X deals with embedding of XML literals in ECMAScript languages. Creating XML literals in Python using EasyExtend is not yet feasible. Trail is simply not powerful enough to express the context sensitivity that arises from combining XML and Python in a seamless fashion. Instead a new concept starts to emerge which I call ParserHooks of satellite langlets. A P4X langlet delegates parsing of XML to another langlet which I called minxml. Again minxml can delegate parsing Python expressions ( when XML is used as a templating language ) to P4X. But this requires some adaption of the infrastructure. ParserHooks require a pattern, some registration mechanism and so on.

Prior to this design effort I struggled with moving definitions from the lexer to the parser and back. It didn't work too well but at least I found a serious bug in Trails algorithm that builds parse tree from NFAMultiTraces + subtrees. This algorithm was ugly and incomprehensible so I ripped it off and replaced it by a cleaner stack machine based approach.

The perils of expansion

These simple correctures are much unlike the elefant in the boat that shakes the whole system around. Trails expansion mechanism is the part of EasyExtend I find the hardest to understand. Not so much in the basic outline but in its implications. Expansion must somehow capture everything that deviates from a simple LL(1) grammar - and it fails. Two of the failures are well known and lead to warnings. These are LeftRecursion and RightExpansion. RightExpansion is always only detected after the expansion process and has zero influence on the expansion process. Issuing RightExpansion warnings is morally correct IMO. But although Trail terminates rule expansion when left recursion is detected it doesn't suffice to prevent hang ups. So I defined some artificial bounds similar to recursion limits that terminate the system with an exception. Unlike RightExpansion I don't think LeftRecursion is immoral and it shall be made possible. I made some drafts on modifying Trail to enable LeftRecursion but I'm scared about the additional complexity brought to some already badly understood algorithm.


2008-03-16

EasyExtend 3.0 - beta2 released!

Two weeks ago I released the first beta of EasyExtend 3.0. Now the changelog has been filled with bugfixes and improvements. I also re-integrated the fun langlet Teuton and the coverage langlet which finally contains a long promised functionality: it covers expressions in boolean operations s.t. it can detect that an expression 1/a in the boolean operation

                True or 1/a

is never executed.

Tokenization is still rather slow ( 9.000 - 10.000 chars/s on my 1.5 GHz notebook or ~220 LOC / s ) but I removed a serious performance killer in Trail that accidentally lead to exponential slow down of parsing in some cases. The transformer debug API is more complete. It's also simpler to pass extra arguments into node transformers.

What's left ?

doctest is not running properly. This indicates some signifcant variances in the behaviour of Trail and Pythons own parser. The coverage langlet needs a bit more polishing. However further Trail performance optimizations are beyond the scope of the current 3.0 release. Stabilization and testing have higher priority at the moment.

2008-03-04

EasyExtend 3.0 -beta1 released!

After more than half a year of work I released the first beta of EasyExtend 3.0 today. EasyExtend 3.0 is the second major redesign of EasyExtend. To gain more power and simplicity I implemented a new parser generator from the scratch called Trail. Trail unifies several aspects of EasyExtend. It is the base of both the EE 3.0 tokenizer and the parser. It is used to validate parse trees against grammars and it is used for parse tree synthesis.

Besides Trail other new aspects of EE 3.0 are
  • User defined file suffixes are recognized by the import machinery
  • Framework extension that supports facilitation of writing user defined tokenizers
  • Simplification of access to tokenizer and parser from individual extension languages ( langlets ).
  • Improved stability of the CST transformation process and improved debugging facilities
Currently EasyExtend 3.0 does not support several extension languages being described on the main page. Some of them like the macro langlet will be reimplemented in EasyExtend 3.1 using Trail techniques and a new persistence layer called exo.space. Others will be upgraded until the final release of EasyExtend 3.0.