Easy Extend



                   




                     Author: Kay Schluehr
                     Date: 2008-03-03
                     Version: 0.1
                     E-mail: kay@fiber-space.de


Foreword

This is the documentation of the Trail parser generator. Trail is a parser generator for top down parsers based on a method I have called trace based parsing. Trace based parsing extends a parsing scheme known since Ken Thompson's work [1] of the late 60s on parsing regular expressions with nondeterministic finite automata (NFAs). Trail uses it to parse languages described by general context free grammars in EBNF format. For LL(1) grammars trace based parsing collapses to NFA driven LL(1) parsing with O(n) complexity in the length of the input token stream. For grammar rules that are not left factored Trail modifies LL(1) parsing by carefully embedding NFA transition tables into each other. This step is called NFA expansion. This way Trail achieves the power of advanced lookahead schemes or backtracking parsers with a single token of lookahead. The effort of trace based parsing can grow considerably with the size of the expanded NFAs. Improved parsers using expanded NFAs remain to be investigated in the future.


1. Introduction

Top down parsers of context free languages generally suffer from the following issues:
  • Left recursion. A rule might be written in the style  S: digit | S '+' S | S '*' S . A top down parser finding a digit has several choices to continue the parse. This can usually be avoided by transforming the rule into a right recursive style, with varying effort.
  • Ambiguities. It is known that no algorithm can detect all ambiguities of a grammar. However, detection schemes have been studied. Trail ignores this problem completely and does not use an intelligent disambiguation scheme either. So if you have a language with an ambiguous grammar, e.g. C, and want to write a parser, Trail can be a pre-parser at best.
  • Left factoring. A grammar rule S might have the structure  S = A S1 | A S2. We have to examine each branch in order to know whether a string matches against A S1 or A S2. This can become quite costly, e.g. when S' = A* S1 | A* S2. An obvious solution in simple cases is to factor A* out: S' = A* ( S1 | S2 ). But then S1 and S2 may still have common FIRST_SETs and only through expansion of S1 and S2 can we reproduce the simple structure we observe in S or S'.

In practice left factoring is the most pressing problem in top-down parsing. Programmers very frequently run into a situation where they define a rule
S = A | B where A and B can both match a terminal t. The most common strategies to solve this problem are

1) Backtracking.
    Here we check if A can fully match an initial section of the token sequence T = (t1, t2, ... ). If so, a parse tree with A as the root node is yielded.
    Otherwise we try the same with B. If neither A nor B matches T an error is reported.
    Backtracking is very simple and universally applicable, but it can be very costly.

2) Lookahead.
    This is best exemplified by ANTLR, created by Terence Parr, which can look arbitrarily deep into a token stream for disambiguation purposes.

A third method, a combination of two techniques, is used by Trail.

1) Parallel NFA traversal. A grammar is transformed into an NFA. The NFA is traversed in parallel: when S = A B ... x | A B ... y both
    branches are examined at the same time (see the sketch after this list).

2) Expansion. When S = A | B, A != B and A and B have common FIRST_SETs, then A and possibly also B will be expanded into their
    subrule structures. Expansion is repeated until S has the simple structure described in 1) or a cycle occurs, which indicates left recursion.
    In that case expansion is stopped with a warning.
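To make the idea of parallel traversal concrete, here is a minimal sketch in Python. The data layout and names (parallel_traverse, the toy NFA) are hypothetical and only illustrate the principle; they are not the actual Trail implementation, and A, B, x, y are treated as plain terminal symbols for simplicity.

    # Minimal sketch of parallel NFA traversal (hypothetical data layout).
    # The NFA maps a label to its list of follow labels; labels are (symbol, index) tuples.
    def parallel_traverse(nfa, start, tokens):
        """Return True if the token sequence is accepted by the NFA."""
        states = nfa[start]                                           # all labels reachable from the start label
        for tok in tokens:
            matching = [lbl for lbl in states if lbl[0] == tok]       # keep every branch that still fits
            if not matching:
                return False
            states = [f for lbl in matching for f in nfa[lbl]]        # merge the follow transitions
        return any(lbl[0] is None for lbl in states)                  # accepted iff the exit label is reachable

    # Example: S = A B x | A B y -- both branches are examined at the same time.
    nfa = {('S', 0): [('A', 1), ('A', 4)],
           ('A', 1): [('B', 2)], ('B', 2): [('x', 3)], ('x', 3): [(None, '-')],
           ('A', 4): [('B', 5)], ('B', 5): [('y', 6)], ('y', 6): [(None, '-')]}
    print(parallel_traverse(nfa, ('S', 0), ['A', 'B', 'y']))          # True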

2.  Using Trail

Trail based lexers and parsers are implemented in the EasyExtend/trail package. One can directly import NFAParser and CFAParser objects. However, a langlet argument has to be passed into NFALexer and NFAParser objects, and for langlets there are simpler ways to use parsers and lexers.

2.1 Import NFALexer and NFAParser


>>> import EasyExtend.langlets.zero.langlet as langlet
>>> from EasyExtend.trail.nfalexer import NFALexer, TokenStream
>>> lexer = NFALexer(langlet)
>>> lexer.scan(TokenStream("1+2"))
[[(258, 'NUMBER'), '1', (1, 1), (0, 1)], [(270, 'PLUS'), '+', (1, 1), (1, 2)], [(258, 'NUMBER'), '2', (1, 1), (2, 3)]]
>>>

The input TokenStream is just a wrapper around the source string. The idea of passing a token stream into a lexer might feel strange but the lexer is just another Trail parser with some special properties.

In the EasyExtend framework the output token stream of the scan() method is not passed immediately to the Trail parser. Instead it has to run through a filtering or postprocessing step. In this process tokens for significant whitespace are created for Python. The specific or context sensitive part of lexical analysis is implemented in eetokenizer.py and can be overridden.

For a proper analysis we have to perform the following steps:

>>> from EasyExtend.eetokenizer import Tokenizer
>>> tokenizer = Tokenizer(langlet)
>>> tokenizer.tokenize_string("def f():\n  print 77")
[[1, 'def', 1, (0, 3)], [1, 'f', 1, (4, 5)], [7, '(', 1, (5, 6)], [8, ')', 1, (6, 7)], [11, ':', 1, (7, 8)], [4, '\n', 1, (8, 8)], [5, '  ', 2, (0, 2)], [1, 'print', 2, (2, 7)], [2, '77', 2, (8, 10)],
[6, '', 3, (0, 0)], [0, '', 4, (0, 0)]]

Among other things this token stream contains the token [5, '  ', 2, (0, 2)], which represents an INDENT token, and [6, '', 3, (0, 0)],
which is the complementary DEDENT.

2.2 Using Trail functions from langlets

If you are acting inside the context of a langlet Trail is already in use. However, you can also access it explicitly, e.g. for test purposes:

c:\lang\Python25\Lib\site-packages\EasyExtend\langlets\zero>python run_zero.py
__________________________________________________________________________________

 zero

 On Python 2.5.1 (r251:54863, Apr 18 2007, 08:51:08) [MSC v.1310 32 bit (Intel)]
__________________________________________________________________________________

>>>
>>> parse("zero")
[258, [326, [303, [304, [305, [306, [307, [309, [310, [311, [312, [313, [314, [315, [316, [317, [1, 'zero', 1]]]]]]]]]]]
]]]]], [0, '', 2]]
>>> tokenize("def f():\n  print 77\n")
[[1, 'def', 1, (0, 3)], [1, 'f', 1, (4, 5)], [7, '(', 1, (5, 6)], [8, ')', 1, (6, 7)], [11, ':', 1, (7, 8)], [4, '\n', 1, (8, 8)], [5, '  ', 2, (0, 2)], [1, 'print', 2, (2, 7)], [2, '77', 2, (8, 10)], [4, '\n', 2, (10, 10)], [6, '', 3, (0, 0)], [0, '', 4, (0, 0)]]
>>> parse("def f():\n  print 77\n")
[257, [266, [291, [261, [1, 'def', 1], [1, 'f', 1], [262, [7, '(', 1], [8, ')', 1]], [11, ':', 1], [299, [4, '\n', 1], [5, '  ', 2], [266, [267, [268, [271, [1, 'print', 2], [303, [304, [305, [306, [307, [309, [310, [311, [312, [313, [314,[315, [316, [317, [2, '77', 2]]]]]]]]]]]]]]]]], [4, '\n', 2]]], [6, '', 3]]]]], [0, '', 4]]
>>>
>>> pprint(parse("def f():\n  print 77\n"))

file_input  -- S`257 -- 257
  stmt  -- S`266 -- 266
    compound_stmt  -- S`291 -- 291
      funcdef  -- S`261 -- 261
        NAME  -- T`1 -- 1     L`1
          def
        NAME  -- T`1 -- 1     L`1
          f
        parameters  -- S`262 -- 262
          LPAR  -- T`7 -- 7     L`1
            (
          RPAR  -- T`8 -- 8     L`1
            )
        COLON  -- T`11 -- 11     L`1
          :
        suite  -- S`299 -- 299
          NEWLINE  -- T`4 -- 4     L`1
            '\n'
          INDENT  -- T`5 -- 5     L`2

          stmt  -- S`266 -- 266
            simple_stmt  -- S`267 -- 267
              small_stmt  -- S`268 -- 268
                print_stmt  -- S`271 -- 271
                  NAME  -- T`1 -- 1     L`2
                    print
                  test  -- S`303 -- 303
                    or_test  -- S`304 -- 304
                      and_test  -- S`305 -- 305
                        not_test  -- S`306 -- 306
                          comparison  -- S`307 -- 307
                            expr  -- S`309 -- 309
                              xor_expr  -- S`310 -- 310
                                and_expr  -- S`311 -- 311
                                  shift_expr  -- S`312 -- 312
                                    arith_expr  -- S`313 -- 313
                                      term  -- S`314 -- 314
                                        factor  -- S`315 -- 315
                                          power  -- S`316 -- 316
                                            atom  -- S`317 -- 317
                                              NUMBER  -- T`2 -- 2     L`2
                                                77
              NEWLINE  -- T`4 -- 4     L`2
                '\n'
          DEDENT  -- T`6 -- 6     L`3
            ''
  ENDMARKER  -- T`0 -- 0     L`4
    ''




3.  Trail design fundamentals

In an initial, preparatory step we translate grammar rules into nondeterministic finite automata ( NFAs ). Those NFAs have a simple visualization as syntax diagrams.

Example:

The following grammar rule describes the argument list for a function call in Python 2.5
arglist ::= (argument ',')* (argument [','] | '*' test [',' '**' test] | '**' test)

Below is the corresponding syntax diagram:


( Image 1: generated with a syntax diagram generator ).


Each node in the syntax diagram corresponds either to a terminal or a nonterminal of the language grammar. For each of those terminals/nonterminals Trail uses a numerical encoding [2].

For the graph above we have the following encoding scheme:

Token/Symbol name          Python 2.5 numerical code
-----------------          -------------------------
arglist                    329
argument                   330
test                       303
','     ( COMMA )          12
'*'     ( STAR )           16
'**'    ( DOUBLESTAR )     36

We call the numerical codes node ids. A trace through the NFA corresponds to a valid trace through the diagram. This can always be represented by a sequence of symbols of the diagram. We choose None as an explicit symbol of trace termination.

Example:

Symbol sequence                     Corresponding node id sequence
---------------                     ------------------------------
argument ',' argument None          330 12 330 None
argument ',' '**' test None         330 12 36 303 None
'**' test None                      36 303 None

3.1 Labels

Different instances of COMMA, test or argument are mapped onto different locations in the syntax diagram. Just using node ids isn't sufficient to encode those different locations, so we additionally use an arbitrary indexing scheme. In the diagram we see three different instances of test at three different locations. We can simply enumerate them and create the tuples (303, 1), (303, 2), (303, 3). The index has no significance other than distinguishing those node ids from each other. The tuples are called node labels, or labels for short. A special label is used for None. None is the unique exit symbol and the corresponding label is (None, '-').
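A tiny sketch of this enumeration (illustrative only; the concrete index values Trail assigns carry no meaning beyond distinguishing occurrences):

    # Turn a list of node id occurrences into labels by simple enumeration.
    def make_labels(node_ids):
        labels = [(nid, i + 1) for i, nid in enumerate(node_ids)]
        labels.append((None, '-'))        # the unique exit label
        return labels

    # three occurrences of 'test' (303) become three distinct labels
    print(make_labels([303, 303, 303]))   # [(303, 1), (303, 2), (303, 3), (None, '-')]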

3.2 Transitions

Each transition in the diagram is encoded in terms of labels. Each label can have one or more follow labels. The only exception is the exit label (None, '-'), which is the only one that has no follow labels and must be present at least once. A complete translation of the syntax diagram into label transitions is shown below.

"arglist: ( argument ',' ) * ( argument[ ',' ] | '*' test[ ',' '**' test ] | '**' test )",
       (329, 0),
       {(12, 2): [(16, 5),
                  (330, 3),
                  (330, 1),
                  (36, 10)],
        (12, 4): [(None, '-')],
        (12, 7): [(36, 8)],
        (16, 5): [(303, 6)],
        (36, 8): [(303, 9)],
        (36, 10): [(303, 11)],
        (303, 6): [(12, 7), (None, '-')],
        (303, 9): [(None, '-')],
        (303, 11): [(None, '-')],
        (329, 0): [(16, 5),
                        (330, 3),
                        (330, 1),
                        (36, 10)],
        (330, 1): [(12, 2)],
        (330, 3): [(12, 4), (None, '-')]}],

The label (329, 0) is the start label. Each trace through the diagram can be expressed as a sequence of labels beginning with (329, 0) and ending with (None, '-').

Trail translates each grammar rule into such an NFA label transition table.

3.3 Extending Labels

We will later introduce expansion techniques for NFA label transition tables. This may eventually lead to embedding one NFA transition table into another one. In those cases the distinctions we just made between node ids within one NFA are not sufficient. Instead we have to make each node id unique within the complete set of NFAs. We do this by extending the labels: we simply append the node id of the NFA's start label to each of its labels. In the case of arglist the node id of the start symbol was 329 and we create the new labels

              (329, 0, 329), (330, 1, 329), (330, 3, 329), (36, 10, 329), ...

from the old ones.

Note that (None, '-') loses its uniqueness and gets extended as well. The exit symbol of one NFA doesn't correspond to the exit symbol of another one, so it makes sense to write (None, '-', 329).

The new NFA transition table looks like this:

"arglist: ( argument ',' ) * ( argument[ ',' ] | '*' test[ ',' '**' test ] | '**' test )",
       (329, 0, 329),
       {(12, 2, 329): [(16, 5, 329),
                       (330, 3, 329),
                       (330, 1, 329),
                       (36, 10, 329)],
        (12, 4, 329): [(None, '-', 329)],
        (12, 7, 329): [(36, 8, 329)],
        (16, 5, 329): [(303, 6, 329)],
        (36, 8, 329): [(303, 9, 329)],
        (36, 10, 329): [(303, 11, 329)],
        (303, 6, 329): [(12, 7, 329), (None, '-', 329)],
        (303, 9, 329): [(None, '-', 329)],
        (303, 11, 329): [(None, '-', 329)],
        (329, 0, 329): [(16, 5, 329),
                        (330, 3, 329),
                        (330, 1, 329),
                        (36, 10, 329)],
        (330, 1, 329): [(12, 2, 329)],
        (330, 3, 329): [(12, 4, 329), (None, '-', 329)]}],
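
The extension step itself is purely mechanical. The following minimal sketch (a hypothetical helper, not the actual Trail code) appends the node id of the rule's start label to every label, including the exit label:

    def extend_nfa(start, transitions):
        rule_nid = start[0]
        ext = lambda label: label + (rule_nid,)
        new_transitions = dict((ext(key), [ext(f) for f in follows])
                               for key, follows in transitions.items())
        return ext(start), new_transitions

    # applied to the unextended arglist table this turns (329, 0) into (329, 0, 329),
    # (None, '-') into (None, '-', 329) and so on.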



4. Trace based parsers

4.1  A simple parse tree checker using label traces

We can now generalize simple node id sequences, which represent sequences of terminals and nonterminals, to label traces:

Symbol sequence                Corresponding label trace
---------------                -------------------------
argument None                  (329,0,329) (330,3,329) (None,'-',329)
argument ',' None              (329,0,329) (330,3,329) (12,4,329) (None,'-',329)
argument ',' argument None     (329,0,329) (330,1,329) (12,2,329) (330,3,329) (None,'-',329)

When projecting the label traces onto node id sequences we get:

Label trace                                                        Node id projection
-----------                                                        ------------------
(329,0,329) (330,3,329) (None,'-',329)                             329 330 None
(329,0,329) (330,3,329) (12,4,329) (None,'-',329)                  329 330 12 None
(329,0,329) (330,1,329) (12,2,329) (330,3,329) (None,'-',329)      329 330 12 330 None

This gives rise to the following simple idea for checking the correctness of a parse tree against some grammar G. A parse tree is always a recursively defined list of nodes. A node N representing a non-terminal has the structure N = [n, N1, N2, ..., Nk] where n is the node id of N and the Ni are its child nodes. In order to check the structure it suffices to check N° = [n, n1, ..., nk], the node id sequence of N, and each Ni recursively, with terminals at the bottom of the recursion.

How do we check N°? We start by selecting the NFA corresponding to n and the start label S = (n, 0, n).
Let T0 = {n01, n02, ..., n0r} be the label transitions of S. We keep the set Sel0 of projections of those labels onto node ids. If n1 is not in Sel0 we know that N has a corrupted structure in G. Otherwise we select the subset P01 of all labels in T0 projecting onto n1. From the labels in P01 we create the set of all label transitions T1 = {n11, n12, ..., n1s}. We repeat the procedure with n2 and T1. The checking process is finished when we either reach node id nk or Ti is {(None, '-', n)} for some i.
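
A minimal sketch of this checker, assuming the extended-label transition tables of section 3.3 (the names check_nid_sequence and nfas are hypothetical, not the actual Trail API):

    def check_nid_sequence(nfas, nid_seq):
        """nid_seq = [n, n1, ..., nk]; nfas maps a rule nid to (start_label, transitions)."""
        n, children = nid_seq[0], nid_seq[1:]
        start, transitions = nfas[n]
        labels = transitions[start]                      # T0: follow labels of the start label
        for child_nid in children:
            selection = set(lbl[0] for lbl in labels)    # projection onto node ids
            if child_nid not in selection:
                return False                             # corrupted structure
            matching = [lbl for lbl in labels if lbl[0] == child_nid]
            labels = [f for lbl in matching for f in transitions[lbl]]
        return any(lbl[0] is None for lbl in labels)     # exit label reachable after the last child

    # e.g. check_nid_sequence(nfas, [329, 330, 12, 330]) verifies the sequence 329 330 12 330 None
    # against the arglist NFA; each non-terminal child is then checked recursively in the same way.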

4.2 A first trace based parser

A simple trace based parser is just slightly more complicated than the parse tree checker we described in the last subsection. We start with a token sequence tokenstream = [t1, ..., tn], the result of lexical analysis, and an NFA N0 representing a top level rule of the grammar.

We apply the following parsing algorithm:

    def parse(tokstream, nid, token):

        nfa_trans  = NFATracer(nid)                     # select an NFATracer object for iterating through
                                                        # label transition sets of NFA[nid]

        selection  = nfa_trans.get_selection(nid)       # get the first selection of nids according to the
                                                        # label transitions of (nid, 0, nid)

        parse_tree = [nid]                              # store nid as the top-level node in parse_tree

        while token:                                    # as long as tokens are available in the tokenstream...
            for s in selection:                         # iterate through the node ids s of the selection
                if s is not None:
                    if is_token(s):                     # if s is the node id of a terminal symbol and
                        if token_type(token) == s:      # it corresponds with the node id of the current token
                            tokstream.shift_right()     # increment the read position of the tokenstream
                            parse_tree.append(token)    # and add the token to the parse tree. We get
                            break                       # parse_tree = [nid, ..., token] and terminate
                                                        # the for loop.

                    elif token_type(token) in first_set(s):   # if s is the node id of a non-terminal and
                                                              # the node id of the current token is in the
                                                              # first_set of s we call parse recursively
                        P = parse(tokstream, s, token)        # with s. The result is a parse tree P
                        if P:                                 # if P is not None add the parse tree to
                            parse_tree.append(P)              # parse_tree. We get parse_tree = [nid, ..., P]
                            break
            else:
                if None in selection:                   # there has not been a successful non-None selection
                    return parse_tree                   # ... but that's o.k. if None is selectable
                else:
                    raise SyntaxError                   # otherwise we failed to parse the tokenstream

            selection = nfa_trans.get_selection(s)      # everything is o.k. now and we can demand the next
                                                        # selection in the current NFA from s
            token = tokstream.get_token()               # keep the next token. If there is no token left we
        return parse_tree                               # are done

Note that get_selection(nid) works as described for the syntax checker. There is always a set L of labels stored by the NFATracer that represents the internal state of the iteration through the NFA label transitions. When get_selection(nid) is called the following steps are performed:
  1. determine the set S of all labels label in L with label[0] == nid.
  2. get the transitions of each label in S and merge them into a new label set L_next.
  3. build the new selection = { label[0] for label in L_next } and return it.
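
A minimal sketch of a tracer following these three steps (an assumed simplification; the real NFATracer in EasyExtend/trail differs in detail):

    class SimpleTracer:
        def __init__(self, transitions, start_label):
            self.transitions = transitions
            self.state = [start_label]                                            # the label set L

        def get_selection(self, nid):
            selected = [lbl for lbl in self.state if lbl[0] == nid]               # step 1
            self.state = [f for lbl in selected for f in self.transitions[lbl]]   # step 2: L_next
            return set(lbl[0] for lbl in self.state)                              # step 3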

4.3  NFA expansions

In the algorithm described in section 4.2 we made a significant simplification. Once we have found a non-terminal containing the current token in its first_set we call parse() again, and if we can't parse the tokenstream successfully we raise an error. It is not foreseen that another s in the selection could be successfully applied. Indeed we demand that the first_sets of the node ids in the selection are pairwise disjoint. Seemingly this gives us little more than just an LL(1) parser. But this impression is wrong.

That's because rules of the kind

S: A B C ... i | A B C ... j | A B C ... k | ...

can be handled without problems. Although each A has a unique label and defines its own trace, all those labels are projected onto the node id of A.

Suppose now we modify S:

S1: A1 B C ... i | A B C ... j | A B C ... k | ...

with A1 != A. Suppose the first_sets of A and A1 intersect. We can substitute A1 and A by the productions defined on their right-hand sides. After finitely many recursive expansions we reach one of two possible states:
  1. The expansion process runs into a cycle and can't be terminated.
  2. We achieve the rule form of S with disjoint selections.
If we run into case 1 a warning is flagged. This warning shall be understood as a rejection of the grammar: in those cases Trail shall not be used and the grammar shall be modified. This generally happens with left recursive rules, which can be somewhat hidden. We will give an example at the end of the section.

The other more pressing problem is bookkeeping. We can't simply replace A by its RHS in S because we need the node id of A in the parse tree.  So we need to keep track of A during expansion.

4.3.1  Tracking of nodes during expansion

Keeping track of A when it is expanded in S is quite cheap. A and S are represented by their NFAs; a reference to A is contained in each label
(k, i, A) of A and a reference to S in each label (l, j, S) of S. Since the labels are global data of a grammar G they can never be confused. What's left is the treatment of the exit label (None, '-', A) of A on embedding of A in S.

Embedding rules:
  1. Let (A, k, S) be the original label; we transform it into a new label (A, '.', k, S). The dot argument indicates that this label shall be skipped when selections are built. Instead, during selection building it is used by Trail to forward to its follow transition set. If the follow transitions contain another dotted label the forwarding continues.
  2. The dotted label ( A, '.', k, S ) replaces (None, '-', A ) in the transition set of A.
With those two rules we have all the data required to keep track of A as well as a consistent embedding of A. The description of rule 1 already mentions forwarding, and forwarding makes parsing somewhat more complicated because we cannot just build the parse tree within the parse() function like we did above. That's because the dotted labels which are skipped contain the necessary embedding information that is not available in the nid selections. The solution we apply is to externalize the construction of the parse tree completely. In the above algorithm the parse tree construction was expressed by simple list definitions and extensions. We replace them by calls to so called NFAMultiTrace objects.
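
The two embedding rules can be sketched as follows (a hypothetical helper; the real expansion code does more bookkeeping):

    def embed(label_of_A_in_S, transitions_of_A, A):
        sym, k, S = label_of_A_in_S                      # the original label (A, k, S)
        dotted = (sym, '.', k, S)                        # rule 1: the dotted label
        exit_label = (None, '-', A)
        embedded = dict((key, [dotted if f == exit_label else f for f in follows])
                        for key, follows in transitions_of_A.items())    # rule 2
        return dotted, embedded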

4.3.2  NFAMultiTrace objects

If anything in Trail's design justifies label trace based parsing it is the use of NFAMultiTrace objects. An NFAMultiTrace object, or MTrace for short, is a tree that grows in intervals.

In the graphic below the growth principle is illustrated. Suppose A was selected and the algorithm calls get_selection(A). The immediate follow transitions are { (B, '.', 1, D), (B, '.', 1, C) }. The dotted labels need forwarding. The label (B, '.', 1, D) is forwarded to { (F, 6, D) } while (B, '.', 1, C) is forwarded to a follow transition { (C, '.', 3, D) } which needs another forwarding that yields { (E1, 4, D), (E2, 5, D) }. The MTrace can now grow at any of the labels { (F, 6, D), (E1, 4, D), (E2, 5, D) }. The provided nid selection is (F, E) and the algorithm selects F. Once F is selected the MTrace can only grow at (F, 6, D). The available transition becomes { (G, 7, D) }.



Suppose we reached the node (K, 10, 9) in our very last transition. With (K,10, 9) we move along the parental chain back to (A, 0, B) and get
( (K, 10, 9, S), (D, '.', 8, S), (G, 7, D), (F, 6, D), (B, '.', 1, D), (A, 0, B) ). This flat trace can now be read in reverse order and converted into a sequence of node ids which yields the final result (B A D F G S K).
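
The final read-off can be sketched with a simple parent-pointer structure (assumed here for illustration; the actual NFAMultiTrace object is more involved):

    class TraceNode:
        def __init__(self, label, parent=None):
            self.label = label
            self.parent = parent

    def flat_trace(node):
        """Walk the parental chain and return the labels bottom-up."""
        chain = []
        while node is not None:
            chain.append(node.label)
            node = node.parent
        chain.reverse()
        return chain       # the dotted labels in the chain carry the embedding information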

4.3.3  Drawbacks

Expansion can blow up the size of a single NFA notably. Below, the number of key labels of the transition table is listed for the nodes that get expanded.
        
Node                Nbr of Key Labels
----                -----------------
unit                207
NUMBER              123
OPERATOR_DEF        86
Imagnumber          62
Floatnumber         40
STRING              26
Intnumber           15

The problem with the current naïve implementation of Trail is that each of those key labels can be a branch point in an NFAMultiTrace object. None of these NFAMultiTraces grows strongly, but unit gets hit quite often. Also the selections are reconstructed each time. So we need caching and access optimization techniques to reduce the complexity of Trail.

4.4  Features and Limitations of Trail

Notation. When we want to embed a rule A : RHS into another one S : A B ... we put A in curly braces, i.e. S : {A : RHS} B ...

4.4.1  Dangling Else

The dangling-else rule is famous for its ambiguity:

Stmt: 'if' Expr 'then' Stmt | 'if' Expr 'then' Stmt 'else' Stmt

Suppose you expand the second branch of Stmt in the first branch...

S: 'if' Expr 'then' {Stmt: 'if' Expr 'then' Stmt 'else' Stmt }

and compare this with the expansion of the first branch of Stmt into the first Stmt of the second branch

S: 'if' Expr 'then' {Stmt: 'if' Expr 'then' Stmt } 'else' Stmt

How does Trail deal with it? Trail will always prefer the first version. Only when the Stmt after 'then' has been fully parsed will Trail examine whether it has to proceed ( move into the 'else' branch ) or terminate.

4.4.2  Case study : Django templates

Django's template language implements what I call extended braces. The following example is taken from Django's template documentation:

{% for story in story_list %}
<h2>
<a href="{{ story.get_absolute_url }}">
{{ story.headline|upper }}
</a>
</h2>
<p>{{ story.tease|truncatewords:"100" }}</p>
{% endfor %}

We attempt to encode this into the following rule...

for_template_stmt: '{%' 'for' expr 'in' expr '%}' stmt* '{%' 'endfor' '%}'

...and add two other rule descriptions:

stmt: if_template_stmt | for_template_stmt | html_stmt
if_template_stmt: '{%' 'if' test '%}' stmt* '{%' 'endif' '%}'

Can you see the issue? When stmt* is entered Trail can either continue parsing along '{%' 'endif' '%}' or it starts to parse another stmt that might start with '{%' as well. Expansion leads to disambiguation here but unfortunately we run into a cycle because we expand if_template_stmt within itself.

5. Trace based lexers

Prior to version 3.0 EasyExtend used a regular expression based tokenization process. Actually EE's lexical analysis was just a somewhat advanced, modular version of the stdlib's tokenizer.py algorithm. The token defining parts as well as the regular expression building were split into different modules. Even in EE 3.0 the basic tokenization algorithm ( generate_tokens ) has changed only slightly. It is still needed for the context sensitive parts of the Python language: dealing with significant whitespace, line continuations, comments and all the other stuff that shall be ignored. However, in EE 3.0 regular expression matching has been dropped and Trail is used instead.

5.1 Dropping regular expressions

Building a regular expression for pattern matching is conceptually simple. Just define a set of regular expressions {p1, p2, ..., pn} - one for each token to be matched. Then concatenate those to build a single big pattern p = p1| p2| ... |pn. Now you can loop over a string, match string sections against p and do some postprocessing. The drawback of this approach using Python's regular expression engine is that the result is not invariant under permutations of the patterns pi. In particular a simple transposition p' = p2| p1| ... |pn can change the output dramatically.

Example: suppose you have defined a pattern that recognizes floating point numbers

           FLOAT:  DEZIMAL '.' [DEZIMAL]

and another pattern that defines IPv4 addresses

           IPv4:  DEZIMAL '.' DEZIMAL '.' DEZIMAL '.' DEZIMAL

When you encode those patterns as regular expressions it matters whether we execute IPv4|FLOAT or FLOAT|IPv4. In the first case IP addresses will be matched, while in the second case an IP address will be split into two floats and will never be matched.
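
A small demonstration of this order dependence with Python's re module (DEZIMAL is approximated here by \d+):

    import re

    FLOAT = r'\d+\.\d*'                        # DEZIMAL '.' [DEZIMAL]
    IPV4  = r'\d+\.\d+\.\d+\.\d+'              # DEZIMAL '.' DEZIMAL '.' DEZIMAL '.' DEZIMAL

    s = "10.0.0.1"
    print(re.findall(IPV4 + '|' + FLOAT, s))   # ['10.0.0.1']
    print(re.findall(FLOAT + '|' + IPV4, s))   # ['10.0', '0.1']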

5.2 Introduction to Litsets and Pseudo-Token

When we treat the lexer as just another parser the natural question about the terminals of the lexer arises. What are the token of the lexer and how do they fit into the parser architecture, including NFA expansions?

5.2.1  Litsets

At the bottom level we have character sets and encodings. For the sake of simplicity the first version of the Trail lexer supports only ASCII characters for encoding identifiers ( following Python 2.5 and not Python 3.0 ). We use them to encode e.g. decimal or hexadecimal digit characters, but also whitespace etc.
        
Litset                  Definition
------                  ----------
LS_A_LINE_END           set(map(chr, [10, 13]))
LS_A_CHAR               set(string.ascii_letters + "_")
LS_A_WHITE              set(map(chr, [9, 10, 11, 12, 13, 32]))
LS_A_HEX_DIGIT          set(string.hexdigits)
LS_A_OCT_DIGIT          set(string.octdigits)
LS_A_NON_NULL_DIGIT     set('123456789')
LS_A_DIGIT              set(string.digits)
LS_A_BACKSLASH          set(['\\'])
LS_ANY                  set()

Other characters that are used occur explicitly in the Token file. They play the same role as keywords in the Grammar file.

5.2.2  Pseudo-Token

Litsets are mapped one-to-one onto a class of so called pseudo-token, e.g. LS_A_WHITE onto A_WHITE, LS_ANY onto ANY etc. There is no such thing as an initial token stream created by a LexerLexer or Sublexer. So pseudo-token are token as-if. In lexical analysis a pseudo-token simply matches a character when the character is in the corresponding litset. An exception is made with ANY. ANY is characterized by the empty set, but that's only because we can't express the notion of a set of all characters in a comprehensive way. ANY is the only pseudo-token that is handled explicitly in the parsing process. We will discuss rules for ANY later.

There is also a second class of pseudo-token that have no matching abilities and no corresponding character sets. Some of them are just present for compliance with Python's grammar, but others are created explicitly in a post lexing step. These are pseudo-token like T_INDENT, T_DEDENT, T_ENDMARKER or T_NEWLINE. As a general rule A_ pseudo-token are used to match characters and T_ pseudo-token occur in the token streams accepted by the parser.
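
A minimal sketch of how an A_ pseudo-token matches a character (illustrative only; the actual litset data lives in the lexer's NFA tables):

    import string

    LITSETS = {'A_CHAR':  set(string.ascii_letters + "_"),
               'A_DIGIT': set(string.digits),
               'A_WHITE': set(map(chr, [9, 10, 11, 12, 13, 32]))}

    def matches(pseudo_token, char):
        return char in LITSETS.get(pseudo_token, set())

    print(matches('A_DIGIT', '7'))    # True
    print(matches('A_CHAR', '7'))     # False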

5.2.2.1  pseudo-reachability

The implementation of NFAParser contains the methods is_token and is_reachable. The first one simply checks whether a symbol is a terminal and the latter checks if a symbol s is in parse_nfa.reachable[u] for some other symbol u. These functions are substituted in the lexer by is_pseudo_token and is_pseudo_reachable. The first one checks whether a symbol s is a key in the lex_nfa.pseudo_token dictionary whereas the latter has the following implementation:

    def is_pseudo_reachable(self, c, nid):
        pseudo_token = self.pseudo_reachable.get(nid)
        if pseudo_token is None:
            reach = self.lex_nfa.reachables.get(nid, [])
            pseudo_token = [t for t in reach if self.is_pseudotoken(t)]
            self.pseudo_reachable[nid] = pseudo_token
        for t in pseudo_token:
            if c in self.lex_nfa.pseudo_token[t]:
                return True
        return False

The pseudo_reachable dictionary contains key-value pairs with nids as keys and pseudo-token that are reachable from those nids as values. The function iterates through the possible pseudo-token of the nid and checks if the character parameter c is in the litset of one of those pseudo-token. If so it returns True otherwise False.

5.2.3  Pseudo-Token expansion

What happens in the previous algorithm when a character c is in the litsets of two or more pseudo-token? First come, first served. But this situation shall not happen, because it indicates that Trail has to select between different alternative branches which are all promising - something Trail isn't built for ( no backtracking, no more than one token of lookahead ). So we have to make the litsets of pseudo-token disjoint wherever a collision happens. This doesn't imply that the litsets in lex_nfa.pseudo_token are all disjoint.

Example: in the rule expression A_CHAR X | A_HEX_DIGIT Y we have a collision between the alphabetic characters used in the A_CHAR litset and the same characters in the A_HEX_DIGIT litset. So we create three sets: S1 = A_CHAR /\ A_HEX_DIGIT, S2 = A_CHAR - A_HEX_DIGIT and
S3 = A_HEX_DIGIT - A_CHAR. Now we can replace the above rule expression by

           S1 X | S1 Y | S2 X | S3 Y

This disambiguation scheme suggests the following form of pseudo-token expansion:

          {A_CHAR:S1} X | {A_HEX_DIGIT:S1} Y | {A_CHAR:S2} X | {A_HEX_DIGIT:S3} Y
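
The set arithmetic behind this expansion is plain Python (illustrative sketch):

    import string

    A_CHAR      = set(string.ascii_letters + "_")
    A_HEX_DIGIT = set(string.hexdigits)

    S1 = A_CHAR & A_HEX_DIGIT       # the colliding characters a-f, A-F
    S2 = A_CHAR - A_HEX_DIGIT       # remaining letters and '_'
    S3 = A_HEX_DIGIT - A_CHAR       # the digits 0-9

    assert not (S1 & S2) and not (S1 & S3) and not (S2 & S3)   # pairwise disjoint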

5.2.4  Rules for ANY

ANY intersects with everything, and we have to take additional care with symbols to the right of sequences of type ANY* or ANY+. When matching symbols ANY always has the weakest position among all characters and pseudo-token. So we state the following rule:

       Rule for matching with ANY:   let   S: ANY | B  be a rule and c a character. If B can match c then B will be selected as a matching rule.

Notice that this also includes None as a matching rule. As a consequence a rule S: ANY* would not match anything because the rule can always be terminated.
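
The selection rule can be sketched as follows (hypothetical helper; can_match stands for the litset membership test):

    def select(alternatives, char, can_match):
        for alt in alternatives:
            if alt not in ('ANY', None) and can_match(alt, char):
                return alt            # any ordinary alternative that matches the character wins
        if None in alternatives:
            return None               # rule termination also takes precedence over ANY
        if 'ANY' in alternatives:
            return 'ANY'              # only now does ANY consume the character
        raise SyntaxError("no matching alternative for %r" % char)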

5.3 Can your parser generator do this?





This rule is used to parse all variants of triple quoted strings in Python, in which double quotes " may occur inside the string:

>>> """abc"""
'abc'
>>> """abc"def"""
'abc"def'
>>> """abc"def"geh"""
'abc"def"geh'
>>> """abc"def""geh"i"""
'abc"def""geh"i'



[1] Recently I found a nice article written by Russ Cox about matching algorithms for regular expressions which refers to an implementation method introduced by Ken Thompson and published in 1968.  The basic idea is very similar to that used by Trail.

[2] When Trail parses Python these numerical encodings correspond to those being found in symbol.py and token.py of the standard library.