The reverb module

Getting Started

All of the following examples assume that you’ve done a:

from reverb import *

Since this will put unknown junk in your namespace, in a real program you’d probably import only the names you actually need, or use ‘import reverb’ and prefix all the names shown here with ‘reverb.’.

Regular expressions

A regular expression or Regexp specifies a set of strings that match it.

Text

The most basic form of a Regexp matches only one particular string

Text('spam')    # Matches 'spam', and nothing else

No characters in the string have any special meaning, except as defined by Python’s syntax for string literals (for example, if you wanted to match a backslash, you’d need to double it, or use a raw string).

In most cases, you can simply specify a string: the Text() function is only strictly needed when both sides of an operator would otherwise be plain strings (since Python has no way of redefining operators in that case: at least one side has to be a class instance, such as returned by Text()).
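To illustrate what "no special meaning" implies, here is a minimal sketch that uses the standard re module directly instead of reverb. It assumes that Text() treats its argument much as re.escape() would; that is an assumption about the implementation, not something stated here.

```python
import re

# Assumption: Text(s) matches s literally, roughly like re.escape(s).
literal = 'a.b*c'
pattern = re.compile(re.escape(literal))

# The literal string matches itself...
matches_itself = pattern.fullmatch('a.b*c') is not None

# ...but a string that would match if '.' and '*' were operators does not.
matches_other = pattern.fullmatch('aXbbc') is not None
```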

In order to make the following examples more meaningful, I’ll define one additional Regexp form here

Digit   # Matches any one character from '0' to '9'

Regexps can be combined in various ways to form more complicated expressions.

Concatenation

A + B means a match of A immediately followed by a match of B

Text('spam') + Text('eggs')   # Matches 'spameggs'

Note that in this example, you could leave off either of the calls to Text(): plain strings automatically have Text() applied to them when used in an expression with any Regexp form. In fact, you could leave off both calls to Text(), since Python’s definition of ‘+’ on strings happens to match the definition for Regexps.

You can also express concatenation as A & B

'spam' & Digit  # Matches 'spam0' thru 'spam9'

Alternatives

A | B means a match of either A or B

Text('spam') | Text('eggs')  # Matches 'spam' or 'eggs'

Again, you could leave off either call to Text(). However, in this case you couldn’t omit both calls, as there’s no way to assign a meaning to 'spam' | 'eggs' in Python.

‘|’ has lower precedence than concatenation; you can use parentheses as usual to force a different order of evaluation:

'spam' | Digit + Digit  # Matches 'spam' or a 2-digit number
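The effect of precedence can be sketched with raw re patterns. The two patterns below are only assumed equivalents of the reverb expressions named in the comments; the exact pattern text reverb generates may differ.

```python
import re

# Assumed raw equivalent of:  'spam' | Digit + Digit
low_precedence = re.compile(r'spam|\d\d')

# Assumed raw equivalent of:  ('spam' | Digit) + Digit
grouped = re.compile(r'(?:spam|\d)\d')

samples = ('spam', '42', 'spam4')
low_results = [bool(low_precedence.fullmatch(s)) for s in samples]
grouped_results = [bool(grouped.fullmatch(s)) for s in samples]
```

With the parentheses, 'spam' on its own no longer matches, because a digit is now required after either alternative.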

Negation

-A matches anything except what matches A, but not all Regexp forms support this

-Digit   # Matches any one character that's not '0' thru '9'

The Regexp forms which support negation will indicate so in their description.

Simple text matches are NOT negatable: this could be implemented, but would result in a very inefficient Regexp (in most cases, you could use -FollowedBy(str), which will be described later). As you might expect, A - B means A + -B

'spam' - Digit  # Matches 'spamX', etc., but not 'spam1' or 'spam'
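In terms of the underlying re syntax, this example probably corresponds to a literal followed by a negated character class. The pattern below is an assumed equivalent, not necessarily what reverb emits.

```python
import re

# Assumed raw equivalent of:  'spam' - Digit   (that is, 'spam' + -Digit)
spam_nondigit = re.compile(r'spam[^0-9]')

results = {s: bool(spam_nondigit.fullmatch(s))
           for s in ('spamX', 'spam1', 'spam')}
```

Note that plain 'spam' fails: the negated form still has to consume one character.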

Lookahead and Lookbehind

FollowedBy(A) matches at any point where the following text would match A; however, that text doesn’t become part of the matched range:

'spam' + FollowedBy(' ')    # Matches the first 'spam' in
                            # 'spam spam', but NOT the second

Lookaheads can be negated:

'spam' - FollowedBy(' ')    # Matches only the last 'spam' in
                            # 'spam spam spam'
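Both lookahead examples can be checked against the re module's own lookahead syntax. This sketch assumes FollowedBy(A) corresponds to (?=...) and its negation to (?!...), which is how re spells these constructs.

```python
import re

followed = re.compile(r'spam(?= )')      # assumed: 'spam' + FollowedBy(' ')
not_followed = re.compile(r'spam(?! )')  # assumed: 'spam' - FollowedBy(' ')

text = 'spam spam spam'
followed_starts = [m.start() for m in followed.finditer(text)]
not_followed_starts = [m.start() for m in not_followed.finditer(text)]

# The lookahead text is not part of the match itself.
first_match = followed.search(text).group()
```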

Character Sets

A match for any character in a range can be written as

Char(A).to(B)

A and B must evaluate to 1-character strings. This form can be negated, matching any character not in the range. More complicated character sets can be built up with

Set(<items>)

where <items> is a comma-separated list of any of the following elements:

  • A string of 1 or more characters includes those characters in the set.
  • An integer includes the character with that code (as determined by chr()).
  • Certain basic Regexp forms that match a single character, such as Digit (described earlier), can be included. Eligible forms will mention this in their descriptions.
  • A Char().to() expression, as described above, includes the characters in the specified range in the set.

Sets can be negated (but not the individual items making up the set), resulting in a set that will match any character other than the ones specified.

Example

Set(Digit, Char('A').to('F'), Char('a').to('f'))  # Matches any hexadecimal digit

Hint: the standard string module contains several predefined strings, such as lowercase and octdigits, that you may find useful as the basis for a set. The example above could be more compactly written as Set(string.hexdigits).
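As a quick check that the two spellings describe the same set, here is a sketch using raw re character classes. The patterns are assumed equivalents of the Set() expressions, written by hand.

```python
import re
import string

# Assumed equivalent of Set(Digit, Char('A').to('F'), Char('a').to('f'))
hex_by_ranges = re.compile(r'[0-9A-Fa-f]')

# Assumed equivalent of Set(string.hexdigits)
hex_by_string = re.compile('[' + re.escape(string.hexdigits) + ']')

# Assumed equivalent of the negated set
non_hex = re.compile(r'[^0-9A-Fa-f]')

agree = all(
    bool(hex_by_ranges.fullmatch(chr(c))) == bool(hex_by_string.fullmatch(chr(c)))
    for c in range(128)
)
```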

Warning

The underlying Regexp implementation specially interprets the characters ‘-’, ‘^’, ‘]’, and ‘\’ within character sets. The current version of reverb does NOT attempt to work around these special interpretations, and may therefore produce an invalid Regexp if any of these characters are used in a set. Here are some work-arounds:

'-' or ']': must be the first character in the set. If you need both, you lose.

'^': must not be the first character in the set.

'\': specify this as R'\\' (a raw string containing two backslashes).
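The positional rules above are those of the re module itself, so they can be demonstrated with raw patterns. This sketch shows re's behaviour only; whether reverb's Set() passes an escaped character through unchanged is not stated here, so the last pattern is not a suggested reverb work-around.

```python
import re

bracket_first = re.compile(r'[]ab]')  # ']' placed first is taken literally
hyphen_first = re.compile(r'[-ab]')   # '-' placed first is taken literally
caret_later = re.compile(r'[ab^]')    # '^' anywhere but first is literal
backslash = re.compile(r'[\\ab]')     # doubled backslash in a raw string

# In a hand-written pattern, escaping handles '-' and ']' together.
both = re.compile('[' + re.escape('-]') + 'ab]')
```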

Repeats

There are several forms for specifying that a Regexp can or must match multiple times. The most general is

Repeated(expr, min, max = None)

which will match the expr at least min times (which may be 0), up to max times (unlimited, if max is None or omitted). You may specify the min and max parameters either positionally, or as keyword arguments to make it clearer what they’re for. The other repeat forms are actually special cases of Repeated():

Optional(expr, max = None)  # 0 or more; Repeated(expr, 0, max)
Required(expr, max = None)  # 1 or more; Repeated(expr, 1, max)
Maybe(expr)                 # 0 or 1; Repeated(expr, 0, 1)
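For readers who know the re quantifiers, the repeat forms presumably map onto them as sketched below. The reverb expressions in the comments are assumed equivalents of each raw pattern.

```python
import re

at_least_two = re.compile(r'\d{2,}')  # assumed: Repeated(Digit, 2)
two_to_four = re.compile(r'\d{2,4}')  # assumed: Repeated(Digit, min=2, max=4)
zero_or_more = re.compile(r'\d*')     # assumed: Optional(Digit)
one_or_more = re.compile(r'\d+')      # assumed: Required(Digit)
zero_or_one = re.compile(r'\d?')      # assumed: Maybe(Digit)
```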

Repeat forms normally use a “greedy” strategy: they initially match as many copies of the Regexp as possible, up to the specified maximum. If this results in the rest of the overall Regexp failing to match, they will try matches of fewer and fewer copies (down to the specified minimum number), in an attempt to make the complete Regexp match successfully.

Sometimes this produces undesired results, and for those cases you can request a “non-greedy” strategy by adding .nongreedy or .minimal to the repeat form. A non-greedy repeat will initially match the specified minimum number of copies of the Regexp, but will try matching additional copies (up to max) if needed to make the complete Regexp match. An example will follow, but first another basic Regexp form needed to understand it

Any   # Match any one character (exact details given later)

Given the string ‘[ab] [cd] [ef]’:

'[' + Optional(Any) + ']'           # Matches the entire string!  Oops...
'[' + Optional(Any).nongreedy + ']' # Matches '[ab]'
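The same contrast can be reproduced with the re module's greedy and non-greedy quantifiers. The raw patterns below are assumed equivalents of the two reverb expressions.

```python
import re

text = '[ab] [cd] [ef]'

# Greedy: '.*' grabs as much as it can, then backs off to the last ']'.
greedy = re.match(r'\[.*\]', text).group()

# Non-greedy: '.*?' grabs as little as it can, stopping at the first ']'.
nongreedy = re.match(r'\[.*?\]', text).group()
```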

Grouping

It is possible to assign a label to a portion of a Regexp for later reference. This can be used to extract multiple items of data from a string at one time (rather than using several separate Regexp matches), for fancy search & replace operations, and for verifying that the same text appears at more than one place within the string

Group(expr, name = None)    # Match expr, and label the matched text

If a name is specified, it must be a string following the normal rules for a Python identifier: a letter or underscore followed by zero or more letters, digits, or underscores. Even if no name is given, the Group can be referenced by its position, starting with 1 for the leftmost Group() in the complete Regexp.
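Once compiled, groups are read back through the ordinary re match object. A minimal sketch, assuming Group(expr, name) corresponds to re's (?P<name>...) syntax and an unnamed Group(expr) to plain parentheses:

```python
import re

# Assumed raw equivalent of:
#   Group(Required(Digit), 'year') + '-' + Group(Required(Digit))
date = re.compile(r'(?P<year>\d+)-(\d+)')

m = date.match('2024-07')
year_by_name = m.group('year')   # named groups can be read by name...
year_by_position = m.group(1)    # ...or by position, counting from 1
month_by_position = m.group(2)
```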

Backmatching

To the right of a labelled group, it is possible to specify a match of the exact text that the group matched (which might not be the only text that the group’s Regexp was capable of matching)

Match(name) # Match whatever the preceding Group(expr, name) did

Example

The following will match any of Python’s formats for string literals, single or triple quoted, except for handling of backslash escapes:

delim = Text('"') | Text("'") | Text('"" "') | Text("'''")
# Note: the third Text() above shouldn't have a space in it, but I had
# to split it up to keep from prematurely ending the module docstring.
Group(delim, 'q') + Optional(Any).nongreedy + Match('q')

Note that the backmatch requires that the starting and ending quote marks be exactly the same: a Regexp like delim + Optional(Any) + delim would match strings that start with a single quote and end with a double quote, for example.
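The difference between a backmatch and simply repeating the delimiter can be sketched with re's named backreference syntax. This simplified version handles only single-character quotes, and the patterns are assumed equivalents rather than reverb's actual output.

```python
import re

# Backmatch: the closing quote must be the same character as the opening one.
quoted = re.compile(r'''(?P<q>"|')(.*?)(?P=q)''')

# No backmatch: any quote character may close the string.
naive = re.compile(r'''("|')(.*?)("|')''')

sample = '"it\'s"'   # the text  "it's"
mixed = '"abc\''     # the text  "abc'

with_backmatch = quoted.match(sample).group()
without_backmatch = naive.match(sample).group()
mixed_with_backmatch = quoted.match(mixed)
mixed_without_backmatch = naive.match(mixed).group()
```

The naive pattern stops at the apostrophe inside the string, and it also accepts the mismatched pair that the backmatch version rejects.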

Basic Regexp Forms

Text(str)

Match str exactly. Can also be used on an arbitrary Regexp form (in case you’re passed a parameter which may either be a Regexp or a plain string), in which case that Regexp is returned unchanged.

Any

Match any one character, except for a newline. If the DOTALL flag is specified (more on flags below), will match a newline as well.

Start End

Match at the beginning and end of the string, respectively. If the MULTILINE flag is specified, also matches the beginning and end of each line within the string.

StartString EndString

Match only at the beginning and end of the string, respectively. Only needed if the MULTILINE flag is set; otherwise they’re the same as Start and End.
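The distinction only shows up on multi-line input. The sketch below uses re's ^ and \A anchors, on the assumption that Start and StartString correspond to them.

```python
import re

text = 'spam\neggs'

# Assumed equivalent of Start: with MULTILINE, matches at every line start.
line_starts = [m.start() for m in re.finditer(r'^', text, re.MULTILINE)]

# Assumed equivalent of StartString: only the very start, even with MULTILINE.
string_starts = [m.start() for m in re.finditer(r'\A', text, re.MULTILINE)]

# Without MULTILINE the two behave the same.
plain_starts = [m.start() for m in re.finditer(r'^', text)]
```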

Wordbreak

Matches at the boundary between an alphanumeric character and a non-alphanumeric character: in other words, at either end of a word. Can be negated.

Digit

Matches any decimal digit (0-9). Can be negated, and used in character sets.

Digits

A synonym for Digit, which looks better in some contexts.

Whitespace

Matches space, tab, newline, and other whitespace characters. Can be negated, and used in character sets.

Alphanum

Matches any letter, digit, or underscore. If the LOCALE flag is specified, also matches any additional characters that are defined as letters in the current locale, such as accented vowels. Can be negated, and used in character sets.

Raw(str, priority)

Allows a raw Regexp to be passed through to the underlying re module. Specify a priority of 99 for an atomic Regexp (a single character, or a parenthesized form), or 0 if the Regexp needs to be parenthesized to ensure proper precedence if any operators are applied to it.

Using Regexps

To actually match strings against any of the Regexp forms defined here, the form must be compiled:

Re(expr, flags = 0)

The result is a compiled RegexObject, as defined in the re module, and can be passed to functions or have its methods called as defined in that module. See the Python Library documentation for details. However, you don’t need to explicitly import re: everything from re is imported into reverb, so you can grab whatever definitions you need as part of your import from reverb.

Flags

The optional flags parameter can be specified as any of the following (use ‘|’ to combine multiple values)

I IGNORECASE

Pattern matching is made case-insensitive.

L LOCALE

Use locale-specific definitions for the Alphanum and Wordbreak Regexp forms.

M MULTILINE

Allow Start and End to match in each line within the string, as well as the string as a whole.

S DOTALL

Allow the Any form to match absolutely any character, even a newline.
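Since these are the re module's own flags, combining them works the same way as with re.compile(). A minimal sketch using a hand-written pattern in place of a reverb expression:

```python
import re

# Two flags combined with '|': case-insensitive, and '.' matches a newline.
relaxed = re.compile(r'^spam.eggs$', re.IGNORECASE | re.DOTALL)
strict = re.compile(r'^spam.eggs$')

text = 'SPAM\nEggs'
relaxed_matches = relaxed.match(text) is not None
strict_matches = strict.match(text) is not None
```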

When using the re module’s sub and subn functions/methods, there’s a helper function in reverb for specifying replacement text that references a group:

Matched(group)  # Returns reference to group (by name or position)

The result is just a string, that should be concatenated (by ‘+’, for example) to any other replacement text. Note the subtle difference here: Match() matches a group in the Regexp itself, Matched() references a group in the replacement text. There unfortunately didn’t seem to be any way of writing one function that would work in both places.
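In plain re terms, a group reference in replacement text looks like the sketch below. Matched() presumably returns a string along these lines, but the exact text it produces is not stated here, so the reference strings are written by hand.

```python
import re

swap = re.compile(r'(?P<first>\w+) (?P<second>\w+)')

# Reference groups by name in the replacement text...
by_name = swap.sub(r'\g<second>' + ' ' + r'\g<first>', 'spam eggs')

# ...or by position.
by_position = swap.sub(r'\2' + ' ' + r'\1', 'spam eggs')
```

The references are concatenated with ordinary text using ‘+’, as described above.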

Unimplemented Features

The following capabilities of the re module have intentionally been omitted:

  • The X/VERBOSE flag, since the whole point of reverb is to make it possible to write really verbose Regexps.
  • The ability to embed comments in a Regexp with (?#...), since you can just as easily use normal Python comments now.
  • The ability to embed flag settings inside a Regexp.

This was done since reverb allows (and even encourages) splitting Regexps into multiple subexpressions: embedded flags currently affect the entire Regexp, and would cause unexpected changes in the behavior of subexpressions that don’t contain the embedded flag.

Changes

reverb 2.0.1

  • Bugfix: typo removed
  • Bugfix: file permissions changed so that the reverb-2.0.1.tar.gz package can be unpacked with tar.

( thanks to Mike Salib for reporting the bugs )

reverb 2.0

Author: Kay Schluehr ( kay@fiber-space.de )

  • Capitalize the names of patterns.
  • Separate tests into own module.
  • Create Sphinx docs from docstring documentation in original reverb.py module.

reverb 1.0

Author: Jason Harper ( JasonHarper@pobox.com )

Initial release.