All of the following examples assume that you’ve run:
from reverb import *
Since this will put unknown junk in your namespace, in a real program you’d probably import only the names you actually need, or use ‘import reverb’ and prefix all the names shown here with ‘reverb.’.
A regular expression or Regexp specifies a set of strings that match it.
The most basic form of a Regexp matches only one particular string:
Text('spam') # Matches 'spam', and nothing else
No characters in the string have any special meaning, except as defined by Python’s syntax for string literals (for example, if you wanted to match a backslash, you’d need to double it, or use a raw string).
In most cases, you can simply specify a string: the Text() function is only strictly needed when both sides of an operator would otherwise be plain strings (since Python has no way of redefining operators in that case: at least one side has to be a class instance, such as returned by Text()).
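To make the "no special meaning" rule concrete without depending on reverb itself, here is a minimal sketch that uses only the standard re module. It assumes that Text(s) corresponds to the escaped pattern re.escape(s); that mapping is an assumption made for illustration, not something taken from reverb's source.

```python
import re

# Assumption: Text('a.b') behaves like the escaped pattern below,
# so the dot is an ordinary character rather than a wildcard.
literal = re.compile(re.escape('a.b'))

matches_exact = literal.fullmatch('a.b') is not None
matches_other = literal.fullmatch('axb') is not None

# Python's own literal syntax still applies: a single backslash is
# written either doubled or inside a raw string.
doubled = 'C:\\temp'
raw = r'C:\temp'
```

Here matches_exact is true and matches_other is false, and the two spellings of the path are the same string.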
In order to make the following examples more meaningful, I’ll define one additional Regexp form here:
Digit # Matches any one character from '0' to '9'
Regexps can be combined in various ways to form more complicated expressions.
A + B means a match of A immediately followed by a match of B:
Text('spam') + Text('eggs') # Matches 'spameggs'
Note that in this example, you could leave off either of the calls to Text(): plain strings automatically have Text() applied to them when used in an expression with any Regexp form. In fact, you could leave off both Text(), since Python’s definition of ‘+’ on strings happens to match the definition for Regexps.
You can also express concatenation as A & B:
'spam' & Digit # Matches 'spam0' thru 'spam9'
A | B means a match of either A or B:
Text('spam') | Text('eggs') # Matches 'spam' or 'eggs'
Again, you could leave off either call to Text(). However, in this case you couldn’t omit both calls, as there’s no way to assign a meaning to 'spam' | 'eggs' in Python. ‘|’ has lower precedence than concatenation; you can use parentheses as usual to force a different order of evaluation:
'spam' | Digit + Digit # Matches 'spam' or a 2-digit number
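The precedence rule can be checked with plain re patterns. The sketch below does not import reverb; the two patterns are hand-written guesses at what the reverb expressions in the comments correspond to.

```python
import re

# 'spam' | Digit + Digit      (assumed equivalent: spam|\d\d)
loose = re.compile(r'spam|\d\d')

# ('spam' | Digit) + Digit    (assumed equivalent: (?:spam|\d)\d)
tight = re.compile(r'(?:spam|\d)\d')

samples = ('spam', '42', 'spam7')
loose_hits = [s for s in samples if loose.fullmatch(s)]
tight_hits = [s for s in samples if tight.fullmatch(s)]
```

Without parentheses the alternation covers everything on each side of ‘|’, so loose_hits is ['spam', '42']; with parentheses the trailing digit is always required, so tight_hits is ['42', 'spam7'].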
-A matches anything except what matches A, but not all Regexp forms support this:
-Digit # Matches any one character that's not '0' thru '9'
The Regexp forms which support negation will indicate so in their description.
Simple text matches are NOT negatable: this could be implemented, but would result in a very inefficient Regexp (in most cases, you could use -FollowedBy(str), which will be described later). As you might expect, A - B means A + -B:
'spam' - Digit # Matches 'spamX', etc., but not 'spam1' or 'spam'
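As a rough illustration of negation, the following sketch uses the standard re module only. The patterns are assumed equivalents of the reverb expressions named in the comments, not output produced by reverb.

```python
import re

# -Digit            (assumed equivalent: \D)
not_digit = re.compile(r'\D')

# 'spam' - Digit    (assumed equivalent: spam\D)
spam_then_non_digit = re.compile(r'spam\D')

results = {
    s: spam_then_non_digit.fullmatch(s) is not None
    for s in ('spamX', 'spam1', 'spam')
}
```

The negated form still consumes one character, which is why the bare 'spam' is rejected along with 'spam1'.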
FollowedBy(A) matches at any point where the following text would match A, but that text doesn’t become part of the matched range:
'spam' + FollowedBy(' ') # Matches the first 'spam' in
# 'spam spam', but NOT the second
Lookaheads can be negated:
'spam' - FollowedBy(' ') # Matches only the last 'spam' in
# 'spam spam spam'
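Both lookahead examples can be reproduced with the re module's (?=...) and (?!...) syntax. This is a sketch under the assumption that FollowedBy(A) and -FollowedBy(A) correspond to those two constructs; reverb itself is not used.

```python
import re

text = 'spam spam spam'

# 'spam' + FollowedBy(' ')   (assumed equivalent: spam(?= ))
followed = re.compile(r'spam(?= )')

# 'spam' - FollowedBy(' ')   (assumed equivalent: spam(?! ))
not_followed = re.compile(r'spam(?! )')

followed_starts = [m.start() for m in followed.finditer(text)]
not_followed_starts = [m.start() for m in not_followed.finditer(text)]

# The space is checked but never consumed.
first_match = followed.search(text).group()
```

The positive lookahead finds the first two occurrences (offsets 0 and 5), the negative one finds only the last (offset 10), and first_match is just 'spam' with no trailing space.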
A match for any character in a range can be written as:
Char(A).to(B)
A and B must evaluate to 1-character strings. This form can be negated, matching any character not in the range. More complicated character sets can be built up with:
Set(<items>)
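where <items> is a comma-separated list of elements such as plain strings (each character becomes a member of the set), Char(A).to(B) ranges, and predefined forms that are described as usable in character sets, for example Digit.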
Sets can be negated (but not the individual items making up the set), resulting in a set that will match any character other than the ones specified.
Example
Set(Digit, Char('A').to('F'), Char('a').to('f')) # Matches any hexadecimal digit
Hint: the standard string module contains several predefined strings, such as lowercase and octdigits, that you may find useful as the basis for a set. The example above could be more compactly written as Set(string.hexdigits).
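For comparison, here is how the hexadecimal-digit set looks as a raw character class. The sketch uses only the standard library and assumes that a reverb Set compiles to a bracketed class of this kind. Note that in Python 3 the string constant for lowercase letters is string.ascii_lowercase; string.hexdigits and string.octdigits keep their names.

```python
import re
import string

# Set(Digit, Char('A').to('F'), Char('a').to('f'))
# (assumed equivalent: [0-9A-Fa-f])
hex_digit = re.compile(r'[0-9A-Fa-f]')

# Set(string.hexdigits), built from the predefined string instead.
from_string = re.compile('[' + re.escape(string.hexdigits) + ']')

# A negated set matches any character NOT listed.
non_hex = re.compile(r'[^0-9A-Fa-f]')

agree = all(
    bool(hex_digit.fullmatch(c)) == bool(from_string.fullmatch(c))
    for c in '09afAFgG-'
)
```

Both spellings accept the same characters, and non_hex accepts exactly the characters they reject.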
Warning
The underlying Regexp implementation specially interprets the characters ‘-’, ‘^’, ‘]’, and ‘\’ within character sets. The current version of reverb does NOT attempt to work around these special interpretations, and may therefore produce an invalid Regexp if any of these characters are used in a set. Here are some work-arounds:
'-' or ']': must be the first character in the set. If you need both, you lose.
'^': must not be the first character in the set.
'\': specify this as R'\\'.
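The placement rules come from the re module itself, so they can be demonstrated with raw character classes. This sketch shows the behaviour of the underlying re syntax only; it makes no claim about what any particular reverb Set call emits.

```python
import re

dash_first = re.compile(r'[-ab]')     # leading '-' is a literal dash
bracket_first = re.compile(r'[]ab]')  # leading ']' is a literal bracket
caret_later = re.compile(r'[a^b]')    # '^' after the first position is literal
backslash = re.compile(r'[\\ab]')     # a doubled backslash is one literal backslash

# For contrast: a leading '^' negates the whole set.
negated = re.compile(r'[^ab]')

literal_checks = [
    dash_first.fullmatch('-') is not None,
    bracket_first.fullmatch(']') is not None,
    caret_later.fullmatch('^') is not None,
    backslash.fullmatch('\\') is not None,
]
```

All four literal checks succeed, while negated matches 'c' but not 'a'.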
There are several forms for specifying that a Regexp can or must match multiple times. The most general is:
Repeated(expr, min, max = None)
which will match the expr at least min times (which may be 0), up to max times (unlimited, if max is None or omitted). You may specify the min and max parameters either positionally, or as keyword arguments to make it clearer what they’re for. The other repeat forms are actually special cases of Repeated():
Optional(expr, max = None) # 0 or more; Repeated(expr, 0, max)
Required(expr, max = None) # 1 or more; Repeated(expr, 1, max)
Maybe(expr) # 0 or 1; Repeated(expr, 0, 1)
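The four repeat forms line up with the familiar re quantifiers. The table below is a sketch using the standard re module; the pairing of each reverb form with a quantifier is an assumption based on the descriptions above, and the helper name accepts is made up for this example.

```python
import re

# Assumed quantifier for each repeat form, applied to Digit (\d).
patterns = {
    'Repeated(Digit, 2, 4)': r'\d{2,4}',
    'Optional(Digit)': r'\d*',
    'Required(Digit)': r'\d+',
    'Maybe(Digit)': r'\d?',
}

def accepts(pattern, s):
    """Return True if the whole string s matches pattern."""
    return re.fullmatch(pattern, s) is not None

# For each form, list which run lengths (0 to 5 digits) are accepted.
accepted_lengths = {
    name: [n for n in range(6) if accepts(p, '7' * n)]
    for name, p in patterns.items()
}
```

The result shows the minimum and maximum of each form directly: lengths 2 to 4 for Repeated(Digit, 2, 4), 0 upward for Optional, 1 upward for Required, and 0 or 1 for Maybe.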
Repeat forms normally use a “greedy” strategy: they initially match as many copies of the Regexp as possible, up to the specified maximum. If this results in the rest of the overall Regexp failing to match, they will try matches of fewer and fewer copies (down to the specified minimum number), in an attempt to make the complete Regexp match successfully.
Sometimes this produces undesired results, and for those cases you can request a “non-greedy” strategy by adding .nongreedy or .minimal to the repeat form. A non-greedy repeat will initially match the specified minimum number of copies of the Regexp, but will try matching additional copies (up to max) if needed to make the complete Regexp match. An example will follow, but first another basic Regexp form needed to understand it:
Any # Match any one character (exact details given later)
Given the string ‘[ab] [cd] [ef]’:
'[' + Optional(Any) + ']' # Matches the entire string! Oops...
'[' + Optional(Any).nongreedy + ']' # Matches '[ab]'
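The same greedy versus non-greedy contrast can be observed with the standard re module. The two patterns are assumed equivalents of the reverb expressions shown in the comments.

```python
import re

text = '[ab] [cd] [ef]'

# '[' + Optional(Any) + ']'             (assumed equivalent: \[.*\])
greedy = re.search(r'\[.*\]', text).group()

# '[' + Optional(Any).nongreedy + ']'   (assumed equivalent: \[.*?\])
minimal = re.search(r'\[.*?\]', text).group()

# With the non-greedy form, each bracketed item is found separately.
all_items = re.findall(r'\[.*?\]', text)
```

The greedy search returns the whole string, the non-greedy search returns '[ab]', and findall with the non-greedy pattern yields the three items one by one.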
It is possible to assign a label to a portion of a Regexp for later reference. This can be used to extract multiple items of data from a string at one time (rather than using several separate Regexp matches), for fancy search & replace operations, and for verifying that the same text appears at more than one place within the string:
Group(expr, name = None) # Match expr, and label the matched text
If a name is specified, it must be a string following the normal rules for a Python identifier: a letter or underscore followed by zero or more letters, digits, or underscores. Even if no name is given, the Group can be referenced by its position, starting with 1 for the leftmost Group() in the complete Regexp.
To the right of a labelled group, it is possible to specify a match of the exact text that the group matched (which might not be the only text that the group’s Regexp was capable of matching):
Match(name) # Match whatever the preceding Group(expr, name) did
Example
The following will match any of Python’s formats for string literals, single or triple quoted, except for handling of backslash escapes:
delim = Text('"') | Text("'") | Text('"" "') | Text("'''")
# Note: the third Text() above shouldn't have a space in it, but I had
# to split it up to keep from prematurely ending the module docstring.
Group(delim, 'q') + Optional(Any).nongreedy + Match('q')
Note that the backmatch requires that the starting and ending quote marks be exactly the same: a Regexp like delim+Optional(any)+delim would match strings that start with a single quote and end with a double quote, for example.
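A named group with a backreference can be sketched in plain re syntax as follows. This assumes Group(expr, 'q') corresponds to (?P<q>...) and Match('q') to (?P=q); reverb is not imported. One deliberate difference from the example above: the re module tries alternatives left to right, so this sketch lists the triple-quote delimiters first. With the single-character delimiters first, a search on '"""abc"""' would stop at the leading '""'.

```python
import re

# Delimiters, longest first so that triple quotes win over single ones.
delims = ['"""', "'''", '"', "'"]
delim = '|'.join(re.escape(d) for d in delims)

# Group(delim, 'q') + Optional(Any).nongreedy + Match('q')
string_literal = re.compile('(?P<q>' + delim + ')' + '.*?' + '(?P=q)')

double_quoted = string_literal.match('"abc" tail')
triple_quoted = string_literal.match('"""a"b"""')

# Opening and closing quotes must be identical for the backreference.
mismatched = string_literal.fullmatch('\'abc"')
```

double_quoted.group() is '"abc"', triple_quoted.group() is the whole triple-quoted literal including the embedded quote, and mismatched is None.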
Text(str)
Match str exactly. Can also be used on an arbitrary Regexp form (in case you’re passed a parameter which may either be a Regexp or a plain string), in which case that Regexp is returned unchanged.
Any
Match any one character, except for a newline. If the DOTALL flag is specified (more on flags below), will match a newline as well.
Start End
Match at the beginning and end of the string, respectively. If the MULTILINE flag is specified, also matches the beginning and end of each line within the string.
StartString EndString
Match only at the beginning and end of the string, respectively. Only needed if the MULTILINE flag is set; otherwise they’re the same as Start and End.
Wordbreak
Matches at the boundary between an alphanumeric character and a non-alphanumeric character: in other words, at either end of a word. Can be negated.
Digit
Matches any decimal digit (0-9). Can be negated, and used in character sets.
Digits
A synonym for Digit, which looks better in some contexts.
Whitespace
Matches space, tab, newline, and other whitespace characters. Can be negated, and used in character sets.
Alphanum
Matches any letter, digit, or underscore. If the LOCALE flag is specified, also matches any additional characters that are defined as letters in the current locale, such as accented vowels. Can be negated, and used in character sets.
Raw(str, priority)
Allows a raw Regexp to be passed through to the underlying re module. Specify a priority of 99 for an atomic Regexp (a single character, or a parenthesized form), or 0 if the Regexp needs to be parenthesized to ensure proper precedence if any operators are applied to it.
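The difference between the two pairs of anchors, and the effect of Wordbreak, can be illustrated with the standard re module. The sketch assumes the usual correspondences (Start and End to ^ and $, StartString and EndString to \A and \Z, Wordbreak to \b, Alphanum to \w); these pairings are assumptions, not verified reverb output.

```python
import re

text = 'spam\neggs'

# Start with MULTILINE matches at the beginning of every line.
line_starts = re.findall(r'^\w+', text, re.MULTILINE)

# StartString matches only at the very beginning, even with MULTILINE.
string_start = re.findall(r'\A\w+', text, re.MULTILINE)

# Without MULTILINE, Start behaves like StartString.
plain_start = re.findall(r'^\w+', text)

# EndString matches only at the very end, even with MULTILINE.
string_end = re.findall(r'\w+\Z', text, re.MULTILINE)

# Wordbreak on both sides selects whole words only.
whole_words = re.findall(r'\bspam\b', 'spam spammer spam')
```

line_starts contains both words, string_start and plain_start contain only 'spam', string_end contains only 'eggs', and whole_words finds two matches because 'spammer' has no word boundary after 'spam'.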
To actually match strings against any of the Regexp forms defined here, the form must be compiled:
Re(expr, flags = 0)
The result is a compiled RegexObject, as defined in the re module, and can be passed to functions or have its methods called as defined in that module. See the Python Library documentation for details. However, you don’t need to explicitly import re: everything from re is imported into reverb, so you can grab whatever definitions you need as part of your import from reverb.
The optional flags parameter can be specified as any of the following (use ‘|’ to combine multiple values):
I IGNORECASE
Pattern matching is made case-insensitive.
L LOCALE
Use locale-specific definitions for the Alphanum and Wordbreak Regexp forms.
M MULTILINE
Allow Start and End to match in each line within the string, as well as the string as a whole.
S DOTALL
Allow the Any form to match absolutely any character, even a newline.
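Since the flags are the ones defined by the re module, combining them works the same way as with re.compile. The sketch below uses re.compile directly as a stand-in for Re(expr, flags); treating the two as interchangeable here is an assumption.

```python
import re

# Stand-in for Re('spam' + Any + 'eggs', I | S)
both = re.compile(r'spam.eggs', re.I | re.S)

# Stand-in for Re('spam' + Any + 'eggs', I)
case_only = re.compile(r'spam.eggs', re.I)

sample = 'SPAM\nEggs'
both_ok = both.fullmatch(sample) is not None
case_only_ok = case_only.fullmatch(sample) is not None

# The short and long names refer to the same flag values.
same_flags = (re.I == re.IGNORECASE) and (re.S == re.DOTALL)
```

With both flags the sample matches; with IGNORECASE alone it does not, because the dot refuses to match the newline.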
When using the re module’s sub and subn functions/methods, there’s a helper function in reverb for specifying replacement text that references a group:
Matched(group) # Returns reference to group (by name or position)
The result is just a string, that should be concatenated (by ‘+’, for example) to any other replacement text. Note the subtle difference here: Match() matches a group in the Regexp itself, Matched() references a group in the replacement text. There unfortunately didn’t seem to be any way of writing one function that would work in both places.
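In re's replacement syntax, a group reference is written \g<name> or \g<number>. The following sketch assumes that Matched(group) returns a string of that form; it builds the reference by hand rather than calling reverb.

```python
import re

# Group(Required(Digit), 'n')   (assumed equivalent: (?P<n>\d+))
number = re.compile(r'(?P<n>\d+)')

# Assumed result of Matched('n'), concatenated with other replacement text.
by_name = '<' + r'\g<n>' + '>'

# Assumed result of Matched(1), referencing the group by position.
by_position = '<' + r'\g<1>' + '>'

named_result = number.sub(by_name, 'spam 42 eggs 7')
positional_result = number.sub(by_position, 'spam 42')
```

Each run of digits is wrapped in angle brackets: 'spam <42> eggs <7>' for the named reference and 'spam <42>' for the positional one.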
The following capabilities of the re module have intentionally been omitted:
This was done since reverb allows (and even encourages) splitting Regexps into multiple subexpressions: embedded flags currently affect the entire Regexp, and would cause unexpected changes in the behavior of subexpressions that don’t contain the embedded flag.
( thanks to Mike Salib for reporting the bugs )
Author: Kay Schluehr ( kay@fiber-space.de )