Regexp Syntax Summary

This table summarizes the meaning of various strings in different regexp syntaxes. It is intended as a quick reference, rather than a tutorial or specification. Please report any errors.

String	GNU grep	BRE (grep)	ERE (egrep)	GNU Emacs	Perl	Python	Tcl
`.`	Any character	Any character except `\0`		Any character except `\n`			Any character
`[`...`]`	Bracket Expression			Character Set	Character Class		Bracket Expression
`\(`re`\)`	Subexpression			Grouping
re`\{`...`\}`	Match re multiple times			Match re multiple times
`(`re`)`			Subexpression		Grouping
re`{`...`}`			Match re multiple times		Match re multiple times
re`{`...`}?`					Nongreedy {}
`\`digit	Back-reference
`^`	Start of line
`$`	End of line
re`?`			re 0 or 1 times
re`*`	re 0 or more times
re`+`			re one or more times
l`|`r			l or r		l or r
`*?`				Non-greedy `*`
`+?`				Non-greedy `+`
`??`				Non-greedy `?`
`\A`					Start of string
`\b`	Either end of word			Either end of word
`\B`	Not either end of word			Not either end of word			Synonym for `\`
`\c`C				Any in category C
`\C`C				Any not in category C
`\C`					Any octet
`\d`					Digit
`\D`					Non-digit
`\G`					At `pos()`
`\m`							Start of word
`\M`							End of word
`\p`property `\p{`property`}`					Unicode property
`\P`property `\P{`property`}`					Not unicode property
`\s`C				Any with syntax C
`\S`C				Any with syntax not C
`\s`					Whitespace
`\S`					Non-whitespace
`\w`	Same as `[[:alnum:]]`			Same as `\sw`	Alphanumeric and `_`
`\W`	Same as `[^[:alnum:]]`			Same as `\Sw`	Not alphanumeric or `_`
`\X`					Combining sequence
`\y`							Either end of word
`\Y`							Not either end of word
`\Z`					End of string/last line	End of string
`\z`					End of string
`` \` ``				Start of buffer/string
`\'`				End of buffer/string
`\<`	Start of word			Start of word
`\>`	End of word			End of word
re`\?`	re 0 or 1
re`\+`	re 1 or more
l`\|`r	l or r			l or r
`(?#`text`)`					Comment, ignored
`(?`modifiers`)`					Embedded modifiers
`(?`modifiers`:`re`)`					Shy grouping + modifiers
`(?:`re`)`					Shy grouping
`\(?:`...`\)`				Shy grouping
`(?=`re`)`					Lookahead
`(?!`re`)`					Negative lookahead
`(?<=`p`)`					Lookbehind
`(?<!`p`)`					Negative lookbehind
`(?{`code`})` `(??{`code`})`					Embedded Perl
`(?>`re`)`					Independent expression
`(?(`cond`)`re`)` `(?(`cond`)`re`|`re`)`					Conditional expression
`(?P<`name`>re)`						Symbolic grouping
`(?P=`name`)`						Symbolic backref

Who Uses What?

BRE refers to POSIX "basic regular expressions" and ERE is POSIX "extended regular expressions".

APIs

regcomp uses BREs by default but can also use EREs. It has a variety of other options which modify the syntax slightly.

Boost's regex++ supports a variety of syntaxes.

PCRE is almost the same as Perl, though it doesn't support the embedded Perl feature and the man page lists a number of other differences.

Languages

awk is supposed to use EREs, plus the extra C-style escapes \\, \a, \b, \f, \n, \r, \t, \v with their usual meanings. sed is supposed to use BREs, plus \n with its usual meaning.

lex is also supposed to use EREs with some extensions: "..." quotes everything inside it (backslash escapes are recognized); an initial <state> matches a start condition; r/x matches r only when followed by x; and {name} matches the value of a substitution symbol. A variety of escape sequences, including the usual C ones, are recognized. Possibly this deserves a new column.

Tools

grep is supposed to use BREs, except that grep -E uses EREs. (GNU grep fits in some extensions where POSIX leaves the behaviour unspecified.) egrep uses EREs. grep -F doesn't use regexps at all, of course.

ed uses BREs. ex and vi use BREs but additionally support \< and \> as described above, and use ~ to match the replacement part of the previous substitution.

expr uses BREs with all patterns implicitly anchored at the start.

The regexp syntax accepted by less depends on how it is built but PCRE and POSIX EREs are likely outcomes on modern systems.

Vim has enough differences and extensions that it perhaps deserves a column (or two) to itself.

Subexpressions, Grouping and Back-References

Subexpressions or groups are surrounded by ( and ), or sometimes \( and \). They serve two purposes: firstly, they override the precedence rules of other operators, and secondly, they "capture" part of the text matched by a regexp. The captured text can then be used later on in the regexp via the \digit syntax (this is called a back-reference), or outside the regexp to extract the appropriate part of a string.

"Shy grouping" has the precedence-overriding feature but not the capturing feature.

"Symbolic grouping" allows groups to be identified by name rather than number.

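The three kinds of grouping described in this section can be tried out in Python, which the table lists as supporting all of them. This is only a sketch: the patterns and sample strings are invented for illustration, and other syntaxes spell the same ideas differently (for example \( and \) in BREs).

```python
import re

# Capturing group plus back-reference: \1 must repeat whatever group 1 matched.
doubled = re.compile(r"\b(\w+) \1\b")
m = doubled.search("it was the the best of times")
print(m.group(1))                   # the captured word, extracted outside the regexp

# Shy grouping: overrides precedence so + applies to "ab", but captures nothing.
shy = re.compile(r"(?:ab)+c")
print(shy.match("ababc").groups())  # an empty tuple, since there are no captures

# Symbolic grouping and symbolic back-reference: groups referred to by name.
quoted = re.compile(r"(?P<quote>['\"])(?P<body>.*?)(?P=quote)")
q = quoted.search("say 'hello' now")
print(q.group("body"))
```

Without the shy group, `ab+c` would apply `+` to `b` alone, so it would not match `ababc` from the start.
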
Match Multiple Times

The syntax of this varies a bit; sometimes you use \{ and \}, and sometimes you use { and }. However the idea is the same:

RE{N} will match RE exactly N times.
RE{N,} will match RE N or more times.
RE{N,M} will match RE between N and M times (inclusive).

It is worth noting that the GNU Grep manual says:

   Traditional `egrep' did not support the `{' metacharacter, and some
`egrep' implementations support `\{' instead, so portable scripts
should avoid `{' in `egrep' patterns and should use `[{]' to match a
literal `{'.
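
The three repetition forms can be checked quickly in Python, which uses the unescaped { and } spelling. This is a minimal sketch; the helper name `matches` is made up for the example and is not part of any library.

```python
import re

def matches(pattern, text):
    """Return True if the whole of text matches pattern."""
    return re.fullmatch(pattern, text) is not None

print(matches(r"a{3}", "aaa"))      # exactly 3 times
print(matches(r"a{3}", "aaaa"))     # 4 is too many for {3}
print(matches(r"a{2,}", "aaaaa"))   # 2 or more times
print(matches(r"a{2,4}", "aaaa"))   # between 2 and 4 times, inclusive
print(matches(r"a{2,4}", "aaaaa"))  # 5 is outside the range

# A bracket expression gives a literal {, as the quoted advice suggests.
print(matches(r"[{]x", "{x"))
```

`re.fullmatch` is used so that each test covers the whole string; with `re.search`, `a{3}` would also find a match inside `aaaa`.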

Bracket Expressions

This refers to expressions in [square brackets], for which POSIX defines a complicated syntax all of their own.

Firstly, if the first character after the [ is a ^ (caret) then the sense of the match is reversed.

The rest of the bracket expression consists of a sequence of elements selected from the following list. The bracket expression as a whole matches any character (or character sequence) that is matched by at least one of them (or is matched by none of them, if an initial ^ was used).

1. Collating symbols. These look like [.element.], where element is a collating element (i.e. a symbolic name for a multi-character string), and match the value of the collating element in the current locale. This doesn't seem to work in GNU grep.

2. Equivalence classes. These look like [=element=], where element is a collating element. They match any collating element (single or multiple characters) which has the same primary weight as element, i.e. any that occupies the same place in the current locale's collation sequence. This doesn't seem to work in GNU grep.

3. Character classes. These look like [:class:], where class is the name of the character class to match. The following character classes exist in all locales:

[:alnum:] [:alpha:] [:blank:] [:cntrl:] [:digit:] [:graph:]
[:lower:] [:print:] [:punct:] [:space:] [:upper:] [:xdigit:]

4. Range expressions. These look like start-end where start and end are either single characters or collating symbols. The behaviour is only specified in the POSIX locale, where they match all the characters between start and end inclusive.

5. Single characters. These match themselves.

To include a ], put it immediately after the opening [ or [^; if it occurs later it will close the bracket expression. The hyphen (-) is not treated as a range separator if it appears first or last, or as the endpoint of a range.

Emacs "character sets" are similar to bracket expressions, except that collating symbols, equivalence classes and character classes aren't supported.

Perl "character classes" are also similar. They support POSIX character class syntax (argh, confusing names!) and recognize, but don't support, collating symbols or equivalence classes.
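
The placement rules for ^, ] and - carry over to Python's sets, so they can be illustrated there. This is a sketch under that assumption only: Python's `re` module is not a POSIX implementation, and the example does not use collating symbols, equivalence classes or [:class:] names.

```python
import re

# An initial ^ reverses the sense of the match.
print(re.findall(r"[^0-9]", "a1b2"))

# A ] placed immediately after [ (or [^) is an ordinary member of the set.
print(re.findall(r"[]x]", "a]bx"))
print(re.findall(r"[^]x]", "a]bx"))

# A hyphen placed first or last is literal, not a range separator.
print(re.findall(r"[-a]", "a-b"))
print(re.findall(r"[a-]", "a-b"))

# Between two endpoints the hyphen forms a range, inclusive at both ends.
print(re.findall(r"[a-c]", "abcd-"))
```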

GNU Grep and `.`

GNU Grep has slightly strange handling of . and newlines.

Firstly, the manual says that . matches "any single character". Superficially it appears not to match the newline character:

$ echo | grep .
$

The outcome is actually in keeping with standard and traditional behaviour for grep, where the newline is not included in the text to be matched. But that doesn't appear to be quite what's going on with the GNU version, as explicitly searching for a newline does produce a match:

$ echo | perl -e 'exec("/usr/bin/grep","\n");'

$

So is there a newline to match against or not?

The other case to consider is when the -z or --null-data option is used. In that case, . definitely does match a newline, exactly as the manual says:

$ perl -e 'print "\n\0";' | grep -z . | od -tx1
0000000 0a 00
0000002
$

Sources

The POSIX regular expression specification can be found at http://www.opengroup.org/onlinepubs/007904975/basedefs/xbd_chap09.html. For the regexp languages used by particular programs, I looked at the documentation for GNU Grep 2.4.2; GNU Emacs 21.2.1; Perl 5.6.1; Python 2.2.1; Tcl 8.3.3; and less 458.

All errors are my own!
