Patterns consist of literal strings and character classes. Patterns may contain sub-patterns, which are patterns enclosed in parentheses.
In patterns as well as in character classes, some characters have a special meaning. To literally match any of those characters, they must be marked or escaped to let the regular expression software know that it should interpret such characters in their literal meaning.
This is done by prepending the character with a backslash (\).
The regular expression software will silently ignore escaping a character that does not have any special meaning in the context, so escaping for example a “j” (\j) is safe. If you are in doubt whether a character could have a special meaning, you can therefore escape it safely.
Escaping of course applies to the backslash character itself: to match a literal backslash, you would write \\.
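The effect of escaping can be sketched with Python's re module, used here only as a stand-in for the regular expression software described in this section. Python's engine is stricter than the text above describes: it rejects unknown letter escapes such as \j, so that part is not shown. The sample strings are made up for illustration.

```python
import re

text = "version 1.5 and build 105"

# Unescaped dot: "." matches any character, so "105" matches as well.
unescaped = re.findall(r"1.5", text)

# Escaped dot: only the literal string "1.5" matches.
escaped = re.findall(r"1\.5", text)

# A literal backslash is written as \\ in the pattern.
backslashes = re.findall(r"\\", r"C:\temp\file")

print(unescaped)         # ['1.5', '105']
print(escaped)           # ['1.5']
print(len(backslashes))  # 2
```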
A character class is an expression that matches one of a defined set of characters. In regular expressions, character classes are defined by putting the legal characters for the class in square brackets, [], or by using one of the abbreviated classes described below.
Simple character classes just contain one or more literal characters, for example [abc] (matching any of the letters “a”, “b” or “c”) or [0123456789] (matching any digit).
Because letters and digits have a logical order, you can abbreviate those by specifying ranges of them: [a-c] is equal to [abc] and [0-9] is equal to [0123456789]. Combining these constructs, for example [a-fynot1-38], is completely legal (this class matches any of “a”, “b”, “c”, “d”, “e”, “f”, “y”, “n”, “o”, “t”, “1”, “2”, “3” or “8”).
As capital letters are different characters from their non-capital equivalents, to create a caseless character class matching “a” or “b”, in any case, you need to write it [aAbB].
It is of course possible to create a “negative” class, matching “anything but” the listed characters. To do so, put a caret (^) at the beginning of the class: [^abc] will match any character but “a”, “b” or “c”.
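The class forms above (literal sets, ranges, mixed-case sets and negated classes) behave the same way in Python's re module, which is used here purely as an illustrative stand-in. The input strings are invented for the example.

```python
import re

# A range is shorthand for listing every character in it.
range_hits = re.findall(r"[a-c]", "abcdef")

# Ranges and single characters can be mixed freely in one class.
combined_hits = re.findall(r"[a-fynot1-38]", "gym 1984 note")

# Classes are case sensitive, so both cases must be listed explicitly.
caseless_hits = re.findall(r"[aAbB]", "Abba Cab")

# A leading caret negates the class.
negated_hits = re.findall(r"[^abc]", "abcde")

print(range_hits)     # ['a', 'b', 'c']
print(combined_hits)  # ['y', '1', '8', 'n', 'o', 't', 'e']
print(caseless_hits)  # ['A', 'b', 'b', 'a', 'a', 'b']
print(negated_hits)   # ['d', 'e']
```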
In addition to literal characters, some abbreviations are defined, making life still a bit easier:
\a
This matches the ASCII bell character (BEL, 0x07).
\f
This matches the ASCII form feed character (FF, 0x0C).
\n
This matches the ASCII line feed character (LF, 0x0A, Unix newline).
\r
This matches the ASCII carriage return character (CR, 0x0D).
\t
This matches the ASCII horizontal tab character (HT, 0x09).
\v
This matches the ASCII vertical tab character (VT, 0x0B).
\xhhhh
This matches the Unicode character corresponding to the hexadecimal number hhhh (between 0x0000 and 0xFFFF).
\0ooo (i.e., \zero ooo)
This matches the ASCII/Latin-1 character corresponding to the octal number ooo (between 0 and 0377).
. (dot)
This matches any character (including newline).
\d
This matches a digit. Equal to [0-9].
\D
This matches a non-digit. Equal to [^0-9] or [^\d].
\s
This matches a whitespace character. Practically equal to [ \t\n\r].
\S
This matches a non-whitespace character. Practically equal to [^ \t\r\n], and equal to [^\s].
\w
Matches any “word character”, in this case any letter, digit or underscore. Equal to [a-zA-Z0-9_].
\W
Matches any non-word character, that is, anything but letters, digits or underscore. Equal to [^a-zA-Z0-9_] or [^\w].
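A few of these abbreviations can be tried out in Python's re module, again only as a stand-in for the software described here. Two differences are worth keeping in mind: Python's \d, \s and \w are Unicode-aware by default, so the “equal to” classes above hold exactly only with the re.ASCII flag, and Python's dot matches a newline only when re.DOTALL is set. The sample string is made up.

```python
import re

sample = "Room 42,\tfloor_3!"

digits = re.findall(r"\d", sample)       # same as [0-9] for this input
whitespace = re.findall(r"\s", sample)   # the blank and the tab
words = re.findall(r"\w+", sample)       # runs of letters, digits, underscore
non_word = re.findall(r"\W", sample)     # everything \w does not match

# In Python the dot needs re.DOTALL to match a newline as well.
dot_hits = re.findall(r".", "a\nb", flags=re.DOTALL)

print(digits)      # ['4', '2', '3']
print(whitespace)  # [' ', '\t']
print(words)       # ['Room', '42', 'floor_3']
print(non_word)    # [' ', ',', '\t', '!']
print(dot_hits)    # ['a', '\n', 'b']
```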
The POSIX notation for classes, [:<class name>:], is also supported. For example, [:digit:] is equivalent to \d, and [:space:] to \s.
See the full list of POSIX character classes here.
The abbreviated classes can be put inside a custom class; for example, to match a word character, a blank or a dot, you could write [\w \.].
The following characters have a special meaning inside the “[]” character class construct, and must be escaped to be literally included in a class:
]
Ends the character class. Must be escaped unless it is the very first character in the class (it may follow an unescaped caret).
^ (caret)
Denotes a negative class, if it is the first character. Must be escaped to match literally if it is the first character in the class.
- (dash)
Denotes a logical range. Must always be escaped within a character class.
\ (backslash)
The escape character. Must always be escaped.
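As a small illustration of custom classes and of escaping inside them, the following sketch uses Python's re module as a stand-in. Python accepts the same escaped forms inside a class; the input strings are invented.

```python
import re

# A custom class mixing an abbreviation (\w), a blank and an escaped dot.
chunks = re.findall(r"[\w \.]+", "file name.txt;other")

# ], ^, - and \ each escaped so they are taken literally inside the class.
specials = re.findall(r"[\]\^\-\\]", r"a]b^c-d\e")

print(chunks)    # ['file name.txt', 'other']
print(specials)  # [']', '^', '-', '\\']
```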
If you want to match one of a set of alternative patterns, you can separate those with | (vertical bar character).
For example, to find either “John” or “Harry” you would use the expression John|Harry.
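A minimal sketch of alternation, using Python's re module as a stand-in and a made-up sentence. Note that an alternative also matches inside longer words, here the “John” in “Johnny”.

```python
import re

names = re.findall(r"John|Harry", "Harry met John and Johnny")

print(names)  # ['Harry', 'John', 'John']
```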
Sub patterns are patterns enclosed in parentheses, and they have several uses in the world of regular expressions.
You may use a sub pattern to group a set of alternatives within a larger pattern. The alternatives are separated by the character “|” (vertical bar).
For example, to match any of the words “int”, “float” or “double”, you could use the pattern int|float|double. If you only want to find one of them when it is followed by some whitespace and then some letters, put the alternatives inside a subpattern: (int|float|double)\s+\w+.
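The grouped pattern above can be tried as follows, with Python's re module as a stand-in and an invented snippet of source text. The trailing “double” is not reported because nothing follows it.

```python
import re

code = "int count; float ratio; return double"
pattern = re.compile(r"(int|float|double)\s+\w+")

# group(0) is the whole match, group(1) the alternative that was chosen.
declarations = [m.group(0) for m in pattern.finditer(code)]
types = [m.group(1) for m in pattern.finditer(code)]

print(declarations)  # ['int count', 'float ratio']
print(types)         # ['int', 'float']
```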
If you want to use a back reference, use a sub pattern (PATTERN) to have the desired part of the pattern remembered.
To prevent the sub pattern from being remembered, use a non-capturing group (?:PATTERN).
For example, if you want to find two occurrences of the same word separated by a comma and possibly some whitespace, you could write (\w+),\s*\1. The sub pattern \w+ would find a chunk of word characters, and the entire expression would match if those were followed by a comma, 0 or more whitespace characters and then an equal chunk of word characters. (The string \1 references the first sub pattern enclosed in parentheses.)
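Capturing, back references and non-capturing groups can be sketched in Python's re module, which happens to use the same (PATTERN), \1 and (?:PATTERN) syntax. The test strings are made up for illustration.

```python
import re

repeated = re.compile(r"(\w+),\s*\1")

hit = repeated.search("well, well, what now")
miss = repeated.search("one, two")

print(hit.group(0))  # well, well
print(hit.group(1))  # well
print(miss)          # None

# A non-capturing group still groups the alternatives but is not
# remembered, so the name ends up in group 1.
title = re.search(r"(?:Mr|Ms)\. (\w+)", "Ms. Lovelace")
print(title.group(1))  # Lovelace
```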
Note
To avoid ambiguities when \1 is followed by digits (e.g. \12 could be the 12th subpattern or the first subpattern followed by 2), we use \{12} as the syntax for multi-digit subpatterns.
Examples:
\{12}1 is “use subpattern 12”
\123 is “use capture 1 then 23 as the normal text”
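Python's re module does not support the \{12} syntax and resolves this ambiguity differently, so the following is only an analogy: in Python replacement strings, the explicit form is \g<N>. The sketch shows a reference to capture 1 followed by the literal text 23; the input is invented.

```python
import re

# \g<1> is an unambiguous reference to group 1, so "23" stays literal text.
result = re.sub(r"(\d+)", r"\g<1>23", "id 7")

print(result)  # id 723
```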
A lookahead assertion is a sub pattern starting with either ?= (positive) or ?! (negative).
For example, to match the literal string “Bill” but only if not followed by “ Gates”, you could use this expression: Bill(?! Gates). (This would find the “Bill” in “Bill Clinton” as well as in “Billy the kid”, but silently ignore the other matches.)
Sub patterns used for assertions are not captured.
See also Assertions.
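The lookahead example can be checked with Python's re module, which uses the same ?= and ?! syntax. The sentence is made up; match positions are reported so it is visible which “Bill” was found.

```python
import re

text = "Bill Clinton, Billy the kid, Bill Gates"

# Negative lookahead: "Bill" not followed by " Gates".
negative_starts = [m.start() for m in re.finditer(r"Bill(?! Gates)", text)]

# Positive lookahead: "Bill" only when followed by " Gates".
positive = re.search(r"Bill(?= Gates)", text)

print(negative_starts)    # [0, 14]
print(positive.start())   # 29
print(positive.group(0))  # Bill  (the asserted text is not part of the match)
```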
A lookbehind assertion is a sub pattern starting with either ?<= (positive) or ?<! (negative).
Lookbehind has the same effect as lookahead, but works backwards.
For example, to match the literal string “fruit” but only if not preceded by “grape”, you could use this expression: (?<!grape)fruit.
Sub patterns used for assertions are not captured.
See also Assertions.
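A corresponding sketch for lookbehind, again with Python's re module as a stand-in. Python requires the lookbehind pattern to have a fixed width, which “grape” satisfies; the text is invented.

```python
import re

text = "grapefruit and passionfruit"

# Negative lookbehind: "fruit" not preceded by "grape".
negative_starts = [m.start() for m in re.finditer(r"(?<!grape)fruit", text)]

# Positive lookbehind: "fruit" only when preceded by "grape".
positive_starts = [m.start() for m in re.finditer(r"(?<=grape)fruit", text)]

print(negative_starts)  # [22]
print(positive_starts)  # [5]
```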
The following characters have meaning inside a pattern, and must be escaped if you want to literally match them:
\ (backslash)
The escape character.
^ (caret)
Asserts the beginning of the string.
$
Asserts the end of the string.
() (left and right parentheses)
Denote sub patterns.
{} (left and right curly braces)
Denote numeric quantifiers.
[] (left and right square brackets)
Denote character classes.
| (vertical bar)
Logical OR. Separates alternatives.
+ (plus sign)
Quantifier, 1 or more.
* (asterisk)
Quantifier, 0 or more.
? (question mark)
An optional character. Can be interpreted as a quantifier, 0 or 1.
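When a whole string has to be matched literally, escaping each of these characters by hand is error-prone. As a sketch, Python's re module offers re.escape for this; whether the software described here has an equivalent helper is not stated in this section. The sample string is made up.

```python
import re

literal = "cost: $5.00 (approx)?"

# Used as a pattern unchanged, "$", ".", "(", ")" and "?" keep their
# special meaning, so the string does not even match itself.
unescaped_match = re.search(literal, literal)

# re.escape puts a backslash in front of every special character.
escaped_match = re.fullmatch(re.escape(literal), literal)

print(unescaped_match)           # None
print(escaped_match is not None) # True
```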