paulgorman.org

Regular Expressions

In Perl:

$foo =~ m/^(F|f)oo\s*(B|b)ar$/
$foo =~ s/foo bar/bas bat/
$foo =~ tr/aeiou/AEIOU/

Does not match:

$bar !~ m/foobar/i
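
As a rough illustration of the operators above, here is a minimal sketch. The sample strings and variable names are made up for the example. Note that tr takes plain character lists rather than a regex, so no brackets or commas are needed.

```perl
use strict;
use warnings;

my $foo = "Foo   bar";

# m// tests for a match: "foo", optional whitespace, then "bar",
# with either case allowed for the F and the B.
my $matched = ($foo =~ m/^(F|f)oo\s*(B|b)ar$/) ? 1 : 0;

# s/// replaces the first occurrence of the pattern.
(my $replaced = "foo bar") =~ s/foo bar/bas bat/;

# tr/// maps characters one-to-one.
(my $upper_vowels = "education") =~ tr/aeiou/AEIOU/;

# !~ is true when the pattern does NOT match.
my $no_match = ("bazqux" !~ m/foobar/i) ? 1 : 0;

print "$matched | $replaced | $upper_vowels | $no_match\n";
```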

Quantifiers

*     Zero or more
+     One or more
?     Zero or one
{7}   Exactly seven
{3,}  Three or more
{2,5} Two, three, four, or five

Putting a question mark after the repetition (like x*?) makes it non-greedy: it matches as few characters as possible instead of as many as possible.
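
The difference between greedy and non-greedy repetition is easiest to see side by side. This is only a sketch; the sample string and variable names are invented for the example.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
my ($greedy) = $html =~ /(<.*>)/;

# Non-greedy: .*? stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /(<.*?>)/;

# A counted quantifier: between two and five "a" characters, nothing else.
my $counted = ("aaa" =~ /^a{2,5}$/) ? 1 : 0;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
print "counted: $counted\n";
```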

Anchors

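^     Match beginning of string (or of each line with the m modifier)
\A    Start of string, regardless of modifiers
$     Match end of string (or of each line with the m modifier)
\Z    End of string (or just before a final newline), regardless of modifiers
\b    Match word boundary
\B    Non-word boundary
\<    Start of word (GNU grep/sed and Vim only; in Perl use \b)
\>    End of word (GNU grep/sed and Vim only; in Perl use \b)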
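
A small sketch of how the boundary and line anchors behave in Perl. The sample text is made up, and the counts in the comments follow from that text.

```perl
use strict;
use warnings;

my $text = "cat concatenate\ncat";

# \b restricts the match to the standalone word "cat".
my @words = $text =~ /\bcat\b/g;      # "concatenate" is skipped

# \B requires that there is NO word boundary, so only the "cat"
# buried inside "concatenate" qualifies.
my @inner = $text =~ /\Bcat\B/g;

# Without /m, ^ matches only at the very start of the string.
my @starts = $text =~ /^cat/g;

# With /m, ^ also matches right after each newline.
my @line_starts = $text =~ /^cat/mg;

printf "%d %d %d %d\n",
    scalar(@words), scalar(@inner), scalar(@starts), scalar(@line_starts);
```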

Character classes

\w    Word (alphanumeric plus "_")
\W    Non-word
\s    Whitespace
\S    Non-whitespace
\d    Digit
\D    Non-digit
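\cX   Control character (e.g. \cC is Ctrl-C)
\xHH  Character with hex code HH (for a hex digit, use [[:xdigit:]])
\o{N} Character with octal code N (for an octal digit, use [0-7])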

\n    Newline
\r    Carriage return
\t    Tab
\v    Vertical whitespace in Perl 5.10+ (a vertical tab in some other flavors)
\f    Formfeed
\a    Alarm (bell, beep)
\e    Escape

Escapes

\    Escape next character (e.g. \^ for literal caret rather than line start)
\Q   Begin sequence of literals (metacharacters are quoted until \E)
\E   End literal sequence
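
\Q and \E matter most when interpolating a variable into a pattern. A minimal sketch, where $literal stands in for some hypothetical user-supplied text:

```perl
use strict;
use warnings;

my $literal = "a.b*c";   # hypothetical input containing metacharacters

# Interpolated as-is, "." and "*" keep their regex meaning,
# so an unrelated string can match.
my $loose = ("aXbbbc" =~ /^$literal$/) ? 1 : 0;

# Between \Q and \E every metacharacter is escaped and matched literally.
my $strict_miss = ("aXbbbc" =~ /^\Q$literal\E$/) ? 1 : 0;
my $strict_hit  = ("a.b*c"  =~ /^\Q$literal\E$/) ? 1 : 0;

print "$loose $strict_miss $strict_hit\n";
```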

Pattern modifiers

Example: An "i" at the end of the expression makes it case insensitive: $bar =~ m/foobar/i

g   Global (match all occurrences)
m   Multi-line (^ and $ also match at the start and end of each line, not just at the very left and right edges of the string)
s   Single line (. matches anything, including newlines)
x   Ignore whitespace in pattern unless it's backslashed or inside brackets, and permit comments (allows writing the regex itself in a more readable format, with line breaks)
a   ASCII-restricted matching (\d, \s, \w and the POSIX classes match only ASCII characters, even in Unicode strings)
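
The modifiers can be compared on one small input. This is a sketch with invented sample data, not an exhaustive tour.

```perl
use strict;
use warnings;

my $text = "one\ntwo\nthree";

# g: in list context, return every match.
my @words = $text =~ /\w+/g;

# m: $ matches at the end of each line, so we get the last letter per line.
my @line_ends = $text =~ /(\w)$/mg;

# s: without it "." refuses to cross a newline; with it, it does.
my $dot_plain  = ($text =~ /one.two/)  ? 1 : 0;
my $dot_single = ($text =~ /one.two/s) ? 1 : 0;

# x: whitespace and comments inside the pattern are ignored.
my ($year, $month) = "2024-05" =~ /
    (\d{4})   # four-digit year
    -
    (\d{2})   # two-digit month
/x;

print "@words | @line_ends | $dot_plain $dot_single | $year $month\n";
```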

If you wanted to ignore case for only part of a regular expression:

/(?i)foobar(?-i)BaT/
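
A quick check of that pattern, as a sketch with made-up test strings. The scoped form (?i:...) shown at the end is an alternative spelling that limits the modifier to one group.

```perl
use strict;
use warnings;

# Case is ignored for "foobar" only; "BaT" must match exactly.
my $re = qr/(?i)foobar(?-i)BaT/;

my $hit  = ("FOOBARBaT" =~ $re) ? 1 : 0;
my $miss = ("foobarbat" =~ $re) ? 1 : 0;

# Equivalent scoped form: the modifier applies inside the group only.
my $scoped     = qr/(?i:foobar)BaT/;
my $scoped_hit = ("FooBarBaT" =~ $scoped) ? 1 : 0;

print "$hit $miss $scoped_hit\n";
```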

Grouping and ranges and backreferences

Example:

if ($string =~ m/John (Smith|Smyth|Psmith)/) { print "I found John!\n" }

.         Any character except \n
(foo|bar) foo or bar
(?:foo)   Non-capturing group
[xyz]     x or y or z (single character)
[^xyz]    NOT x or y or z
[a-f]     Single character in range a through f

Example: If we want to match "All the king's horses" but not match the escaped "All the king''s horses" (doubled single quote), we combine negated character classes with a negative lookahead to match one single quote but not two. The pattern is anchored at both ends so that it cannot succeed by starting at the second quote:

^[^']*'(?!')[^']*$
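
To try the quote example, here is a small sketch. It anchors the pattern with ^ and $ so that the whole string is tested; without anchors the regex engine may begin matching at the second quote of a doubled pair and still succeed.

```perl
use strict;
use warnings;

# One single quote that is not immediately followed by another,
# with no other quotes anywhere in the string.
my $re = qr/^[^']*'(?!')[^']*$/;

my $single  = ("All the king's horses"  =~ $re) ? 1 : 0;
my $doubled = ("All the king''s horses" =~ $re) ? 1 : 0;

print "single: $single, doubled: $doubled\n";
```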

Grouping with parens is also the way to capture matches (group $1, $2, etc.). Captures can also be used as backreferences: \1 inside the pattern itself, and $1 in the replacement, like: s/(November) 3rd/$1 4th/g

$1, $2, $3  First, second, third captured groups
$+     Highest-numbered captured group that matched
$&     The entire match
$`     Before match
$'     After match
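
The capture variables can be seen in action with a short sketch. All strings and variable names here are invented for illustration.

```perl
use strict;
use warnings;

# In the replacement side of s///, $1 holds what the first group captured.
(my $fixed = "November 3rd and December 3rd") =~ s/(November) 3rd/$1 4th/g;

# Inside the pattern itself, \1 refers back to the first group;
# here it finds a word that is immediately repeated.
my ($dup) = "this is is a test" =~ /\b(\w+) \1\b/;

# $`, $& and $' hold the text before, inside and after the last match.
my ($pre, $match, $post);
if ("hello world" =~ /o w/) {
    ($pre, $match, $post) = ($`, $&, $');
}

# $+ holds the highest-numbered group that actually took part in the match.
my $last_group;
if ("b" =~ /(a)|(b)/) {
    $last_group = $+;
}

print "$fixed\n$dup\n[$pre][$match][$post]\n$last_group\n";
```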

Assertions, lookahead and lookbehind

(?=...)           Positive lookahead
(?!...)           Negative lookahead
(?<=...)          Positive lookbehind
(?<!...)          Negative lookbehind
(?>...)           Once-only (atomic) sub-expression
(?(cond)yes)      Conditional if-then
(?(cond)yes|no)   Conditional if-then-else
(?#...)           Comment

A regex with positive lookahead matches something followed by something else. foo(?=t).* matches "football" but not "foobar".

A regex with negative lookahead matches something not followed by something else. foo(?!t).* matches "foobar" but not "football".

Lookbehind works the same way, with (?<=foot)ball ("ball" preceded by "foot") and (?<!wrecking)ball ("ball" not preceded by "wrecking").
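
The four examples above can be checked directly. A minimal sketch follows; the extra test words "wreckingball" and "baseball" are additions for illustration. It also shows that a lookahead is zero-width: the "t" is required but not included in the match.

```perl
use strict;
use warnings;

# Positive lookahead: "foo" must be followed by "t".
my $pos_hit  = ("football" =~ /foo(?=t).*/) ? 1 : 0;
my $pos_miss = ("foobar"   =~ /foo(?=t).*/) ? 1 : 0;

# Negative lookahead: "foo" must NOT be followed by "t".
my $neg_hit  = ("foobar"   =~ /foo(?!t).*/) ? 1 : 0;
my $neg_miss = ("football" =~ /foo(?!t).*/) ? 1 : 0;

# The lookahead consumes nothing, so the captured text is just "foo".
my ($captured) = "football" =~ /(foo(?=t))/;

# Lookbehind: test what comes before "ball".
my $behind_hit  = ("football"     =~ /(?<=foot)ball/)     ? 1 : 0;
my $behind_miss = ("wreckingball" =~ /(?<!wrecking)ball/) ? 1 : 0;
my $behind_ok   = ("baseball"     =~ /(?<!wrecking)ball/) ? 1 : 0;

print "$pos_hit $pos_miss $neg_hit $neg_miss $captured ",
      "$behind_hit $behind_miss $behind_ok\n";
```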

POSIX classes

[:upper:]    Like [A-Z]
[:lower:]    Like [a-z]
[:alpha:]    Like [a-zA-Z]
[:digit:]    Like [0-9]
[:alnum:]    Like [a-zA-Z0-9]
[:word:]     Like [a-zA-Z0-9_]
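[:xdigit:]   Like [0-9a-fA-F]
[:punct:]    Any punctuation
[:space:]    Like [ \t\r\n\f\v]
[:blank:]    Space or tab

These names are only valid inside a bracket expression, so a complete class is written with doubled brackets, like [[:upper:]] or [[:alpha:][:digit:]].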
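
Perl accepts the POSIX class names too, as long as they sit inside a bracket expression (note the doubled brackets). A small sketch with made-up sample strings:

```perl
use strict;
use warnings;

my $text = "Hello World 42";

# Each POSIX class goes inside its own bracket expression.
my @upper  = $text =~ /[[:upper:]]/g;
my @digits = $text =~ /[[:digit:]]+/g;

# Hex digits in either case.
my $is_hex = ("DEADbeef" =~ /^[[:xdigit:]]+$/) ? 1 : 0;

# A class can be negated within the bracket expression.
my @non_alnum = "a-b_c!" =~ /[^[:alnum:]]/g;

print "@upper | @digits | $is_hex | @non_alnum\n";
```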

POSIX regular expressions come in two types: Basic and Extended. Extended POSIX regular expressions are more Perl-like and generally more powerful, although the standard does not define back-references for them. Basic POSIX regular expressions include back-references, like \1 and \2 for the first and second captured groups. However, basic regular expressions lack support for alternate either/or groups, like `(foo|bar)`. See re_format(7).

Links