Regular Expressions

In Perl:

$foo =~ m/^(F|f)oo\s*(B|b)ar$/
$foo =~ s/foo bar/bas bat/
$foo =~ tr/aeiou/AEIOU/

Does not match:

$bar !~ m/foobar/i
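
The operators above can be tried out with a short script. This is a minimal sketch; the sample strings and variable names are made up for illustration. Note that tr/// takes plain lists of characters rather than a bracketed character class.

```perl
use strict;
use warnings;

my $foo = "Foo   bar";

# m// tests for a match: "foo" or "Foo", optional whitespace, "bar" or "Bar".
my $matched = ($foo =~ m/^(F|f)oo\s*(B|b)ar$/) ? 1 : 0;

# s/// replaces the first match in place and returns the number of replacements.
my $text  = "foo bar and foo bar";
my $count = ($text =~ s/foo bar/bas bat/);

# tr/// maps characters one-to-one and returns how many were translated.
my $vowels  = "education";
my $changed = ($vowels =~ tr/aeiou/AEIOU/);

# !~ is true when the pattern does NOT match.
my $no_match = ("bazqux" !~ m/foobar/i) ? 1 : 0;

print "$matched $count '$text' $changed '$vowels' $no_match\n";
```

Without the g modifier, s/// only changes the first occurrence, so the second "foo bar" in $text is left alone.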

Quantifiers

* Zero or more
+ One or more
? Zero or one
{7} Exactly seven
{3,} Three or more
{2,5} Two, three, four, or five

Putting a question mark after a quantifier (like x*?) makes it non-greedy, so it matches as few characters as possible.
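
A small sketch of the difference between greedy and non-greedy matching. The sample string is invented for illustration.

```perl
use strict;
use warnings;

my $html = "<b>bold</b> and <i>italic</i>";

# Greedy: .* grabs as much as possible, so the match runs to the last ">".
my ($greedy) = $html =~ /(<.*>)/;

# Non-greedy: .*? stops at the first ">" that lets the match succeed.
my ($lazy) = $html =~ /(<.*?>)/;

print "greedy: $greedy\n";   # the whole string
print "lazy:   $lazy\n";     # <b>
```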

Anchors

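^ Match beginning of string (or of each line with the m modifier)
\A Start of string
$ Match end of string or just before a final newline (or end of each line with the m modifier)
\Z End of string, or just before a final newline
\z Very end of string
\b Match word boundary
\B Non-word boundary
\< Start of word (GNU grep and Vim, not Perl; use \b in Perl)
\> End of word (GNU grep and Vim, not Perl; use \b in Perl)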
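
A sketch of how the anchors behave in Perl, assuming a two-line sample string. It also uses \z, Perl's anchor for the very end of the string, to contrast with $ and \Z.

```perl
use strict;
use warnings;

my $text = "first line\nsecond line\n";

# Count matches with the "= () =" idiom (list assignment in scalar context).
my $plain = () = $text =~ /^\w+/g;     # ^ matches only at the start of the string
my $multi = () = $text =~ /^\w+/mg;    # with /m, ^ matches at the start of every line
my $abs   = () = $text =~ /\A\w+/mg;   # \A ignores /m

# \b only matches where a word starts or ends.
my $words = () = "cat concatenate cat" =~ /\bcat\b/g;

# $ and \Z tolerate one trailing newline; \z does not.
my $dollar  = ("foo\n" =~ /foo$/)  ? 1 : 0;
my $big_z   = ("foo\n" =~ /foo\Z/) ? 1 : 0;
my $small_z = ("foo\n" =~ /foo\z/) ? 1 : 0;

print "$plain $multi $abs $words $dollar $big_z $small_z\n";   # 1 2 1 2 1 1 0
```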

Character classes

\w Word (alphanumeric plus “_”)
\W Non-word
\s Whitespace
\S Non-whitespace
\d Digit
\D Non-digit
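\cX Control character (e.g. \cC is Ctrl-C)
\xHH Character with hexadecimal code HH (e.g. \x41 is “A”); for a hex digit use [[:xdigit:]]
\o{NNN} Character with octal code NNN (Perl 5.14+); for an octal digit use [0-7]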
\n Newline
\r Carriage return
\t Tab
\v Vertical whitespace (Perl 5.10+; a vertical tab in many other regex flavors)
\f Formfeed
\a Alarm (bell, beep)
\e Escape

Escapes

\ Escape next character (e.g. \^ for a literal caret rather than line start)
\Q Begin sequence of literals
\E End literal sequence
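
\Q and \E are most useful when interpolating a variable that should be matched literally. A minimal sketch, where $literal stands in for hypothetical user-supplied text:

```perl
use strict;
use warnings;

my $literal = "1+1=2 (approx.)";   # hypothetical user-supplied text
my $input   = "1+1=2 (approx.)";

# Without quoting, "+", "(", ")" and "." act as metacharacters, so the
# pattern does not match the identical string.
my $unquoted = ($input =~ /\A$literal\z/) ? 1 : 0;

# \Q...\E backslashes every metacharacter in the interpolated text.
my $quoted = ($input =~ /\A\Q$literal\E\z/) ? 1 : 0;

# quotemeta() is the function form of \Q.
my $escaped = quotemeta("a.b*c");

print "$unquoted $quoted $escaped\n";   # 0 1 a\.b\*c
```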

Pattern modifiers

Example: An i at the end of the expression makes it case insensitive:

$bar =~ m/foobar/i

i Case-insensitive
g Global (match all)
m Multi-line (^ and $ match at the start and end of every line, not just at the edges of the string)
s Single line (. matches any character, including newlines)
x Extended legibility: ignore whitespace in the pattern unless it’s backslashed or inside brackets, and permit comments (allows writing the regex itself in a more readable format, with line breaks)
a ASCII-restricted matching: \d, \s, \w and the POSIX classes match only ASCII characters, even in Unicode strings
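
A sketch exercising several modifiers at once. The log text and date string are invented for illustration, and the a modifier needs Perl 5.14 or later.

```perl
use strict;
use warnings;

my $log = "error: disk\nwarn: cpu\nerror: net\n";

# /g in list context returns every capture; /m makes ^ and $ work per line.
my @errors = $log =~ /^error: (\w+)$/mg;          # ("disk", "net")

# /s lets "." match a newline.
my $with_s    = ("a\nb" =~ /a.b/s) ? 1 : 0;       # 1
my $without_s = ("a\nb" =~ /a.b/)  ? 1 : 0;       # 0

# /x allows whitespace and comments inside the pattern.
my $date_re = qr/
    (\d{4})   # year
    -
    (\d{2})   # month
    -
    (\d{2})   # day
/x;
my ($year, $month, $day) = "2024-01-31" =~ $date_re;

# /a restricts \d to ASCII digits; U+0663 is ARABIC-INDIC DIGIT THREE.
my $arabic_three  = "\x{0663}";
my $unicode_digit = ($arabic_three =~ /\d/)  ? 1 : 0;   # 1
my $ascii_digit   = ($arabic_three =~ /\d/a) ? 1 : 0;   # 0

print "@errors | $with_s $without_s | $year $month $day | $unicode_digit $ascii_digit\n";
```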

If you wanted to ignore case for only part of a regular expression:

/(?i)foobar(?-i)BaT/
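
A quick check of the inline modifier, using made-up test strings. The scoped form (?i:...) is shown as an equivalent way to write the same thing.

```perl
use strict;
use warnings;

my $re     = qr/(?i)foobar(?-i)BaT/;
my $scoped = qr/(?i:foobar)BaT/;     # same effect, limited to the group

# "foobar" may be in any case, but "BaT" must match exactly.
my $mixed  = ("FooBarBaT" =~ $re) ? 1 : 0;       # 1
my $lower  = ("foobarBaT" =~ $re) ? 1 : 0;       # 1
my $wrong  = ("foobarbat" =~ $re) ? 1 : 0;       # 0

my $scoped_ok  = ("FOOBARBaT" =~ $scoped) ? 1 : 0;   # 1
my $scoped_bad = ("FOOBARBAT" =~ $scoped) ? 1 : 0;   # 0

print "$mixed $lower $wrong $scoped_ok $scoped_bad\n";
```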

Grouping and ranges and backreferences

if ($string =~ m/John (Smith|Smyth|Psmith)/) {
	print "I found John!\n"
}

. Any character except \n
(foo|bar) foo or bar
(?:foo) Non-capturing group
[xyz] x or y or z (single character)
[^xyz] NOT x or y or z
[a-f] Single character in range a through f

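Example: If we want to match “All the king's horses” but not the escaped “All the king''s horses” (doubled single quote), we combine negated character classes with a negative lookahead to match one single quote but not two. The pattern has to be anchored to the whole string; otherwise it can still find a match starting at the second quote:

^[^']*'(?!')[^']*$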
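
A minimal sketch that checks the lookahead pattern against both strings. It tries the pattern with and without ^...$ anchors, since an unanchored search can still succeed by starting at the second quote.

```perl
use strict;
use warnings;

my $single  = "All the king's horses";
my $doubled = "All the king''s horses";

# Anchored: the whole string must contain exactly one quote, not followed by another.
my $anchored = qr/^[^']*'(?!')[^']*$/;
my $single_ok  = ($single  =~ $anchored) ? 1 : 0;   # 1
my $doubled_ok = ($doubled =~ $anchored) ? 1 : 0;   # 0

# Unanchored: the engine is free to start at the second quote, so it matches.
my $doubled_loose = ($doubled =~ /[^']*'(?!')[^']*/) ? 1 : 0;   # 1

print "$single_ok $doubled_ok $doubled_loose\n";
```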

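Grouping with parens is also the way to capture matches (groups $1, $2, etc.). Captures can also be used as backreferences: \1 inside the pattern itself, and $1 in the replacement part of a substitution (Perl accepts \1 there too, but warns about it), like:

s/(November) 3rd/$1 4th/g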

$1, $2, $3 First, second, third captured groups
$+ Last (highest-numbered) captured group that matched
$& The entire match
$` Before match
$' After match
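
A sketch showing what each capture variable holds after a successful match. The sample sentence is invented for illustration.

```perl
use strict;
use warnings;

my $s = "Remember November 3rd, not December";
my ($month, $day, $suffix, $last, $whole, $before, $after);

if ($s =~ /(November) (\d+)(st|nd|rd|th)/) {
    ($month, $day, $suffix) = ($1, $2, $3);
    $last   = $+;    # last captured group that matched: "rd"
    $whole  = $&;    # "November 3rd"
    $before = $`;    # "Remember "
    $after  = $';    # ", not December"
}

# \1 inside the pattern refers back to what group 1 matched (a doubled word here).
my ($repeated) = "it was was fine" =~ /\b(\w+) \1\b/;

# $1 in the replacement reuses the capture.
(my $fixed = $s) =~ s/(November) 3rd/$1 4th/g;

print "$month|$day|$suffix|$last|$whole|$before|$after|$repeated|$fixed\n";
```

Copy the values out of $1, $&, and friends right away, as done here: the next successful match overwrites them.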

Assertions, lookahead and lookbehind

?= Positive lookahead
?! Negative lookahead
?<= Positive lookbehind
?<! Negative lookbehind
?> Once-only sub-expression
?() Conditional if-then
?()| Conditional if-then-else
?# Comment

A regex with positive lookahead matches something followed by something else. foo(?=t).* matches “football” but not “foobar”.

A regex with negative lookahead matches something not followed by something else. foo(?!t).* matches “foobar” but not “football”.

Lookbehind works the same way, with (?<=foot)ball (“ball” preceded by “foot”) and (?<!wrecking)ball (“ball” not preceded by “wrecking”).
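
The four lookaround examples can be checked directly. This sketch also shows that a lookahead does not consume what it inspects, and that a lookbehind only looks at the characters immediately before the match (so a space defeats it).

```perl
use strict;
use warnings;

my $ahead_yes = ("football" =~ /foo(?=t).*/) ? 1 : 0;   # 1
my $ahead_no  = ("foobar"   =~ /foo(?=t).*/) ? 1 : 0;   # 0
my $neg_yes   = ("foobar"   =~ /foo(?!t).*/) ? 1 : 0;   # 1
my $neg_no    = ("football" =~ /foo(?!t).*/) ? 1 : 0;   # 0

# The "t" is checked but not included in the match.
my ($consumed) = "football" =~ /(foo(?=t))/;            # "foo"

my $behind_yes = ("football"     =~ /(?<=foot)ball/)     ? 1 : 0;   # 1
my $behind_no  = ("wreckingball" =~ /(?<!wrecking)ball/) ? 1 : 0;   # 0

# With a space in between, "ball" is preceded by " ", not "wrecking".
my $spaced = ("wrecking ball" =~ /(?<!wrecking)ball/) ? 1 : 0;      # 1

print "$ahead_yes $ahead_no $neg_yes $neg_no $consumed $behind_yes $behind_no $spaced\n";
```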

POSIX classes

[:upper:] Like [A-Z]
[:lower:] Like [a-z]
[:alpha:] Like [a-zA-Z]
[:digit:] Like [0-9]
[:alnum:] Like [a-zA-Z0-9]
[:word:] Like [a-zA-Z0-9_]
[:xdigit:] Like [0-9a-fA-F]
[:punct:] Any punctuation
[:space:] Like [ \t\r\n\f\v] (includes the space character)
[:blank:] Space or tab
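
In Perl, a POSIX class is only valid inside a bracketed character class, so [:upper:] is written [[:upper:]]. A short sketch with made-up sample strings:

```perl
use strict;
use warnings;

my $s = "Room 42B, Floor 3";

# Note the double brackets: the POSIX class sits inside a character class.
my @upper  = $s =~ /[[:upper:]]/g;      # ("R", "B", "F")
my @digits = $s =~ /[[:digit:]]+/g;     # ("42", "3")

# [:xdigit:] accepts both upper- and lower-case hex digits.
my $is_hex = ("DEADbeef" =~ /\A[[:xdigit:]]+\z/) ? 1 : 0;   # 1

# [:blank:] covers spaces and tabs; collapse runs of them to one space.
(my $squeezed = "a \t b") =~ s/[[:blank:]]+/ /g;            # "a b"

print "@upper | @digits | $is_hex | $squeezed\n";
```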

POSIX regular expressions come in two types: Basic (BRE) and Extended (ERE). Extended POSIX regular expressions are more Perl-like and generally more powerful, although the standard does not define back-references for them. Basic POSIX regular expressions include back-references, like \1 and \2 for the first and second captured groups. However, basic regular expressions lack support for alternation (either/or groups), like (foo|bar). See re_format(7).