Regular Expressions

A regular expression (RE) is a mechanism for describing patterns in text when executing the Find, the Replace, and the "Find in Files" commands. An RE is made up of ordinary characters, some of which take on the special meanings described below.

See How to Use Regular Expressions for the differences between the default syntax and POSIX syntax.

Ordinary Characters

An ordinary character is an RE that matches itself. It can be any character, except <newline> and the special characters listed below. An ordinary character preceded by a backslash is treated as the ordinary character itself, except when the character is (, ), <, >, or the letters f, n, t and x, or the digits 1 through 9.

Hex Characters

Any character can be represented by its hex value. This is specified with the pattern \xdd, where dd is any 2-digit hexadecimal number, excluding zero.
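
As a rough illustration, Python's re module happens to accept the same \xdd form inside a pattern. Python is only a stand-in here, not the engine behind the Find commands, and the sample text is made up.

```python
import re

# \x41 is the hex code for "A"; \x2C is the hex code for a comma.
sample = "A,B"

hex_a = re.search(r"\x41", sample)
hex_comma = re.search(r"\x2C", sample)

# Print each matched character and the offset where it was found.
print(hex_a.group(), hex_a.start())
print(hex_comma.group(), hex_comma.start())
```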

Tabs

A tab character is represented by the pattern \t.

Page Breaks

A page break (form feed) character is represented by the pattern \f.

Line Breaks

A line break is represented by the pattern \n. This matches carriage return and line feed characters. Note that \n cannot be combined with repetition operators (see below), so you can only match an exact number of line breaks (e.g. \n\n will match a single blank line). Do not use \n to constrain matches to the end of a line, as it is much more efficient to use "$" (see Expression Anchoring below). Use \n only to match text that spans line boundaries.
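
The difference between spanning a line boundary and anchoring to the end of a line can be sketched in Python. This assumes Python's re module as a stand-in: there \n matches only a line feed, and $ needs the re.MULTILINE flag to match at the end of every line. The sample text is invented.

```python
import re

text = "first\n\nsecond line\nthird"

# \n\n spans a line boundary: the end of "first" plus one empty line.
blank = re.search(r"first\n\nsecond", text)

# For end-of-line matching, anchor with $ instead of consuming \n.
# re.MULTILINE makes $ match at the end of every line in Python.
line_ends = re.findall(r"\w+$", text, flags=re.MULTILINE)

print(blank is not None)
print(line_ends)
```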

Special Characters

These special characters can be rendered ordinary by preceding them with a backslash (\) if they are single special characters, or by removing the preceding backslash if they are compound special characters.

Character Context
. [ \ The period, left square bracket, and backslash are special except when used in a Class Expression.
* ? + Asterisk, question mark and plus are special except when used in a class expression, as the first character of an RE, or as the first character of a Tagged Expression.
- The hyphen is special in a Class Expression, except as the first or last character of that expression.
^ The circumflex is special when used as the first character of an entire RE (see Expression Anchoring), or as the first character of a Class Expression.
$ The dollar sign is special when used as the last character of an entire RE (see Expression Anchoring).
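
One way to apply the rules above is a small helper that turns arbitrary text into an RE matching it literally. The sketch below is an assumption built only from the rules on this page: escape_literal is a hypothetical name, and it has not been checked against the editor itself. Compound specials such as \( and \{ need no treatment, because the bare characters ( and { are already ordinary.

```python
# Single special characters listed in the table above; a backslash in
# front of each one makes it ordinary.
SINGLE_SPECIALS = set(".[\\*?+^$")


def escape_literal(text: str) -> str:
    """Return an RE that should match `text` literally (hypothetical helper)."""
    out = []
    for ch in text:
        if ch == "\n":
            # A newline is not an ordinary character; write it as \n.
            out.append("\\n")
        elif ch in SINGLE_SPECIALS:
            out.append("\\" + ch)
        else:
            # Includes ( ) { } | < >, which are special only after a backslash.
            out.append(ch)
    return "".join(out)


print(escape_literal("a.b*c"))
print(escape_literal("f(x)"))
```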

Wildcard Character

The period (.), when used outside of a class expression, matches any character except newline.

Repetition Operators

The asterisk (*) matches zero or more occurrences of the smallest possible preceding regular expression, while the question mark (?) matches zero or one, and the plus sign (+) matches at least one occurrence. For example, A*b+ matches zero or more A's followed by one or more b's.
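
The three operators are written the same way in Python's re module, so the example above can be tried there as a quick check. This is only an illustration in a different engine. The second pattern shows that an operator applies to the smallest preceding expression, here the single character b.

```python
import re

pattern = re.compile(r"A*b+")

many_a = pattern.fullmatch("AAAbb") is not None   # A's followed by b's
zero_a = pattern.fullmatch("b") is not None       # zero A's is allowed
no_b = pattern.fullmatch("AAA") is not None       # at least one b is required

# + binds to "b" alone, not to "ab".
single = re.compile(r"ab+")
repeats_b = single.fullmatch("abbb") is not None
repeats_ab = single.fullmatch("abab") is not None

# ? makes the preceding character optional.
optional = re.compile(r"colou?r")
both_spellings = [w for w in ("color", "colour") if optional.fullmatch(w)]

print(many_a, zero_a, no_b, repeats_b, repeats_ab, both_spellings)
```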

Interval Operator

The interval operator repeats the smallest possible preceding regular expression the given number of times. The options are:

\{count\} Matches exactly count times.
\{min,\} Matches at least min times.
\{min,max\} Matches between min and max times.
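
The three forms can be sketched in Python, with one caveat: Python's re module writes the braces without backslashes, so \{3\} here corresponds to {3} there. The patterns and test strings are invented for illustration.

```python
import re

exactly_three = re.compile(r"a{3}")    # \{3\} in the syntax above
at_least_two = re.compile(r"a{2,}")    # \{2,\}
two_to_four = re.compile(r"a{2,4}")    # \{2,4\}


def matches(pattern, text):
    """True when the whole of `text` matches `pattern`."""
    return pattern.fullmatch(text) is not None


print(matches(exactly_three, "aaa"), matches(exactly_three, "aa"))
print(matches(at_least_two, "aaaaa"), matches(at_least_two, "a"))
print(matches(two_to_four, "aaa"), matches(two_to_four, "aaaaa"))
```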

Alternation Operator

The alternation operator (\|) matches either the expression to its left or the one to its right. It has a lower precedence than any other regular expression operator, so the surrounding REs must be bracketed with \(...\) if only a part of them is to be matched.
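
The effect of the low precedence can be seen in a short Python sketch. Python's re module writes alternation as | and grouping as (...), without backslashes, so this shows the principle and not the exact syntax above.

```python
import re

# Without grouping, | splits the whole RE: "gra" or "ey".
loose = re.compile(r"gra|ey")
# With grouping, only the bracketed part alternates: "gray" or "grey".
grouped = re.compile(r"gr(a|e)y")

loose_gray = loose.fullmatch("gray") is not None
loose_ey = loose.fullmatch("ey") is not None
grouped_grey = grouped.fullmatch("grey") is not None
grouped_ey = grouped.fullmatch("ey") is not None

print(loose_gray, loose_ey, grouped_grey, grouped_ey)
```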

Class Expressions

A class expression is an RE, enclosed in square brackets ([…]), that matches any one of the elements contained in the brackets. The permitted elements of a class expression are:

Simple Characters:

These are single characters that match themselves. To match a right square bracket (]), it must be the first character of the class expression, after any initial circumflex (see Negated Class Expressions). To match a hyphen, it must be either the first or the last character of the class expression. For example [AaBb] matches upper or lower case A or B.

Negated Class Expressions:

If the first character of a class expression is the circumflex (^), the expression matches any character not in the class. For example [^AB^] matches any character except A, B and the circumflex itself.

Range Expressions:

A range expression is two characters separated by a hyphen (-). It matches any character whose code point lies between those of the two characters, inclusive. For example, [A-Za-z0-9-] matches any upper or lower case letter or digit, or the hyphen itself. Note that [a-z] also matches upper case letters, unless the option to match case is selected.
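
The class expressions described so far behave the same way in Python's re module, which makes it a convenient place to experiment. One assumed difference: Python matches case by default, so the re.IGNORECASE flag stands in for leaving the match case option unselected. The sample strings are invented.

```python
import re

text = "A-b]^c9"

simple = re.findall(r"[AaBb]", text)           # upper or lower case A or B
closing = re.findall(r"[]c]", text)            # ] comes first, so it is literal
negated = re.findall(r"[^AB^]", "AB^C")        # anything but A, B and ^
ranged = re.findall(r"[A-Za-z0-9-]", text)     # trailing hyphen is literal
folded = re.findall(r"[a-z]", "aBc", flags=re.IGNORECASE)

print(simple, closing, negated, ranged, folded)
```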

Character Class Operators:

These can be used as an alternative way of representing classes of characters. For example, [a-z] is equivalent to [[:lower:]] and [a-z0-9] is equivalent to [[:lower:][:digit:]]. (Note the extra pairs of brackets.) The defined classes are:

Expression Description
[:alpha:] Any letter.
[:lower:] Any lower case letter.
[:upper:] Any upper case letter.
[:alnum:] Any digit or letter.
[:digit:] Any digit.
[:xdigit:] Any hexadecimal digit (0-9, a-f or A-F).
[:blank:] Space or tab.
[:space:] Space, tab, vertical tab or form feed.
[:cntrl:] Control characters (Delete and ASCII codes less than space).
[:print:] Printable characters, including space.
[:graph:] Printable characters, excluding space.
[:punct:] Anything that is not a control or alphanumeric character.
[:word:] Letters, hyphens and apostrophes.
[:token:] Any of the characters defined on the Syntax page for the document class, or in the syntax definition file if syntax highlighting is enabled for the document class.
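
Not every engine understands the [:name:] operators. Python's re module, for instance, does not, so a pattern written this way has to be expanded before it can be tried there. The helper below is a hypothetical sketch: expand_classes and its table are assumptions, the ranges are ASCII-only, and [:word:] and [:token:] are left out because their meaning depends on the editor. An unknown class name raises KeyError.

```python
import re

# ASCII-only stand-ins for a few of the class names above (an assumption:
# the editor may treat non-ASCII letters differently).
POSIX_CLASSES = {
    "alpha": "a-zA-Z",
    "lower": "a-z",
    "upper": "A-Z",
    "digit": "0-9",
    "alnum": "a-zA-Z0-9",
    "xdigit": "0-9a-fA-F",
    "blank": " \\t",
}


def expand_classes(pattern: str) -> str:
    """Replace each [:name:] with an equivalent range list (hypothetical helper)."""
    def repl(match):
        return POSIX_CLASSES[match.group(1)]
    return re.sub(r"\[:([a-z]+):\]", repl, pattern)


expanded = expand_classes("[[:lower:][:digit:]]")
print(expanded)
```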

Expression Anchoring

An RE can be restricted to matching strings that begin or end a line or word, as follows:

^ A circumflex as the first character of an RE anchors the expression to the beginning of the line.
$ A dollar sign as the last character of an RE anchors the expression to the end of the line.
\< The character pair \< anchors the next RE to the start of a word.
\> The character pair \> anchors the previous RE to the end of a word.
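
The four anchors can be approximated in Python for experimentation. This is a sketch under two assumptions: Python needs re.MULTILINE for ^ and $ to apply to every line, and it has no \< or \> operators, so \b (a boundary at either end of a word) is used as a rough substitute. The sample text is invented.

```python
import re

text = "cat concat\ncatalog cat"

line_start = re.findall(r"^cat", text, flags=re.MULTILINE)
line_end = re.findall(r"cat$", text, flags=re.MULTILINE)
word_start = re.findall(r"\bcat\w*", text)   # roughly \<cat
word_end = re.findall(r"\w*cat\b", text)     # roughly cat\>

print(line_start, line_end)
print(word_start, word_end)
```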

Tagged Expressions

A tagged expression is an RE that starts with the pair \( and ends with the pair \). There can be up to nine such expressions in a complete RE. Such an expression matches the same text as the expression without the surrounding \( and \). The first expression defined in this way can be referenced as \1 later in the RE, and so on up to \9 for the ninth tagged expression. Each such reference matches the same string as its original tagged expression. For example, \(tu\) \1 matches the string "tu tu".

References to tagged expressions can also be used in Replacement Expressions.
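
Back-references in both positions can be sketched in Python, where a tagged expression is written (...) without backslashes. The replacement syntax shown is that of Python's re.sub and is only assumed to resemble the editor's Replacement Expressions. The sample strings are invented.

```python
import re

# (tu) \1 corresponds to \(tu\) \1 in the syntax above.
repeat = re.search(r"(tu) \1", "tutu tu tu")

# The same idea generalised: find any word that is immediately repeated.
doubled = re.findall(r"\b(\w+) \1\b", "it is is what it it is")

# In a replacement, \1 and \2 refer to the tagged text, here swapped.
swapped = re.sub(r"(\w+), (\w+)", r"\2 \1", "Doe, Jane")

print(repeat.group(), doubled, swapped)
```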