Regular Expressions for Word Puzzles

Regular expressions are a powerful way of specifying patterns of text. The classical tool for performing searches based on them is grep, a built-in utility on Unix/Linux systems. Grep operates as a filter; it reads in a text file, checks each line separately against the expression, and outputs the lines that match. To search for words matching a pattern, use a word list containing one word per line as the input file.
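The filter behavior described above can be sketched in a few lines of Python. This is only an illustration, not grep itself: the function name grep_words and the tiny word list are made up for the example, and Python's re module stands in for grep's matching engine.

```python
import re

def grep_words(pattern, words):
    """Return the words that contain a match for pattern, in input order."""
    regex = re.compile(pattern)
    return [word for word in words if regex.search(word)]

# Hypothetical stand-in for a word-list file with one word per line.
word_list = "llama\ncamel\nalpaca\nlamaze\n".splitlines()

# Each "line" is checked separately; only the matching ones are kept.
matches = grep_words("lama", word_list)
print(matches)
```

Like grep, the sketch reports a word if the pattern matches anywhere inside it, not only when it matches the whole word.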

If you don't have grep, you can use a tool on the web, such as the one hosted by the National Puzzlers' League. You'll be limited to the word lists that the web site provides.

A Note About Escaping Special Characters

Most characters (including all the letters and numbers) just represent themselves in a regular expression, but some of them have special meanings which are described below. Most of the characters with special meanings can be escaped by placing a backslash (\) before them, so \. would match a period, and \\ would match a backslash.

There are two different versions of grep syntax regarding certain special characters described in the Grouping Expressions and Other Repeat Indicators sections below.

In the so-called extended syntax, which is used by the NPL grep tool, the following special characters work as described in this document:

parentheses and pipes used to enclose groups
curly braces, question marks, and plus signs used as repeat indicators.

If you want to match a literal instance of one of these characters in the input, precede it with a backslash (\).

In basic syntax, which the grep command uses by default, the roles are swapped. Parentheses, pipes, curly braces, question marks, and plus signs match those literal characters, and you must precede them with backslashes to give them their special meanings. In this syntax, for instance, \(.\)\1 matches a double letter (any character appearing twice in a row; back-references are described under Grouping Expressions).

The Most Common Special Characters

The period (.) is a wildcard in regular expressions; it matches any single character. The asterisk (*) is a repeat indicator; it means that any number of repetitions (including zero) of the previous expression may occur. An expression is usually a single character; l*ama would match ama, lama, and llama (and if your word list is from Ogden Nash, lllama). These two characters are often used together (.*) to match any sequence of any length, including nothing at all.

Matching at the Beginning or End of the Line

When the caret (^) appears as the first character in a regular expression, it forces the rest of the expression to match only at the beginning of the line. Similarly, when the dollar sign ($) is the last character in the expression, it forces the rest of the expression to match only at the end of the line. These can be used together to force an expression to match the entire line.

If the example above, l*ama, were the entire regular expression, it would also match words such as lamaze and camaraderie. ^l*ama would still match lamaze but not camaraderie, and ^l*ama$ would match only ama, lama, llama, and lllama.
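The effect of the wildcard, the asterisk, and the two anchors can be checked with a short Python sketch. Python's re module is used here as a stand-in for grep (an assumption of this example; for these particular features the syntax is the same), and the word list is invented for illustration.

```python
import re

# Hypothetical miniature word list.
words = ["ama", "lama", "llama", "lllama", "lamaze", "camaraderie"]

def matching(pattern):
    """Words containing a match for pattern anywhere, as grep would report."""
    return [w for w in words if re.search(pattern, w)]

anywhere = matching("l*ama")    # unanchored: matches inside any word
at_start = matching("^l*ama")   # must match at the beginning of the word
whole = matching("^l*ama$")     # must match the entire word
wildcard = matching("^l.*a$")   # starts with l, ends with a, anything between

print(anywhere, at_start, whole, wildcard, sep="\n")
```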

In our search engine, anchors are sometimes unnecessary because you can specify the length of the words to search separately from the regular expression. They are still useful in many cases, such as when searching on patterns of variable length, or for words of varying length in which some pattern must appear at the beginning or end of the word.

Character Ranges

Square brackets ( [ and ] ) can be used to specify that any of a specified set of characters is valid in a particular position. You may specify all the allowed characters explicitly, such as [aeiouy] to match any vowel.

Alternatively, you can use hyphens to indicate ranges, such as [a-z] to match any lowercase letter. Place a hyphen first or last inside the brackets to include it among the set of characters which can be matched.

If the caret ^ appears as the first character inside the brackets, it inverts the match, so the bracketed expression will match any character except the ones inside the brackets. For instance, [^aeiouy] will match any character other than a vowel. [^A-Za-z] matches any non-letter.
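A few bracket expressions can be tried out as follows. This is a sketch using Python's re module in place of grep, with a made-up word list; the helper name matching is hypothetical.

```python
import re

# Hypothetical miniature word list, including punctuation.
words = ["rhythm", "nth", "banana", "co-op", "tsk", "can't"]

def matching(pattern):
    """Words containing a match for pattern anywhere."""
    return [w for w in words if re.search(pattern, w)]

no_vowels = matching("^[^aeiouy]+$")    # every character is a non-vowel
all_lower = matching("^[a-z]+$")        # lowercase letters only
has_non_letter = matching("[^A-Za-z]")  # at least one non-letter
has_punct = matching("[-']")            # hyphen placed first, so it is literal

print(no_vowels, all_lower, has_non_letter, has_punct, sep="\n")
```

Note that rhythm is excluded from the first result because y is counted as a vowel in [aeiouy].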

Grouping Expressions

Parentheses (that is, ( and ) ) can be used to enclose groups of characters that should be treated as a single expression.

One use for these is in combination with the * to indicate that an expression consisting of more than one character may be repeated any number of times. The expression ^(a.)*a$ will match words of odd length in which every other letter is a, such as aba, amana, and alabama.

A second use for these expressions is to describe alternatives larger than a single character, using the pipe character ( | ) between the alternatives. For instance, you could search for words in a piecemeal square with an expression like ^(ab|ar|co|gh|jo|pa|se|yn)*$ (filling in whatever parts you have available). If you're willing to type a bit more, you can use these expressions to match all possible letter groups that can get dropped out of an order takeout (the expression should contain all 25 alphabetic bigrams and all 24 trigrams).

A third use for these expressions is to repeat an unknown sequence multiple times. Backslash-numbers ( \1, \2, \3, etc.) match repetitions of expressions enclosed in ( ) earlier in the expression. The number indicates which parenthesized group (counting from the start of the regular expression) it must match. For instance, ^(..).*\1$ will match words that begin and end with the same bigram, such as church and escapes.

Note that a back-reference must match the exact sequence of characters that the original group matched, whereas a repeat indicator such as * applied to a group lets each repetition match any sequence that fits the group's expression. For instance, ^(..)\1$ matches dodo but not dome, while ^(..)*$ matches both.
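The three uses of parentheses can be compared side by side in a short Python sketch. Python's re module is assumed as a stand-in for the extended grep syntax, and the test words are chosen only for illustration. Raw strings (the r prefix) keep Python from interpreting the backslash in \1 itself.

```python
import re

def matching(pattern, words):
    """Words containing a match for pattern anywhere."""
    return [w for w in words if re.search(pattern, w)]

# 1. Repeating a multi-character group.
alternating = matching(r"^(a.)*a$", ["aba", "amana", "alabama", "abba", "banana"])

# 2. Alternatives longer than one character.
pieces = matching(r"^(ab|ar|co|gh|jo|pa|se|yn)*$", ["arco", "arch", "paseo"])

# 3. Back-reference: the same bigram at both ends.
same_ends = matching(r"^(..).*\1$", ["church", "escapes", "chance"])

# Back-reference versus plain repetition of the group.
doubled = matching(r"^(..)\1$", ["dodo", "dome"])
any_pairs = matching(r"^(..)*$", ["dodo", "dome"])

print(alternating, pieces, same_ends, doubled, any_pairs, sep="\n")
```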

Other Repeat Indicators

The question mark (?) can be used to indicate that the preceding expression is optional -- it can appear once or not at all. The plus sign (+) indicates one or more repetitions; it is similar to *, but with + the expression must appear at least once.

To indicate more precise repeat limits, you can use curly braces ( { and } ). {2,5} means that the preceding expression must appear at least twice but no more than 5 times. {,5} means at most 5 times, and {2,} means two or more times.
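The repeat indicators can be exercised with the following sketch. It assumes Python's re module, which accepts ?, +, and the curly-brace forms (including an omitted lower bound) in the same way as the extended syntax described here; the words are invented test data.

```python
import re

def matching(pattern, words):
    """Words containing a match for pattern anywhere."""
    return [w for w in words if re.search(pattern, w)]

optional = matching("^colou?r$", ["color", "colour", "colouur"])   # u once or not at all
one_or_more = matching("^l+ama$", ["ama", "lama", "llama"])        # at least one l
bounded = matching("^l{2,5}ama$",
                   ["lama", "llama", "lllama", "l" * 6 + "ama"])   # two to five l's
at_least = matching("[aeiou]{3,}", ["queue", "beautiful", "book"]) # 3+ vowels in a row
at_most = matching("^.{,5}$", ["llama", "alpaca"])                 # five characters or fewer

print(optional, one_or_more, bounded, at_least, at_most, sep="\n")
```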

Case Sensitivity

Normally, searches are case-sensitive. Often you will want to perform a case-insensitive search. For instance, ^(a.)*a$ will miss words with capital As, such as Alabama or AAA, unless you indicate that the search should not be case-sensitive. This is indicated by a switch separate from the regular expression, accessible as a separate field on the NPL search form and as the -i option to the grep command.
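In Python's re module, which is used here only as an illustrative stand-in for grep, the equivalent of the -i switch is the re.IGNORECASE flag. As with grep, it is passed alongside the expression rather than written into it. The word list is made up for the example.

```python
import re

words = ["aba", "Alabama", "AAA", "Banana"]
pattern = "^(a.)*a$"

# Default behavior: capital A does not match a.
case_sensitive = [w for w in words if re.search(pattern, w)]

# With the flag, upper and lower case are treated alike.
case_insensitive = [w for w in words if re.search(pattern, w, re.IGNORECASE)]

print(case_sensitive)
print(case_insensitive)
```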

Inverting the Search

You can perform a search such that it finds words that do not match the expression. This is also indicated by an external switch, available as a "does match/does not match" selection on the advanced version of the NPL search form and as the -v option to the grep command. An inverted search on (.).*\1 will return only words that do not contain any repeated letters.
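An inverted search amounts to keeping the words for which the match fails. The sketch below shows this in Python, assuming the re module as a substitute for grep -v; the word list is hypothetical.

```python
import re

words = ["copyright", "letter", "ambidextrously", "banana"]

# (.).*\1 matches any word in which some character occurs twice.
repeated_letter = re.compile(r"(.).*\1")

# Inverting the search keeps only the words that do NOT match.
no_repeats = [w for w in words if not repeated_letter.search(w)]
print(no_repeats)
```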

Searching Only Alphabetic Characters

Regular expressions have no inherent way to ignore non-alphabetic characters. The NPL grep tool provides this feature through an alternate word list that has been pre-filtered to remove them. Normally, all the spaces, hyphens, apostrophes, etc. in a word or phrase are included in the word list and count as characters. When you have a full enumeration you can account for such characters, but sometimes you don't.

For instance, when searching only alphabetic characters, the phrase "beat around the bush" is turned into "beataroundthebush" and then would match a crossword pattern like "^b.a.a.....t..b..h$".
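The pre-filtering step can be imitated by stripping non-letters before matching. This is a sketch of the idea only; how the NPL tool actually builds its filtered list is not stated here, and the function name letters_only is made up.

```python
import re

def letters_only(entry):
    """Remove every character that is not an ASCII letter."""
    return re.sub("[^A-Za-z]", "", entry)

phrase = "beat around the bush"
pattern = "^b.a.a.....t..b..h$"

stripped = letters_only(phrase)
raw_hit = re.search(pattern, phrase) is not None         # spaces count as characters
filtered_hit = re.search(pattern, stripped) is not None  # spaces removed first

print(stripped, raw_hit, filtered_hit)
```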

Using Multiple Searches

The advanced search page allows you to perform two searches, and return only those words which match (or don't match, as appropriate) both searches.

For instance, if you want to find words which repeat their initial bigram later in the word but have no other repeated letters, you can search for ^(..).*\1 and then, among these words, search for words which do not match ..(.).*\1. The two initial periods in the second expression match the two characters of the initial bigram, so the character enclosed by the parentheses is the third or a later character, and the back-reference matches that character again later in the word.
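Combining a normal search with an inverted one is a matter of applying both tests to each word. The following Python sketch assumes the re module in place of the search form, with a few hand-picked test words.

```python
import re

words = ["church", "escapes", "mathematics", "banana"]

# First search: the initial bigram appears again later in the word.
repeats_bigram = re.compile(r"^(..).*\1")
# Second search (inverted): a repeated character from the third position on.
later_repeat = re.compile(r"..(.).*\1")

result = [w for w in words
          if repeats_bigram.search(w) and not later_repeat.search(w)]
print(result)
```

Here mathematics passes the first search (ma appears twice) but is rejected by the second because t is repeated, while banana fails the first search outright.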

Regular expressions are not very good for doing transposals, but with two searches and some of the other options above, you can get a pretty close approximation to a transposal. To do this, in your first search, look for words that contain only the letters in your transposal. If you were looking for a transposal of "penetralia," search for ^[pentrali]*$. Then in the second search, search for words that have no repeated letters besides a and e by searching for words that do not match the regular expression ([^ae]).*\1 (the expression matches words with a repeated character other than a or e). Finally, combine this with the non-case-sensitive options on both searches, the option to search only alphabetic characters, and a word length of 10.

This isn't perfect, as it may find words which have more than two occurrences of a and/or e, and correspondingly fewer occurrences of the other of a and e, or none at all of some other letter. (On the ENABLE list, it finds planetaria, which has an extra a and no e. On other lists it finds a few other words but no exact transposals, besides, of course, penetralia itself.) Realistically, you are better off using an anagram tool, such as I, Rearrangement Servant, to perform this search, instead of a regular expression.
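The approximation and its shortcoming can be reproduced in a short Python sketch. The regular expressions are the ones given above; the exact check by letter counts is an assumption of this example about how an anagram tool might compare words, not a description of any particular tool, and the word list is invented.

```python
import re
from collections import Counter

TARGET = "penetralia"
words = ["penetralia", "planetaria", "interplead", "parallel"]

def approximate(word):
    """Regex approximation: right length, right letters, no repeats except a and e."""
    return (len(word) == len(TARGET)
            and re.search("^[pentrali]*$", word, re.IGNORECASE) is not None
            and re.search(r"([^ae]).*\1", word, re.IGNORECASE) is None)

def exact(word):
    """True transposal test: identical letter counts."""
    return Counter(word.lower()) == Counter(TARGET)

approx_hits = [w for w in words if approximate(w)]
exact_hits = [w for w in words if exact(w)]
print(approx_hits)
print(exact_hits)
```

The regular-expression version lets planetaria through, as noted above, while the letter-count comparison accepts only penetralia.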