Regular Expressions for Word Puzzles

Regular expressions are a powerful way of specifying patterns of text. Some programs, such as the word list search engine, can quickly scan through a text file and output only the lines which match the expression. When each line contains a word or dictionary phrase, as in these word lists, you can perform searches for various kinds of wordplay.

Most characters (including all the letters and numbers) just represent themselves in a regular expression, but some of them have special meanings you should be aware of. Most of the characters with special meanings can be escaped by placing a backslash (\) before them, so \. would match a period, and \\ would match a backslash. (On the other hand, some characters which normally represent themselves get special meaning when preceded by a backslash.)

The Most Common Special Characters

The period (.) is a wildcard in regular expressions; it matches any single character. The asterisk (*) is a repeat indicator; it means that any number (including zero) of repetitions of the previous expression may occur. An expression is usually a single character; l*ama would match ama, lama, and llama. These two characters are often used together (.*) to match any sequence of any length.
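As a rough illustration of the wildcard and the repeat indicator, here is a minimal sketch that uses Python's re module as a stand-in for the search engine (an assumption; the engine itself may behave differently). The helper name matching and the sample words are made up for this example.

```python
import re

def matching(pattern, words):
    # Keep each word in which the pattern matches somewhere, like a line filter.
    return [w for w in words if re.search(pattern, w)]

# The period stands for exactly one character.
wildcard = matching(r"c.t", ["cat", "cot", "cart", "cut"])

# The asterisk allows zero or more copies of the preceding l.
repeat = matching(r"l*ama", ["ama", "lama", "llama", "lemma"])

# Together, .* allows any run of characters, including an empty one.
any_run = matching(r"b.*d", ["bd", "bad", "broad", "dab"])

print(wildcard, repeat, any_run)
```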

Matching at the Beginning or End of the Text

When the caret (^) appears as the first character in a regular expression, it forces the rest of the expression to match only at the beginning of the text. Similarly, when the dollar sign ($) is the last character in the expression, it forces the rest of the expression to match only at the end of the text. These can be used together to force an expression to match the entire text.

If the example above, l*ama, was the entire regular expression, it would also match words such as lamaze and camaraderie. ^l*ama would not match camaraderie and ^l*ama$ would match only ama, lama, and llama.
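The effect of the anchors can be checked with a short sketch, again assuming Python's re module as a stand-in for the search engine. The word list is a small sample chosen for this example.

```python
import re

words = ["ama", "lama", "llama", "lamaze", "camaraderie"]

# No anchors: the pattern may match anywhere inside the word.
unanchored = [w for w in words if re.search(r"l*ama", w)]

# Caret only: the match must start at the beginning of the word.
at_start = [w for w in words if re.search(r"^l*ama", w)]

# Caret and dollar sign: the match must cover the entire word.
whole_word = [w for w in words if re.search(r"^l*ama$", w)]

print(unanchored)
print(at_start)
print(whole_word)
```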

In our search engine, these anchors are sometimes unnecessary because you can specify the length of the words to search separately from the regular expression. They are still useful in many cases, such as patterns of variable length, or words of varying length with a pattern that must appear at the beginning or end of the word.

Character Ranges

Square brackets ( [ and ] ) can be used to specify that any of a specified set of characters is valid in a particular position. You may specify all the allowed characters explicitly, such as [aeiouy] to match any vowel.

Alternatively, you can use hyphens to indicate ranges, such as [a-z] to match any lowercase letter. Place a hyphen first or last to include it among the set of characters which can be matched.

If the caret ^ appears as the first character inside the brackets, it inverts the match, so the bracketed expression will match any character except the ones inside the brackets. For instance, [^aeiouy] will match any character other than a vowel.
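The bracket forms described above can be tried out with a minimal sketch, assuming Python's re module behaves like the search engine for these simple sets. The sample words are made up for this example.

```python
import re

words = ["rhythm", "crwth", "tsk", "sky", "x-ray"]

# Explicit set: words containing at least one of a, e, i, o, u, y.
has_vowel = [w for w in words if re.search(r"[aeiouy]", w)]

# Inverted set: words made up entirely of characters other than those vowels.
no_vowels = [w for w in words if re.search(r"^[^aeiouy]*$", w)]

# Range: words made up entirely of lowercase letters.
lowercase_only = [w for w in words if re.search(r"^[a-z]*$", w)]

# Range plus a trailing hyphen, so the hyphen itself is also allowed.
with_hyphen = [w for w in words if re.search(r"^[a-z-]*$", w)]

print(has_vowel, no_vowels, lowercase_only, with_hyphen)
```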

Grouping Expressions

Backslash+parentheses (that is, \( and \) ) can be used to enclose groups of characters that should be treated as a single expression.

One use for these is in combination with the * to indicate that an expression consisting of more than one character may be repeated any number of times. The expression ^\(a.\)*a$ will match words of odd length in which every other letter is a, such as aba, amana, and alabama.
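The repeated group can be sketched in Python, with one syntax difference to keep in mind: Python's re module writes groups as ( and ) without backslashes, so ^\(a.\)*a$ becomes ^(a.)*a$. The sample words are chosen for this example.

```python
import re

# Python spelling of ^\(a.\)*a$ : any number of "a plus one character" pairs, then a final a.
every_other_a = re.compile(r"^(a.)*a$")

words = ["aba", "amana", "alabama", "banana", "aa"]
hits = [w for w in words if every_other_a.search(w)]

print(hits)
```

The two-letter string aa is rejected because each repetition of the group consumes two characters, leaving an odd total length once the final a is added.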

A second use for these expressions is to describe alternatives larger than a single character, using the backslash-pipe ( \| ) between the alternatives. For instance, you could search for words in a piecemeal square with an expression like ^\(ab\|ar\|co\|gh\|jo\|pa\|se\|yn\)*$ (filling in whatever parts you have available). If you're willing to type a bit more, you can use these expressions to match all possible letter groups that can get dropped out of an order takeout (the expression should contain all 25 alphabetic bigrams and all 24 trigrams).
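Typing out all the alphabetic bigrams and trigrams by hand is tedious, so the alternation can be generated instead. The following is a minimal sketch in Python, whose re module writes the group as ( ) and the separator as | without backslashes; swap in \( \| \) when pasting into a tool that expects the backslash forms. How the resulting group fits into a full search for a particular puzzle is left open here.

```python
import re
import string

letters = string.ascii_lowercase

# Consecutive runs of the alphabet: ab, bc, ..., yz and abc, bcd, ..., xyz.
bigrams = [letters[i:i + 2] for i in range(len(letters) - 1)]
trigrams = [letters[i:i + 3] for i in range(len(letters) - 2)]

# Trigrams are listed first so that a three-letter run is preferred over its leading bigram.
alternation = "|".join(trigrams + bigrams)
group = re.compile("(" + alternation + ")")

print(len(bigrams), len(trigrams))
print(group.search("first").group(0))
```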

A third use for these expressions is to repeat an unknown sequence multiple times. Backslash-numbers ( \1, \2, \3, etc.) match repetitions of expressions enclosed in \( \) earlier in the expression. For instance, ^\(..\).*\1$ will match words that begin and end with the same bigram, such as church and escapes.
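A back-reference can be sketched the same way in Python, where the group is written ( ) without backslashes while \1 keeps its backslash. The sample words are chosen for this example.

```python
import re

# Python spelling of ^\(..\).*\1$ : the first two letters must reappear as the last two.
same_ends = re.compile(r"^(..).*\1$")

words = ["church", "escapes", "chance", "onion"]
hits = [w for w in words if same_ends.search(w)]

print(hits)
```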

Case Sensitivity

Normally, searches are case-sensitive. Often you will want to perform a non-case-sensitive search. For instance, ^\(a.\)*a$ will miss words with capital As, such as Alabama or AAA, unless you indicate that the search should not be case-sensitive. This is indicated by a switch separate from the regular expression, accessible as a checkbox on the search form.
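In Python, the equivalent of that external switch is the re.IGNORECASE flag, which is likewise passed separately from the expression. This sketch assumes the flag behaves like the checkbox; the pattern is the Python spelling of ^\(a.\)*a$.

```python
import re

words = ["Alabama", "AAA", "aba", "Banana"]
pattern = r"^(a.)*a$"

# Case-sensitive: a lowercase a in the pattern matches only a lowercase a.
sensitive = [w for w in words if re.search(pattern, w)]

# Case-insensitive: the flag is supplied outside the expression itself.
insensitive = [w for w in words if re.search(pattern, w, re.IGNORECASE)]

print(sensitive)
print(insensitive)
```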

Inverting the Search

You can perform a search such that it finds words that do not match the expression. This is also indicated by an external switch, available as a "does match/does not match" selection on the advanced search form. An inverted search on \(.\).*\1 will return only words that do not contain any repeated letters.
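An inverted search amounts to keeping the words for which the match fails. A minimal sketch in Python, where \(.\).*\1 is written (.).*\1 and the sample words are chosen for this example:

```python
import re

# Matches any word in which some character appears at least twice.
repeated_letter = re.compile(r"(.).*\1")

words = ["copyright", "banana", "letter", "stump"]

# "Does not match": keep only the words where the search finds nothing.
no_repeats = [w for w in words if not repeated_letter.search(w)]

print(no_repeats)
```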

Other Repeat Indicators

The question mark (?) can be used to indicate that the preceding expression is optional -- it can appear once or not at all. The plus sign (+) indicates one or more repetitions; it is similar to *, but with + the expression must appear at least once.

To indicate more precise repeat limits, you can use backslash-curly braces ( \{ and \} ). \{2,5\} means that the preceding expression must appear at least twice but no more than 5 times. \{,5\} means at most 5 times, and \{2,\} means two or more times.
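The repeat indicators can be compared side by side in a short Python sketch. Python's re module writes the braces without backslashes, so \{2,5\} becomes {2,5}. The test strings are artificial, and fullmatch is used so that each pattern must cover the whole string.

```python
import re

samples = ["h", "ah", "aah", "a" * 6 + "h"]

def whole(pattern):
    # Keep the samples that the pattern matches from start to end.
    return [s for s in samples if re.fullmatch(pattern, s)]

optional = whole(r"a?h")        # zero or one a
one_or_more = whole(r"a+h")     # at least one a
two_to_five = whole(r"a{2,5}h") # between two and five a's
at_most_five = whole(r"a{,5}h") # no more than five a's
two_or_more = whole(r"a{2,}h")  # at least two a's

print(optional, one_or_more, two_to_five, at_most_five, two_or_more)
```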

Searching Only Alphabetic Characters

This is another feature that is not part of the regular expression. It is instead a treatment on the word list before searching it. Normally, all the spaces, hyphens, apostrophes, etc. in a word or phrase are included in the word list and count as characters. When you have a full enumeration you can account for such characters, but sometimes you don't.

The option to search only alphabetic characters will remove other characters from the words before comparing them against the regular expression. For instance, the phrase "beat around the bush" would be turned into "beataroundthebush" and then would be able to match a crossword pattern like "^b.a.a.....t..b..h$"
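The preprocessing step might look something like the following sketch. The function name letters_only is hypothetical, and the search engine's actual treatment of non-alphabetic characters may differ in detail (for example, with accented letters).

```python
import re

def letters_only(entry):
    # Drop everything that is not an unaccented letter (spaces, hyphens, apostrophes, ...).
    return re.sub(r"[^A-Za-z]", "", entry)

phrase = "beat around the bush"
pattern = r"^b.a.a.....t..b..h$"

stripped = letters_only(phrase)
raw_match = re.search(pattern, phrase)
stripped_match = re.search(pattern, stripped)

print(stripped)
print(raw_match is not None, stripped_match is not None)
```

The raw phrase fails because its spaces count as characters, making it longer than the 17-character pattern allows.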

Using Multiple Searches

The advanced search page allows you to perform two searches, and return only those words which match (or don't match, as appropriate) both searches.

For instance, if you want to find words which repeat their initial bigram later in the word, but have no other repeated letters, you can search for ^\(..\).*\1 and then among these words, search for words which do not match ..\(.\).*\1 (note that the initial two periods in the second expression allow it to match a letter which appears twice in a word later than the second letter).
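The combination of a matching search and a non-matching search can be sketched in Python as two filters applied to the same list. The expressions are the Python spellings of the two above (groups written without backslashes), and the sample words are chosen for this example.

```python
import re

# First search: the opening bigram appears again later in the word.
repeats_initial_bigram = re.compile(r"^(..).*\1")

# Second search (inverted): some letter after the second position appears again later.
later_repeat = re.compile(r"..(.).*\1")

words = ["church", "chitchat", "onion", "murmur", "banana"]

hits = [
    w for w in words
    if repeats_initial_bigram.search(w) and not later_repeat.search(w)
]

print(hits)
```

Here chitchat is dropped because of its repeated t, and murmur because of its repeated r; banana never passes the first search.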

Regular expressions are not very good for doing transposals, but with two searches and some of the other options above, you can get a pretty close approximation to a transposal. To do this, in your first search, look for words that contain only the letters in your transposal. If you were looking for a transposal of "penetralia," search for ^[pentrali]*$. Then in the second search, search for words that have no repeated letters besides a and e by searching for words that do not match the regular expression \([^ae]\).*\1 (the expression matches words with a repeated character other than a or e). Finally, combine this with the non-case-sensitive options on both searches, the option to search only alphabetic characters, and a word length of 10.

This isn't perfect, as it may find words which have more than two occurrences of a and/or e, and correspondingly fewer occurrences of the other of a and e, or none at all of some other letter. (On the ENABLE list, it finds planetaria, which has an extra a and only one e. On other lists it finds a few other words but no exact transposals, besides, of course, penetralia itself.)
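The approximate transposal search, and the reason it is only approximate, can be seen in a small Python sketch. The candidate list is a hand-picked sample, not the output of any real word list, and the exact check using collections.Counter is an extra step added here for comparison; it is not something the search form itself offers.

```python
import re
from collections import Counter

target = "penetralia"
candidates = ["penetralia", "planetaria", "peripteral", "terrapin"]

# Search 1: only letters that occur in the target.
only_target_letters = re.compile(r"^[pentrali]*$", re.IGNORECASE)

# Search 2 (inverted): a repeated character other than a or e.
repeat_other_than_a_e = re.compile(r"([^ae]).*\1", re.IGNORECASE)

approximate = [
    w for w in candidates
    if len(w) == 10
    and only_target_letters.search(w)
    and not repeat_other_than_a_e.search(w)
]

# Exact transposals have identical letter counts, which a regular expression cannot check.
exact = [w for w in approximate if Counter(w.lower()) == Counter(target)]

print(approximate)
print(exact)
```

In this sample, planetaria survives both searches but fails the letter-count comparison, which is the kind of near miss described above.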