So far we have been able to extract a substring by specifying its start and end locations within a string. What if we need something more sophisticated than this? Let’s say we want to extract the launch time of Apollo 11 (09:32 a.m.) hiding inside the following text.
It’s possible to accomplish this by using string functions, but it’s also tedious. With a regular expression, it’s a piece of cake once we know how to define a pattern.
A regular expression (regex) is a pattern for matching character combinations in a string. It is an incredibly powerful tool for working with strings, yet the basic concepts behind it aren’t that complex. Before we attempt to understand the code above, let’s get familiar with the basics of regex first. After that we will come back to using them in Elm.
If you already know how regular expressions work, feel free to skip the Regular Expression Basics section below.
Regular Expression Basics
Matching a single character
In a regex, letters and numbers match themselves. For example, A will match A and 1 will match 1. Regex is case-sensitive, so A won’t match a.
Matching multiple characters
To match multiple characters, we just repeat the character. AAA will match three A’s in a row and 123 will match the number 123.
Instead of repeating a character, we can use the dot character to match multiple characters like this: A.. (an A followed by two dots). The dot character matches a single character. A character can be a letter, number or any special character. So A.. will match ABC, and so on. It will also match A^%, and so on. Punctuation characters such as the dot allow us to create patterns instead of explicitly specifying all characters in a substring.
The punctuation characters have special meanings in regex. So if we want to match a literal dot, we must precede it with a backslash, which turns off the dot’s special meaning and lets regex treat it as a regular character. Here are some examples that match a literal dot.
\. will match a literal dot.
Dr\. Strange will match Dr. Strange.
Kevin Malone, Esq\. will match Kevin Malone, Esq.
7\.67 will match 7.67
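To see the difference between a plain dot and an escaped one, here is a minimal sketch. It assumes the Elm 0.18 Regex API used later in this section (the regex and Regex.contains functions); the module name and sample strings are made up for illustration. The doubled backslash is Elm string syntax, which is explained further down in Regular Expressions in Elm.

```elm
module LiteralDot exposing (..)

import Regex exposing (regex)


-- An unescaped dot matches any single character, so "7x67" passes.
anyCharacter : Bool
anyCharacter =
    Regex.contains (regex "7.67") "7x67"
-- True


-- An escaped dot matches only a literal dot, so "7x67" fails.
literalDot : Bool
literalDot =
    Regex.contains (regex "7\\.67") "7x67"
-- False
```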
Matching a set
So far we have focused on matching a specific character. What if we want to match characters of the same kind, for example only numbers? We can use a set for that. Sets are created by wrapping the characters in square brackets. Here are some examples:
[0123456789] will match a single digit.
[aeiou] will match a single lowercase vowel.
[AEIOU] will match a single uppercase vowel.
It’s important to note that a set will match only a single character. For example, [Pp] will match a single character (either P or p), but not Pp. If we want to match more than one character, we can just repeat the set like this: [Pp][Pp] and it will match all of these combinations: PP, Pp, pP, and pp.
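Here is a small sketch of the repeated set in action, assuming the Elm 0.18 Regex API covered later in this section. The names twoPs and results are hypothetical, and "Pq" is added only to show a string that doesn’t match.

```elm
module Sets exposing (..)

import Regex exposing (Regex, regex)


-- Two characters in a row, each of which is either "P" or "p".
twoPs : Regex
twoPs =
    regex "[Pp][Pp]"


-- Check each candidate string against the pattern.
results : List Bool
results =
    List.map (Regex.contains twoPs) [ "PP", "Pp", "pP", "pp", "Pq" ]
-- [ True, True, True, True, False ]
```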
One of the reasons regexes are so powerful is that we can build more and more complex patterns by progressively combining different kinds of expressions.
Using ranges to make sets more succinct
Sets don’t scale well. For example, if we want to match two lowercase letters, we will have to use this: [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]. Yikes! This is where a range comes in handy. Instead of typing each element in a set, we just define a range. We can rewrite that super long regex with ranges like this: [a-z][a-z]. All we need to do is specify the beginning and end of a sequence of characters separated by a dash. [0-9] will match any digit. And [A-Za-z0-9_] will match any word character (letter, number, or underscore). Since certain character sets are used often, regex provides a series of shorthands for representing them. Here are some examples:
\d will match any digit. It is short for [0-9].
\w will match any word character. It is short for [A-Za-z0-9_].
\s will match any whitespace character. It is short for a set of whitespace characters such as space, tab, and newline.
\s\d will match a whitespace character followed by a digit.
[\da-f] will match a lowercase hexadecimal digit. It is short for [0-9a-f]. Here we combined a shorthand with a range. We can combine expressions however we want.
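The sketch below tries a range, a shorthand inside a set, and two shorthands in sequence. It assumes the Elm 0.18 Regex API introduced later in this section; all names and sample strings are illustrative. Backslashes are doubled because the patterns are written as Elm strings.

```elm
module Ranges exposing (..)

import Regex exposing (Regex, regex)


-- Two lowercase letters in a row.
twoLower : Regex
twoLower =
    regex "[a-z][a-z]"


twoLowerResults : List Bool
twoLowerResults =
    List.map (Regex.contains twoLower) [ "ok", "OK", "o1k" ]
-- [ True, False, False ]


-- A shorthand combined with a range: a lowercase hexadecimal digit.
hexDigit : Regex
hexDigit =
    regex "[\\da-f]"


hexResults : List Bool
hexResults =
    List.map (Regex.contains hexDigit) [ "7", "c", "g" ]
-- [ True, True, False ]


-- A whitespace character followed by a digit, as in "Apollo 11".
spaceThenDigit : Bool
spaceThenDigit =
    Regex.contains (regex "\\s\\d") "Apollo 11"
-- True
```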
Sometimes we need to match this pattern or that pattern. We can use a vertical bar for that. Here are some examples:
X|Y will match either X or Y.
EST|PST will match either EST or PST.
Jim|Pam will match the name of one of the two biggest pranksters, Jim or Pam.
[0-9]|[a-zA-Z] will match a digit or a letter.
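A quick sketch of the vertical bar, again assuming the Elm 0.18 Regex API covered later in this section. The sample strings are invented for the example.

```elm
module Alternation exposing (..)

import Regex exposing (Regex, regex)


-- Matches either "EST" or "PST".
timeZone : Regex
timeZone =
    regex "EST|PST"


-- "CST" is neither alternative, so the last string fails.
results : List Bool
results =
    List.map (Regex.contains timeZone) [ "9 a.m. EST", "6 a.m. PST", "8 a.m. CST" ]
-- [ True, True, False ]
```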
Matching zero or more characters with asterisk
The asterisk (*) is probably the most powerful character in regex. Like the dot character, it has a special meaning. It matches zero or more of the thing that came just before it. Let’s see some examples. a* will match a. It will also match aa, aaaaa, and so on and so forth. Here “a” is the thing that came before *. So the pattern will match zero or more a’s. Because of that it will also match an empty string.
The asterisk doesn’t have to be at the end. We can put it anywhere we want.
We can’t put it at the front though, because then there won’t be anything to repeat. Since regex doesn’t put any limitations on how we combine expressions, we can use * with sets or ranges like this:
[0-9]* will match any number of digits
[a-z]* will match any number of lowercase letters
[a-zA-Z0-9_]* will match any number of word characters (letter, number, or underscore)
We can even combine * with special shorthands like this:
\d* will match any number of digits
\w* will match any number of word characters
\s* will match any number of whitespace characters
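Here is a minimal sketch of the asterisk, placed in the middle of a pattern and at the end of one. It assumes the Elm 0.18 Regex API covered later in this section; the pattern ab*c and the sample strings are made up for illustration.

```elm
module Asterisk exposing (..)

import Regex exposing (Regex, regex)


-- "a", then zero or more "b"s, then "c".
pattern : Regex
pattern =
    regex "ab*c"


-- "adc" fails because "d" sits where only "b"s are allowed.
results : List Bool
results =
    List.map (Regex.contains pattern) [ "ac", "abc", "abbbc", "adc" ]
-- [ True, True, True, False ]


-- Zero "a"s is still a match, so even an empty string passes.
matchesEmpty : Bool
matchesEmpty =
    Regex.contains (regex "a*") ""
-- True
```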
Mother of all regexes
Remember, earlier I said * is probably the most powerful character in regex. There is a reason for that. If we combine it with the dot character, we can create the mother of all regexes: .* (a dot followed by an asterisk). It will match any number of any characters. Basically, it will match anything (in most regex flavors, anything except a line break). That is because the dot matches a single character (it doesn’t matter which character) and the asterisk repeats whatever the dot represents any number of times. If it matches everything, how can it be useful? Let’s see some examples:
.*Kramer will match the name of anyone whose last name is Kramer.
Cosmo.* will match the name of anyone whose first name is Cosmo.
.*Sacamano.* will match the name that has Sacamano somewhere in it.
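To see how much text .* swallows, the sketch below prints what each pattern actually matched. It assumes the Elm 0.18 Regex API; Regex.find and the match field are covered later in this section, and the helper matchesOf and the sample names are hypothetical.

```elm
module MotherOfAll exposing (..)

import Regex exposing (regex)


-- Return the text of the first match of a pattern, wrapped in a list.
matchesOf : String -> String -> List String
matchesOf pattern text =
    List.map .match (Regex.find (Regex.AtMost 1) (regex pattern) text)


-- "Cosmo" followed by whatever comes after it.
firstName : List String
firstName =
    matchesOf "Cosmo.*" "Cosmo Kramer"
-- [ "Cosmo Kramer" ]


-- Anything, then "Sacamano", then anything.
middle : List String
middle =
    matchesOf ".*Sacamano.*" "Bob Sacamano Jr."
-- [ "Bob Sacamano Jr." ]
```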
That should be enough basics about regular expressions for you to feel comfortable with the examples shown below. There is so much more to regular expressions; we have barely scratched the surface. Multiple books have been written on the subject. There are also plenty of online tools for building and testing complex regexes.
Regular Expressions in Elm
Now that we have the basics of regular expressions down, it’s time to understand that glorious pattern we wrote for extracting the launch time of Apollo 11 at the beginning of this section. Here is that code again:
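The original listing isn’t reproduced here, so what follows is a reconstruction rather than the exact code. It assumes the Elm 0.18 Regex API (regex and Regex.contains), reuses the subpattern quoted later in this section, and uses a made-up sample text that contains the launch time. It is written as a module rather than a repl session, and the names launchTime, apollo11, and hasLaunchTime are hypothetical.

```elm
module Apollo exposing (..)

import Regex exposing (Regex, regex)


-- Two digits, a colon, two digits, a space, then "a.m." or "p.m.".
launchTime : Regex
launchTime =
    regex "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"


-- Sample text (an assumption), defined with the multi-line string syntax.
apollo11 : String
apollo11 =
    """
Apollo 11 lifted off from Kennedy Space Center
at 09:32 a.m. EDT on July 16, 1969.
"""


-- Does any substring of the text match the pattern?
hasLaunchTime : Bool
hasLaunchTime =
    Regex.contains launchTime apollo11
-- True
```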
We start by importing the Regex module. Like the String module, it too comes preloaded with the Elm Platform, but doesn’t get loaded into the repl automatically. For the rest of the examples in this section, we will assume that the Regex module is already imported.
Next we use the regex function to define a pattern that will match the time (09:32 a.m.) we’re looking for. The string where the time is hidden is defined with a multi-line syntax. Finally, we use the contains function from the Regex module (not the one from the String module) to check if any substring in the string matches our pattern.
This is the second time we have encountered two functions with the same name from different modules. This is quite common in Elm. In addition to grouping similar functions, modules also act as namespaces. Therefore, String.contains and Regex.contains are two completely separate functions.
The pattern we used as an argument to the regex function looks confusing, doesn’t it? When we were learning the basics of regular expressions in the previous section, we used only one backslash (\) to either use a special shorthand or escape a dot like this: \. Why then do we need two backslashes in Elm code?
The backslash has a special purpose in Elm: to escape other characters. When you place it before another character, it removes the second character’s special Elm meaning and lets it be a regular character instead. If you recall from the String section, when we used a double quote inside a single-line string, we had to escape it like this so the string wouldn’t end early: "Michael Scott's Rabies Awareness \"Fun Run\" Race for the Cure". If you want to use a literal backslash, it needs to be escaped just like double quotes do. Put two backslashes (\\) in a row and Elm will understand that you want to use the literal \ character. If you use just one \, Elm will think you are telling it to remove the next character’s special Elm meaning.
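A tiny sketch can make the doubling less mysterious: Elm reads each \\ in a string literal as one backslash, so the regex engine only ever sees a single one. This assumes the Elm 0.18 Regex API; the names below are hypothetical.

```elm
module Backslashes exposing (..)

import Regex exposing (regex)


-- Written with four backslashes in total, but the string Elm builds
-- is \d\d, which is four characters long.
patternLength : Int
patternLength =
    String.length "\\d\\d"
-- 4


-- The regex engine receives \d\d and looks for two digits in a row.
hasTwoDigits : Bool
hasTwoDigits =
    Regex.contains (regex "\\d\\d") "Apollo 11"
-- True
```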
Extracting a Substring
We set out to extract the substring 09:32 a.m., but all we have done so far is verify that it exists. To extract it we will use the Regex.find function, which is much more powerful than the slice function from the String module. find takes three parameters:
The number of occurrences of the substring we want to find. For example, All will find all occurrences, whereas AtMost 2 will find at most two occurrences.
Regular expression pattern representing the substring.
Original string where the substring is hiding.
Let’s use the find function to match 09:32 a.m.
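Here is a sketch of that call, assuming the Elm 0.18 signature of Regex.find (how many, pattern, string). The sample string is a shortened, single-line stand-in for the original text, so the index in the result applies to this string only.

```elm
module FindLaunchTime exposing (..)

import Regex exposing (Match, Regex, regex)


launchTime : Regex
launchTime =
    regex "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"


-- Shortened sample text (an assumption, not the original listing).
apollo11 : String
apollo11 =
    "Apollo 11 lifted off at 09:32 a.m. EDT on July 16, 1969."


-- Ask for at most one match.
found : List Match
found =
    Regex.find (Regex.AtMost 1) launchTime apollo11
-- [ { match = "09:32 a.m."
--   , submatches = [ Just "a.m." ]
--   , index = 24
--   , number = 1
--   }
-- ]
```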
The output is a little hard to read. So let me format it to make it look nicer.
find doesn’t return just the substring we’re looking for. It returns a list of records containing information about the matches.
For now, you can think of a record as a collection of key value pairs. We will cover it in detail in the Record section.
As the output above shows, find returns four pieces of information about each match it found:
1. The substring we are looking for.
2. Submatches - find also looks for substrings that match any subpattern included inside the original pattern. In our case, the substring “a.m.” matches the subpattern (a\\.m\\.|p\\.m\\.). All submatches are wrapped inside Just. Don’t worry about Just for now; we’ll cover it later. The examples in this section don’t take advantage of submatches, so it’s safe to ignore them too for now.
3. The index of the matched substring in the original string.
4. Number - if find matches multiple substrings, it labels each match with a number starting at one. The first match is labeled 1, the second is labeled 2, and so on and so forth. This number will be important when replacing all occurrences of a substring later.
We are one more step away from extracting our sneaky substring friend. We just need to figure out a way to get to that match key inside the record. We will use a function called List.map for that.
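A minimal sketch of that step, assuming the Elm 0.18 Regex API and the same shortened sample string as before (both the string and the names are illustrative). The anonymous function reads the match field from each record that find returns.

```elm
module ExtractMatch exposing (..)

import Regex exposing (regex)


-- Shortened sample text (an assumption, not the original listing).
apollo11 : String
apollo11 =
    "Apollo 11 lifted off at 09:32 a.m. EDT on July 16, 1969."


-- Keep only the matched text from each record.
launchTimes : List String
launchTimes =
    List.map (\found -> found.match)
        (Regex.find (Regex.AtMost 1)
            (regex "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)")
            apollo11
        )
-- [ "09:32 a.m." ]
```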
Finally, we have our substring. Well, it’s still wrapped in a list, but we will have to wait until the next section to find out how to free it from the clutches of List.
List.map creates a new list with the results of applying a provided function to every element in the original list. As with the String.filter function we saw in the previous section, we give it an anonymous function, one that reads the value stored in the match key. We’ll cover it in detail in the Mapping a List section.
Finding multiple occurrences of a substring
Let’s see an example of a string that has multiple occurrences of a substring we’re looking for.
Here’s the output after some formatting:
This time we provided Regex.All as the first argument to find because we are looking for all occurrences of the substring “quitter”. It found four of those. Notice how the value for the number key is incremented for each match.
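The behaviour described above can be sketched as follows, assuming the Elm 0.18 Regex API. The string is a stand-in that paraphrases George’s speech, not the original listing, and it only collects the number field so the incrementing labels are easy to see.

```elm
module Quitters exposing (..)

import Regex exposing (regex)


-- Stand-in text with four occurrences of "quitter"
-- (the one inside "quitters" counts too).
george : String
george =
    "I'm a great quitter. I come from a long line of quitters. My father was a quitter, my grandfather was a quitter."


-- The label find gives to each match, in order.
matchNumbers : List Int
matchNumbers =
    List.map .number (Regex.find Regex.All (regex "quitter") george)
-- [ 1, 2, 3, 4 ]
```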
Replacing a Substring
Although George likes to brag about how good of a quitter he is, let’s make him a go-getter for once. We’ll replace all occurrences of the substring “quitter” with “go-getter” in the original string. To do that we will use the Regex.replace function.
Here’s how the output looks after some formatting:
I removed the \n characters and extra spaces to make the output look nicer. You will also see an extra space and \n character at the beginning of the resulting string. That’s because when we defined the string constant with multi-line syntax, we used a space and \n to make the code look nicer.
We could just as well have begun our string on the first line like this:
The replace function takes four arguments:
How many matches to replace. We passed Regex.All because we want to replace all occurrences of “quitter”. If we want to replace just the first occurrence, we can pass Regex.AtMost 1.
Pattern for matching the substring we want to replace.
A function that takes the match (the same kind of record find returns) as an argument and returns a replacement. Notice how we used _ in place of a parameter name in our anonymous function. Since our function simply returns a completely new substring (“go-getter”) and doesn’t attempt to modify the original substring (“quitter”) in any form, we are not interested in what’s passed to our function. By using _, we are essentially ignoring the argument.
The original string that contains the substring we want to replace.
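Putting the four arguments together, a sketch of the call might look like this. It assumes the Elm 0.18 signature of Regex.replace and uses a shortened stand-in string rather than the original listing.

```elm
module GoGetter exposing (..)

import Regex exposing (regex)


-- Shortened stand-in text (an assumption).
george : String
george =
    "My father was a quitter, my grandfather was a quitter."


-- Replace every occurrence, ignoring the match passed to the function.
allReplaced : String
allReplaced =
    Regex.replace Regex.All (regex "quitter") (\_ -> "go-getter") george
-- "My father was a go-getter, my grandfather was a go-getter."


-- Replace only the first occurrence.
firstReplaced : String
firstReplaced =
    Regex.replace (Regex.AtMost 1) (regex "quitter") (\_ -> "go-getter") george
-- "My father was a go-getter, my grandfather was a quitter."
```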
Splitting a String
We already know how to split a string using the String.split function. But that function is limited to splitting a string on a fixed separator. What if we need to split a string using a complex pattern? For example, the following string contains two instances of time: 09:32 and 10:56.
Let’s say we want to split it right where the times appear into three substrings like this:
Not sure why we would arbitrarily split a string like that, but if we do, we will need something more flexible than String.split. As it so happens, the Regex module also provides a function called split which uses a pattern to break a string into a list of substrings.
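Here is a sketch of splitting on the times, assuming the Elm 0.18 signature of Regex.split (how many, pattern, string). The sample string and the pattern are illustrative stand-ins for the original listing; the pattern deliberately has no parentheses, so only the text between the times ends up in the result.

```elm
module SplitOnTimes exposing (..)

import Regex exposing (regex)


-- Sample text containing two times (an assumption).
timeline : String
timeline =
    "Launch was at 09:32 and the first step was at 10:56 on the Moon."


-- Break the string wherever two digits, a colon, and two digits appear.
pieces : List String
pieces =
    Regex.split Regex.All (regex "\\d\\d:\\d\\d") timeline
-- [ "Launch was at ", " and the first step was at ", " on the Moon." ]
```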
Here’s the output after some formatting:
Removing case sensitivity
Regular expressions are case-sensitive by default. Here’s an example:
We can turn this behavior off by running our pattern through the Regex.caseInsensitive function.
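A short sketch of both behaviors, assuming the Elm 0.18 Regex API, where Regex.caseInsensitive takes a pattern and returns a new one that ignores case. The names and the sample string are made up for the example.

```elm
module IgnoreCase exposing (..)

import Regex exposing (regex)


-- By default the lowercase pattern does not match "Apollo".
caseSensitive : Bool
caseSensitive =
    Regex.contains (regex "apollo") "Apollo 11"
-- False


-- Wrapping the pattern makes it ignore case.
caseIgnored : Bool
caseIgnored =
    Regex.contains (Regex.caseInsensitive (regex "apollo")) "Apollo 11"
-- True
```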
We covered almost everything in the Regex module, but you can check out the full documentation here.