So far we have been able to extract a substring by specifying its start and end locations within a string like this:
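As a reminder, here is a minimal sketch of such a call. The sample string and the names sample and firstWord are assumptions made for illustration.

```elm
module SliceExample exposing (firstWord)

-- Hypothetical sample string; any text works here.


sample : String
sample =
    "Bears. Beets. Battlestar Galactica."



-- String.slice takes a start index (inclusive), an end index (exclusive),
-- and the string to cut from.


firstWord : String
firstWord =
    String.slice 0 5 sample
```

Here firstWord evaluates to "Bears".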
What if we need something more sophisticated than this? Let’s say we want to extract the launch time of Apollo 11 (09:32 a.m.) hiding inside the following text.
It’s possible to accomplish this by using string functions, but it’s also tedious. With regular expressions it’s a piece of cake once we know how to define a pattern. Let’s install the elm/regex package, which contains functions and values for working with regular expressions. Run elm install elm/regex from the beginning-elm directory in the terminal. When asked to confirm the change, answer Y and start an elm repl session. Now import the Regex module, which resides in the elm/regex package, by entering import Regex.
Next, we’ll define a pattern for 09:32 a.m. and use various functions in the Regex module to check whether that substring exists in the text above. Don’t worry about understanding the code below yet. It’s there to give you an idea of how regular expressions work in Elm. We’ll go over it in detail after we understand the basics of regular expressions.
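The whole flow might look roughly like the sketch below. The sample sentence in apollo11 is a shortened stand-in rather than the chapter's own passage, and the module and value names are assumptions; the sketch is written as a module instead of a repl session.

```elm
module LaunchTime exposing (hasLaunchTime)

import Regex



-- Assumed sample text, shortened for illustration.


apollo11 : String
apollo11 =
    "The rocket lifted off at 09:32 a.m. EDT."



-- Two digits, a colon, two digits, a space, then "a.m." or "p.m.".


pattern : String
pattern =
    "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"



-- Fall back to a regex that never matches if the pattern is invalid.


regex : Regex.Regex
regex =
    Maybe.withDefault Regex.never (Regex.fromString pattern)


hasLaunchTime : Bool
hasLaunchTime =
    Regex.contains regex apollo11
```

With this sample text, hasLaunchTime is True.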
Regular Expression Basics
A regular expression (regex) is a pattern for matching character combinations in a string. It is an incredibly powerful tool for working with strings, yet the basic concepts behind it aren’t that complex. Before we attempt to understand the code above, let’s get familiar with the basics of regular expressions first. After that we will learn how to use them in Elm.
Note: If you already know how regular expressions work, feel free to skip to the Regular Expressions in Elm section below.
Matching a single character
In a regular expression, letters and numbers match themselves. For example, A will match A and 1 will match 1. Regular expressions are case-sensitive by default, so A won’t match a.
Matching multiple characters
To match multiple characters, we just need to repeat the character. AAA will match three As in a row and 123 will match the number 123.
Instead of repeating a character, we can use the dot character to match multiple characters like this: A.. (an A followed by two dots). The dot character matches a single character, which can be a letter, number, or any special character other than a line break. So A.. will match AAA, ABC, and so on. It will also match A^%. Punctuation characters such as . allow us to create patterns instead of explicitly specifying all characters in a substring.
The punctuation characters have special meanings in a regular expression. So if we want to match a literal dot, we must precede it with a backslash which turns off dot’s special meaning allowing regex to treat it as a regular character. Here are some examples that match a literal dot.
\. will match a literal dot.
Dr\. Strange will match Dr. Strange.
Kevin Malone, Esq\. will match Kevin Malone, Esq.
7\.67 will match 7.67
Matching a set
So far we have focused on matching a specific character. What if we want to match characters of the same kind, for example only numbers? We can use a set for that. Sets are created by wrapping the characters in square brackets. Here are some examples:
[0123456789] will match a single digit.
[aeiou] will match a single lowercase vowel.
[AEIOU] will match a single uppercase vowel.
It’s important to note that a set will match only a single character. For example, [Pp] will match a single character (either P or p), not Pp. If we want to match more than one character, we can just repeat the set like this: [Pp][Pp]. It will match all of these combinations: PP, Pp, pP, and pp.
One of the reasons regular expressions are so powerful is that we can build more and more complex patterns by progressively combining different kinds of expressions.
Using ranges to make sets more succinct
Sets don’t scale well. For example, if we want to match two lowercase letters, we will have to use this: [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]. Yikes! This is where a range comes in handy. Instead of typing each element in a set, we just define a range. We can rewrite that super long regex with ranges like this: [a-z][a-z]. All we need to do is specify the beginning and end of a sequence of characters separated by a dash. [0-9] will match any digit, and [A-Za-z0-9_] will match any word character (letter, number, or underscore). Since certain character sets are used often, regex provides a series of shorthands for representing them. Here are some examples:
\d will match any digit. It is short for [0-9].
\w will match any word character. It is short for [A-Za-z0-9_].
\s will match any whitespace character. It is roughly short for [ \t\r\n\f] (space, tab, carriage return, newline, and form feed).
\s\d will match a whitespace character followed by a digit.
[\da-f] will match a lowercase hexadecimal digit. It is short for [0-9a-f]. Here we combined a shorthand with a range. We can combine expressions however we want.
Sometimes we need to match this character or that character. We can use a vertical bar for that. Here are some examples:
X|Y will match either X or Y.
EST|PST will match either EST or PST.
Jim|Pam will match the name of one of the two biggest pranksters in Dunder Mifflin’s history: Jim or Pam.
[0-9]|[a-zA-Z] will match a digit or a letter.
Matching zero or more characters with asterisk
The asterisk (*) is probably the most powerful character in a regular expression. Like the dot character, it has a special meaning. It matches zero or more of the thing that came just before it. Let’s see some examples.
a* will match a. It will also match aa, aaaaa, and so on. Here a is the thing that came before *. So the pattern will match zero or more as. Because of that it will also match an empty string.
The asterisk doesn’t have to be at the end. We can put it anywhere we want.
We can’t put it in the front though because then there won’t be anything to repeat. Since regular expressions don’t put any limitations on how we combine expressions, we can use * with sets or ranges too.
[0-9]* will match any number of digits.
[a-z]* will match any number of lowercase letters.
[a-zA-Z0-9_]* will match any number of word characters (letter, number, or underscore).
We can even combine * with special shorthands like this:
\d* will match any number of digits.
\w* will match any number of word characters.
\s* will match any number of whitespace characters.
Mother of all regular expressions
Remember, earlier I said * is probably the most powerful character in a regular expression. There is a reason for that. If we combine it with the dot character, we can create the mother of all regular expressions: .* (a dot followed by an asterisk). It will match any number of any characters. Basically, it will match anything. That is because the dot matches a single character (doesn’t matter what character it is) and the asterisk matches any number of characters represented by the dot. If it matches everything, how can it be useful? Let’s see some examples:
.*Kramer will match the name of anyone whose last name is Kramer.
Cosmo.* will match the name of anyone whose first name is Cosmo.
.*Sacamano.* will match the name that has Sacamano somewhere in it.
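The patterns covered so far can be tried out with a small sketch like the one below. The matches helper and the sample inputs are hypothetical; the Elm functions it relies on are explained in the Regular Expressions in Elm section. Note that each backslash is doubled inside an Elm string.

```elm
module RegexBasics exposing (checks)

import Regex



-- Hypothetical helper: True when the pattern matches somewhere in the text.
-- An invalid pattern falls back to Regex.never, which matches nothing.


matches : String -> String -> Bool
matches pattern text =
    Regex.contains
        (Maybe.withDefault Regex.never (Regex.fromString pattern))
        text



-- Every entry in this list is expected to be True.


checks : List Bool
checks =
    [ matches "A.." "ABC" -- dot: any single character
    , matches "[Pp][Pp]" "pP" -- set: one character per set
    , matches "[a-z][a-z]" "ok" -- range
    , matches "\\s\\d" "room 7" -- shorthands: whitespace, then a digit
    , matches "Jim|Pam" "Pam" -- vertical bar: either name
    , matches "Cosmo.*" "Cosmo Kramer" -- dot combined with asterisk
    , not (matches "A" "a") -- case-sensitive by default
    ]
```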
That should be enough basics about regular expressions for you to feel comfortable with the examples shown below.
Note: There is so much more to regular expressions. We barely scratched the surface here. If you are interested in exploring them further, there are plenty of resources online including tools for building and testing complex regular expressions.
Regular Expressions in Elm
Now that we have the basics of regular expressions down, it’s time to understand that glorious pattern we wrote earlier for extracting the launch time of Apollo 11. Here is that code again:
Let’s go through it step-by-step.
Step 1: Start by importing the Regex module.
Regex doesn’t come preloaded with the Elm Platform. That’s why we had to install the elm/regex package separately. For the rest of the examples in this section, we will assume that Regex is already imported.
Step 2: Define a pattern that matches the time (09:32 a.m.) we’re looking for.
The following diagram illustrates how the pattern above matches 09:32 a.m..
- \ vs \\
- When we were learning the basics of regular expressions earlier, we used only one \ to either use a special shorthand or escape a dot like this: \. (a backslash followed by a dot). Why then do we need two backslashes in Elm code? That’s because \ has a special purpose in Elm: escaping other characters. When you place it before another character, it removes the second character’s special Elm meaning and lets it be a regular character instead.
If you recall from the String section earlier, when we used a double quote inside a single-line string, we had to escape it like this so the string wouldn’t end early: "Michael Scott's Rabies Awareness \"Fun Run\" Race for the Cure".
If you want to use a literal \, it needs to be escaped just like double quotes do. Put two backslashes in a row (\\) and Elm will understand that you want to use the literal \ character. If you use just one \, Elm will think you are telling it to remove the next character’s special Elm meaning.
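One way to convince ourselves that a doubled backslash in Elm source becomes a single backslash in the resulting string is to inspect the string itself. This is a small sketch; the names digitPatternLength and escapedDot are made up for illustration.

```elm
module BackslashExample exposing (digitPatternLength, escapedDot)

-- In Elm source, "\\d" is a two-character string:
-- one backslash followed by the letter d.


digitPatternLength : Int
digitPatternLength =
    String.length "\\d"



-- Listing the characters shows the single backslash and the dot
-- that the regex engine actually receives.


escapedDot : List Char
escapedDot =
    String.toList "\\."
```

Here digitPatternLength is 2, and escapedDot holds exactly two characters: a backslash and a dot.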
Step 3: Create a regular expression by passing pattern to the Regex.fromString function.
Not all strings are valid regex patterns. That’s why the fromString function returns a Maybe.
- When Elm cannot guarantee a value, it returns a data structure called Maybe. If the value is present, it’s wrapped inside Just. If it’s absent, we get Nothing. Just and Nothing are members of the Maybe type. This simple concept is at the heart of writing incredibly robust applications in Elm. The next chapter will cover Maybe in much more detail. Until then you can think of Maybe as a container that can hold at most one value.
Step 4: Extract the regular expression inside the Maybe container using the Maybe.withDefault function.
The following diagram illustrates how Maybe.withDefault works.
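To see steps 3 and 4 handle a bad pattern, consider the sketch below. The describe helper, the fallback value, and the invalid pattern "(" are assumptions chosen for illustration; Regex.never is the package's regex that never matches anything.

```elm
module SafeRegex exposing (describe, fallbackMatches)

import Regex



-- Hypothetical helper that reports whether a string is a valid regex pattern.


describe : String -> String
describe pattern =
    case Regex.fromString pattern of
        Just _ ->
            "valid"

        Nothing ->
            "invalid"



-- "(" opens a group that is never closed, so fromString returns Nothing
-- and withDefault hands back Regex.never instead.


fallback : Regex.Regex
fallback =
    Maybe.withDefault Regex.never (Regex.fromString "(")


fallbackMatches : Bool
fallbackMatches =
    Regex.contains fallback "anything at all"
```

Here describe "\\d\\d" gives "valid", describe "(" gives "invalid", and fallbackMatches is False because the fallback regex matches nothing.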
Step 5: Define the string that contains the launch time of Apollo 11 (09:32 a.m.).
Step 6: Use the Regex.contains function to check if apollo11 has any substrings that match the pattern defined in step 2.
The String module also has a function named contains, which works differently than Regex.contains. This is the second time we have encountered two functions with the same name from different modules. This is quite common in Elm. In addition to grouping similar functions, a module also acts as a namespace. Therefore, String.contains and Regex.contains are two completely separate functions.
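The difference between the two functions can be sketched as follows. The sample sentence and all value names are assumptions; String.contains looks for the exact characters it is given, while Regex.contains interprets a pattern.

```elm
module ContainsExample exposing (literalHit, literalMiss, patternHit)

import Regex



-- Assumed sample text for illustration.


sample : String
sample =
    "Liftoff at 09:32 a.m. EDT."



-- The exact characters "09:32" appear in the sample.


literalHit : Bool
literalHit =
    String.contains "09:32" sample



-- String.contains treats the pattern as plain text, and the literal
-- characters \d\d:\d\d do not appear in the sample.


literalMiss : Bool
literalMiss =
    String.contains "\\d\\d:\\d\\d" sample


timeRegex : Regex.Regex
timeRegex =
    Maybe.withDefault Regex.never (Regex.fromString "\\d\\d:\\d\\d")



-- Regex.contains interprets the same characters as a pattern.


patternHit : Bool
patternHit =
    Regex.contains timeRegex sample
```

Here literalHit is True, literalMiss is False, and patternHit is True.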
Extracting a Substring
We set out to extract the substring 09:32 a.m., but all we have done so far is verify that it exists. To extract it, we will use the Regex.find function, which is much more powerful than String.slice explained in the Substrings section.
Regex.find and Regex.contains both take the exact same arguments:
1. A regular expression pattern that represents a substring.
2. Original string where the substring is hiding.
The output is easier to read if it’s formatted like this:
Regex.find doesn’t return just the substring we’re looking for. It returns a list of records each containing information about the matches.
- For now, you can think of a record as a collection of key value pairs. We will cover it in detail later in the Record section.
As the output above shows, Regex.find returns four pieces of information about each match it found:
index: The index of the matched substring in the original string.
match: The substring we are looking for.
number: If multiple substrings are found, Regex.find labels each match with a number starting at one. The first match is labeled 1, the second is labeled 2, and so on. This number will be important when replacing all occurrences of a substring later.
submatches: Regex.find also looks for substrings that match any parenthesized subpattern included inside the original pattern. In our case, the substring “a.m.” matches the subpattern (a\\.m\\.|p\\.m\\.). All submatches are wrapped inside Just. The examples in this section don’t take advantage of submatches, so it’s safe to ignore them for now.
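The four fields can be seen together in a sketch like this one. The sample sentence is an assumption, so the index differs from the one in the chapter's own text.

```elm
module FindExample exposing (found)

import Regex



-- Assumed sample text; "09:32" starts at index 11.


sample : String
sample =
    "Liftoff at 09:32 a.m. EDT."


regex : Regex.Regex
regex =
    Maybe.withDefault Regex.never
        (Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)")



-- Regex.find returns one Match record per occurrence.


found : List Regex.Match
found =
    Regex.find regex sample
```

With this sample, found holds a single record whose index is 11, match is "09:32 a.m.", number is 1, and submatches is [ Just "a.m." ].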
We are one more step away from extracting our sneaky substring friend. We just need to figure out a way to get to that match key inside the record. We will use a function called List.map for that.
Finally, we have our substring. Well, it’s still wrapped in a list, but we will have to wait until the next section to find out how to free it from the clutches of List.
List.map creates a new list with the results of applying a provided function to every element in the original list. Just like the String.filter function we saw in the previous section, we give it an anonymous function that reads the value stored in the match key. We’ll cover it in detail in the Mapping a List section.
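A minimal sketch of that last step, assuming a made-up sentence with two times in it and a parameter named found:

```elm
module MatchTexts exposing (times)

import Regex


regex : Regex.Regex
regex =
    Maybe.withDefault Regex.never (Regex.fromString "\\d\\d:\\d\\d")



-- The anonymous function reads the match field of each record,
-- turning a list of Match records into a list of strings.


times : List String
times =
    Regex.find regex "Start at 09:32 then stop at 10:56 today."
        |> List.map (\found -> found.match)
```

Here times is [ "09:32", "10:56" ].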
Finding Multiple Occurrences of a Substring
Let’s see an example of a string that has multiple occurrences of a substring we’re looking for.
Here’s the output after some formatting:
Regex.find found four occurrences of the substring "quitter". Notice how the value for the number key is incremented for each match.
Replacing a Substring
Although George likes to brag about how good of a quitter he is, let’s make him a go-getter for once. We’ll replace all occurrences of the substring "quitter" with "go-getter" in the original string. To do that we will need to use the Regex.replace function.
Here’s how the output looks after some formatting:
The \n characters and extra spaces have been removed to make the output look nicer. In the original output, you’ll see an extra space and \n in the beginning. That’s because when we defined the string constant with multi-line syntax, we used a space and \n to make the code look nicer.
We could have just begun our string in the first line like this:
The Regex.replace function takes three arguments:
1. A regular expression that contains the pattern for matching the substring we want to replace.
2. A function that takes the match (a record like the ones Regex.find returns) as an argument and returns a replacement string. Notice how we used _ in place of a parameter name inside the anonymous function. Since our function simply returns a new substring ("go-getter") and doesn’t use the original substring ("quitter") in any form, we are not interested in the argument. By using _, we are essentially ignoring the argument.
3. The original string that contains the substring we want to replace.
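The three arguments can be sketched as follows. The sample sentence is a short stand-in for the longer quote, and the shouting value is a hypothetical variation showing a replacement function that does use its argument.

```elm
module ReplaceExample exposing (goGetter, shouting)

import Regex



-- Assumed sample text with two occurrences of "quitter".


sample : String
sample =
    "My father was a quitter, my grandfather was a quitter."


regex : Regex.Regex
regex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")



-- The replacement function ignores the match and always returns "go-getter".


goGetter : String
goGetter =
    Regex.replace regex (\_ -> "go-getter") sample



-- A replacement function can also read the match record it receives.


shouting : String
shouting =
    Regex.replace regex (\found -> String.toUpper found.match) sample
```

Here goGetter is "My father was a go-getter, my grandfather was a go-getter." and shouting keeps the sentence but upper-cases each "quitter".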
Splitting a String
In the Splitting a String section earlier, we learned how to split a string by using the String.split function. That function is limited to splitting a string based only on a separator. What if we need to split a string using a complex pattern? For example, the following string contains two instances of time: 09:32 and 10:56.
Let’s say we want to split it right where the times appear into three substrings like this:
Not sure why we would arbitrarily split a string like that, but if we do we will need something more flexible than String.split. As it so happens, the Regex module also provides a function called split, which uses a regular expression pattern to break a string into a list of substrings. Let’s try it out.
Here’s the output after some formatting:
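As a rough illustration of Regex.split, the sketch below breaks a made-up sentence wherever a time appears. The sentence and the name pieces are assumptions, not the chapter's own example.

```elm
module SplitExample exposing (pieces)

import Regex


timeRegex : Regex.Regex
timeRegex =
    Maybe.withDefault Regex.never (Regex.fromString "\\d\\d:\\d\\d")



-- The matched times act as separators and are dropped from the result.


pieces : List String
pieces =
    Regex.split timeRegex "Start at 09:32 then stop at 10:56 today."
```

Here pieces is [ "Start at ", " then stop at ", " today." ].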
Removing Case Sensitivity
Regular expressions are case-sensitive by default. If we want to make them case-insensitive, we need to use the Regex.fromStringWith function to create a regex. Let’s try it out.
Regex.fromStringWith takes an options record that not only lets us specify case sensitivity but also the multiline status.
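A minimal sketch of the options record in use, assuming the pattern "quitter" and an upper-case sample input:

```elm
module CaseInsensitive exposing (insensitiveHit, sensitiveHit)

import Regex



-- The options record has two Bool fields: caseInsensitive and multiline.


insensitive : Regex.Regex
insensitive =
    Maybe.withDefault Regex.never
        (Regex.fromStringWith
            { caseInsensitive = True, multiline = False }
            "quitter"
        )



-- Regex.fromString keeps the default, case-sensitive behavior.


sensitive : Regex.Regex
sensitive =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")


insensitiveHit : Bool
insensitiveHit =
    Regex.contains insensitive "QUITTER"


sensitiveHit : Bool
sensitiveHit =
    Regex.contains sensitive "QUITTER"
```

Here insensitiveHit is True while sensitiveHit is False.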
We covered almost everything in the Regex module, but you can still check out its full documentation to learn more.