So far we have been able to extract a substring by specifying its start and end locations within a string like this:
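As a quick reminder, here is a minimal sketch of that approach. The string and the name firstWord are illustrative choices made for this sketch.

```elm
-- String.slice takes a start index (inclusive), an end index (exclusive),
-- and the string to cut from.
firstWord =
    String.slice 0 5 "Bears. Beets. Battlestar Galactica."

-- firstWord is "Bears"
```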
What if we need something more sophisticated than this? Let’s say we want to extract the launch time of Apollo 11 (09:32 a.m.) hiding inside the following text.
It’s possible to accomplish this by using string functions, but it’s also tedious. With regular expressions it’s a piece of cake once we know how to define a pattern. Let’s install the elm/regex package, which contains functions and values for working with regular expressions. Run elm install elm/regex from the beginning-elm directory in the terminal.
Answer Y and start an elm repl session. Now import the Regex module, which resides in the elm/regex package.
Next, we’ll define a pattern for 09:32 a.m. and use various functions in the Regex module to check whether that substring exists in the text above. Don’t worry about understanding the code below yet. It’s there to give you an idea of how regular expressions work in Elm. We’ll go over it in detail after we understand the basics of regular expressions.
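Here is a rough sketch of what that code can look like. The text stored in apollo11 is a stand-in written for this sketch, and hasLaunchTime is a name chosen here for illustration.

```elm
import Regex

-- Pattern for a time such as 09:32 a.m. or 10:56 p.m.
pattern =
    "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

-- Regex.fromString returns a Maybe, so we fall back to a regex
-- that never matches if the pattern turns out to be invalid.
regex =
    Maybe.withDefault Regex.never (Regex.fromString pattern)

-- Stand-in text (an assumption made for this sketch).
apollo11 =
    "On July 16, 1969, Apollo 11 lifted off from Kennedy Space Center at 09:32 a.m. EDT."

-- True when the text contains a substring that matches the pattern.
hasLaunchTime =
    Regex.contains regex apollo11
```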
Regular Expression Basics
A regular expression (regex) is a pattern for matching character combinations in a string. It is an incredibly powerful tool for working with strings, yet the basic concepts behind it aren’t that complex. Before we attempt to understand the code above, let’s get familiar with the basics of regular expressions first. After that we will learn how to use them in Elm.
Note: If you already know how regular expressions work, feel free to skip to the Regular Expressions in Elm section below.
Matching a single character
In a regular expression, letters and numbers match themselves. For example, A will match A and 1 will match 1. Regular expressions are case-sensitive by default, so A won’t match a.
Matching multiple characters
To match multiple characters, we just need to write them one after another. AAA will match three As in a row and 123 will match the number 123.
Dot character
Instead of spelling out every character, we can use the dot character as a placeholder, like this: A.. (an A followed by two dots). The dot character matches a single character. A character can be a letter, number, or any special character. So A.. will match AAA, ABB, ACC, ABC, and so on. It will also match A12, A$3, and A^%. Punctuation characters such as the dot allow us to create patterns instead of explicitly specifying all characters in a substring.
Some punctuation characters have special meanings in a regular expression. So if we want to match a literal dot, we must precede it with a backslash, which turns off the dot’s special meaning and lets regex treat it as a regular character. Here are some examples that match a literal dot.
- \. will match a literal dot.
- Dr\. Strange will match Dr. Strange.
- Kevin Malone, Esq\. will match Kevin Malone, Esq.
- 7\.67 will match 7.67.
Matching a set
So far we have focused on matching a specific character. What if we want to match characters of the same kind, for example only numbers? We can use a set for that. Sets are created by wrapping the characters in square brackets. Here are some examples:
- [0123456789] will match a single digit.
- [aeiou] will match a single lowercase vowel.
- [AEIOU] will match a single uppercase vowel.
It’s important to note that a set will match only a single character. For example, [Pp] will match a single character (either P or p), not PP, pp, or Pp. If we want to match more than one character, we can just repeat the set like this: [Pp][Pp]. It will match all of these combinations: PP, Pp, pp, pP.
One of the reasons why regular expressions are so powerful is that we can build more and more complex patterns by progressively combining different kinds of expressions.
Using ranges to make sets more succinct
Sets don’t scale well. For example, if we want to match two lowercase letters, we will have to use this: [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]. Yikes! This is where a range comes in handy. Instead of typing each element in a set, we just define a range. We can rewrite that super long regex with ranges like this: [a-z][a-z]. All we need to do is specify the beginning and end of a sequence of characters separated by a dash.
Similarly, [0-9] will match any digit. And [A-Za-z0-9_] will match any word character (letter, number, or underscore). Since certain character sets are used often, regex provides a series of shorthands for representing them. Here are some examples:
- \d will match any digit. It is short for [0-9].
- \w will match any word character. It is short for [A-Za-z0-9_].
- \s will match any whitespace character. It is roughly short for [ \t\r\n\f] (note the space at the start of the set).
- \s\d will match a whitespace character followed by a digit.
- [\da-f] will match a lowercase hexadecimal digit. It is short for [0-9a-f]. Here we combined a shorthand with a range. We can combine expressions however we want.
Matching alternatives
Sometimes we need to match this character or that character. We can use a vertical bar for that. Here are some examples:
- X|Y will match either X or Y.
- EST|PST will match either EST or PST.
- Jim|Pam will match the name of one of the two biggest pranksters in Dunder Mifflin’s history: Jim or Pam.
- am|a\.m\.|pm|p\.m\. will match am, a.m., pm, or p.m.
- [0-9]|[a-zA-Z] will match a digit or a letter.
Matching zero or more characters with asterisk
Asterisk (*) is probably the most powerful character in a regular expression. Like the dot character, it has a special meaning. It matches zero or more of the thing that came just before it. Let’s see some examples.
- a* will match a. It will also match aa, aaa, aaaa, aaaaa, and so on. Here a is the thing that came before *. So the pattern will match zero or more as. Because of that it will also match an empty string.
- ab* will match a, ab, abb, abbb, abbbb, etc.
The asterisk doesn’t have to be at the end. We can put it anywhere we want.
- a*b will match b, ab, aab, aaab, aaaab, etc.
- Cree*d will match Cred, Creed, Creeed, Creeeed, Creeeeed, etc.
We can’t put it in the front though because then there won’t be anything to repeat. Since regular expressions don’t put any limitations on how we combine expressions, we can use * with sets or ranges too.
- [0-9]* will match any number of digits.
- [a-z]* will match any number of lowercase letters.
- [a-zA-Z0-9_]* will match any number of word characters (letter, number, or underscore).
We can even combine * with special shorthands like this:
- \d* will match any number of digits.
- \w* will match any number of word characters.
- \s* will match any number of whitespace characters.
Mother of all regular expressions
Remember, earlier I said * is probably the most powerful character in a regular expression. There is a reason for that. If we combine it with the dot character, we can create the mother of all regular expressions: .* (a dot followed by an asterisk). It will match any number of any characters. Basically, it will match anything on a line. That is because the dot matches a single character (it doesn’t matter which character it is) and the asterisk repeats the dot any number of times. If it matches everything, how can it be useful? Let’s see some examples:
- .*Kramer will match the name of anyone whose last name is Kramer.
- Cosmo.* will match the name of anyone whose first name is Cosmo.
- .*Sacamano.* will match any name that has Sacamano somewhere in it.
That should be enough basics about regular expressions for you to feel comfortable with the examples shown below.
Note: There is so much more to regular expressions. We barely scratched the surface here. If you are interested in exploring them further, there are plenty of resources online including tools for building and testing complex regular expressions.
Regular Expressions in Elm
Now that we have the basics of regular expressions down, it’s time to understand that glorious pattern we wrote earlier for extracting the launch time of Apollo 11. Here is that code again:
Let’s go through it step-by-step.
Step 1: Start by importing the Regex module.
Unlike the String module, Regex doesn’t come preloaded with the Elm Platform. That’s why we had to install the elm/regex package separately. For the rest of the examples in this section, we will assume that Regex is already imported.
Step 2: Define a pattern that matches the time (09:32 a.m.) we’re looking for.
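A sketch of such a pattern is shown here. The name pattern is the one used in the later steps, and the parenthesized part agrees with the subpattern quoted further down in this section.

```elm
-- Two digits, a colon, two digits, a space, then either a.m. or p.m.
-- Every backslash is doubled because the pattern lives inside an Elm string.
pattern =
    "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"
```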
The following diagram illustrates how the pattern above matches 09:32 a.m.
- \ vs \\
- When we were learning the basics of regular expressions earlier, we used only one \ to either use a special shorthand (\d) or escape a dot (\.). Why then do we need two \s in Elm code? That’s because \ has a special purpose in Elm: escaping other characters. When you place it before another character, it removes the second character’s special Elm meaning and lets it be a regular character instead.
- If you recall from the String section earlier, when we used a double quote inside a single-line string, we had to escape it like this so the string wouldn’t end early: "Michael Scott's Rabies Awareness \"Fun Run\" Race for the Cure".
- If you want to use a literal \, it needs to be escaped just like double quotes do. Put two \ characters in a row (\\) and Elm will understand that you want to use the literal \ character. If you use just one \, Elm will think you are telling it to remove the next character’s special Elm meaning.
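One way to convince ourselves of this is to count characters. In the small sketch below (the names are made up for illustration), the Elm source contains two backslashes, but the resulting string holds only one.

```elm
-- In Elm source, "\\d" is a two-character string: a backslash followed by d.
shorthand =
    "\\d"

shorthandLength =
    String.length shorthand

-- An escaped double quote is a single character too.
quoteLength =
    String.length "\""

-- shorthandLength is 2 and quoteLength is 1
```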
Step 3: Create a regular expression by passing pattern to the Regex.fromString function.
Not all strings are valid regex patterns. That’s why the fromString function returns a Maybe value.
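As a sketch, here is one pattern that should parse and one that should not. The names maybeRegex and maybeBroken are chosen for this illustration, and the broken pattern is our own example of an invalid regex.

```elm
import Regex

-- A well-formed pattern: the result is a regex wrapped in Just.
maybeRegex =
    Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

-- An unclosed set is not a valid pattern, so this is expected to be Nothing.
maybeBroken =
    Regex.fromString "[0-9"
```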
- Maybe
- When Elm cannot guarantee a value, it returns a data structure called Maybe. If the value is present, it’s wrapped inside Just, otherwise Nothing is returned. Just and Nothing are members of the Maybe type. This simple concept is at the heart of writing incredibly robust applications in Elm. The next chapter will cover Maybe in much more detail. Until then you can think of Maybe as a container that can hold at most one value.
Step 4: Extract the regular expression from inside the Maybe container using the Maybe.withDefault function.
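A minimal sketch of this step follows. Using Regex.never (a regex that never matches anything) as the fallback is an assumption of this sketch, and any other Regex value would do. The two number examples are there only to show how Maybe.withDefault behaves in each case.

```elm
import Regex

maybeRegex =
    Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

-- If maybeRegex is Just someRegex, the result is someRegex.
-- If it is Nothing, the result is the fallback, Regex.never.
regex =
    Maybe.withDefault Regex.never maybeRegex

-- The same idea with plain numbers.
fromJust =
    Maybe.withDefault 0 (Just 5)

fromNothing =
    Maybe.withDefault 0 Nothing

-- fromJust is 5 and fromNothing is 0
```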
The following diagram illustrates how Maybe.withDefault works.
Step 5: Define the string that contains the launch time of Apollo 11 (09:32 a.m.).
Step 6: Use the Regex.contains function to check if apollo11 has any substrings that match the pattern defined in step 2.
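Putting steps 5 and 6 together might look like the sketch below. The text in apollo11 is a stand-in written for this sketch, and the names foundInApollo11 and foundElsewhere are made up to contrast a matching and a non-matching string.

```elm
import Regex

regex =
    Maybe.withDefault Regex.never
        (Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)")

-- Stand-in text (an assumption made for this sketch).
apollo11 =
    "On July 16, 1969, Apollo 11 lifted off from Kennedy Space Center at 09:32 a.m. EDT."

-- True: the text contains 09:32 a.m.
foundInApollo11 =
    Regex.contains regex apollo11

-- False: nothing in this text looks like a time.
foundElsewhere =
    Regex.contains regex "No time is mentioned here."
```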
Note: The String module also has a function named contains which works differently than Regex.contains. This is the second time we encountered two functions with the same name from different modules. This is quite common in Elm. In addition to grouping similar functions, a module also acts as a namespace. Therefore, String.contains and Regex.contains are two completely separate functions.
Extracting a Substring
We set out to extract the substring 09:32 a.m., but all we have done so far is verify that it exists. To extract it, we will use the Regex.find function, which is much more powerful than the String.slice function explained in the Substrings section. Regex.find and Regex.contains both take the exact same arguments:
1. A regular expression pattern that represents a substring.
2. Original string where the substring is hiding.
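A call to Regex.find can be sketched as follows. As before, the text in apollo11 is a stand-in, and matches is a name chosen for this sketch.

```elm
import Regex

regex =
    Maybe.withDefault Regex.never
        (Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)")

-- Stand-in text (an assumption made for this sketch).
apollo11 =
    "On July 16, 1969, Apollo 11 lifted off from Kennedy Space Center at 09:32 a.m. EDT."

-- A list with one record per match, in the order the matches occur.
-- This stand-in text contains a single time, so the list has one element.
matches =
    Regex.find regex apollo11
```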
The output is easier to read if it’s formatted like this:
Regex.find doesn’t return just the substring we’re looking for. It returns a list of records, each containing information about one match.
- Record
- For now, you can think of a record as a collection of key value pairs. We will cover it in detail later in the Record section.
As the output above shows, Regex.find returns four pieces of information about each match it found:
- index: The index of the matched substring in the original string.
- match: The substring we are looking for.
- number: If multiple substrings are found, Regex.find labels each match with a number starting at one. The first match is labeled 1, the second is labeled 2, and so on. This number will be important when replacing all occurrences of a substring later.
- submatches: Regex.find also looks for substrings that match any subpattern enclosed in parentheses inside the original pattern. In our case, the substring “a.m.” matches the subpattern (a\\.m\\.|p\\.m\\.). Each submatch that was found is wrapped inside Just. The examples in this section don’t take advantage of submatches, so it’s safe to ignore them for now.
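For readers who like to see types, the four fields above can be written down as a record type. The alias below mirrors, to the best of our knowledge, the Match type that the elm/regex package exposes. It is spelled out here only for illustration, so there is no need to define it yourself. The sample record is hand-written for a hypothetical text that begins with the time.

```elm
-- The shape of each record in the list returned by Regex.find.
type alias Match =
    { match : String
    , index : Int
    , number : Int
    , submatches : List (Maybe String)
    }

-- A hand-written sample: the text starts with the time, so index is 0.
sample =
    { match = "09:32 a.m."
    , index = 0
    , number = 1
    , submatches = [ Just "a.m." ]
    }
```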
We are one more step away from extracting our sneaky substring friend. We just need to figure out a way to get to that match key inside the record. We will use a function called List.map for that.
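A sketch of that step, using the same stand-in text as an assumption (launchTimes is a name chosen for this sketch):

```elm
import Regex

regex =
    Maybe.withDefault Regex.never
        (Regex.fromString "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)")

-- Stand-in text (an assumption made for this sketch).
apollo11 =
    "On July 16, 1969, Apollo 11 lifted off from Kennedy Space Center at 09:32 a.m. EDT."

-- The anonymous function reads the match field of each record.
launchTimes =
    List.map (\m -> m.match) (Regex.find regex apollo11)

-- launchTimes is [ "09:32 a.m." ]
```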
Finally, we have our substring. Well, it’s still wrapped in a list, but we will have to wait until the next section to find out how to free it from the clutches of List.
- List.map
- List.map creates a new list with the results of applying a provided function to every element in the original list. Just like the String.filter function we saw in the previous section, we give it an anonymous function that reads the value stored in the match key. We’ll cover it in detail in the Mapping a List section.
Finding Multiple Occurrences of a Substring
Let’s see an example of a string that has multiple occurrences of a substring we’re looking for.
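The sketch below uses a reconstruction of George’s speech as the text. The exact wording and line breaks are an assumption, chosen so that "quitter" appears four times. The constant is named string, as in the discussion that follows.

```elm
import Regex

-- Reconstructed quote (an assumption): "quitter" occurs four times,
-- counting the one inside "quitters".
string =
    """
I'm a quitter. I come from a long line of quitters.
My father was a quitter, my grandfather was a quitter.
I was raised to give up.
"""

regex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")

matches =
    Regex.find regex string

-- The number field of each match, in order: [ 1, 2, 3, 4 ]
matchNumbers =
    List.map (\m -> m.number) matches
```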
Here’s the output after some formatting:
Regex.find found four occurrences of the substring "quitter". Notice how the value for the number key is incremented for each match.
Replacing a Substring
Although George likes to brag about how good of a quitter he is, let’s make him a go-getter for once. We’ll replace all occurrences of the substring "quitter" with "go-getter" in the original string. To do that we will need to use the Regex.replace function.
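A sketch of the replacement, again with a reconstructed version of the quote as an assumption (replaced is a name chosen for this sketch):

```elm
import Regex

-- Reconstructed quote (an assumption made for this sketch).
string =
    """
I'm a quitter. I come from a long line of quitters.
My father was a quitter, my grandfather was a quitter.
I was raised to give up.
"""

regex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")

-- The anonymous function ignores its argument and always returns "go-getter",
-- so every occurrence of "quitter" is replaced.
replaced =
    Regex.replace regex (\_ -> "go-getter") string
```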
Here’s how the output looks after some formatting:
The \n characters and extra spaces have been removed to make the output look nicer. In the original output, you’ll see an extra space and \n at the beginning. That’s because when we defined the string constant with multi-line syntax, we used a space and \n to make the code look nicer.
We could have just begun our string in the first line like this:
The Regex.replace function takes three arguments:
1. A regular expression that contains the pattern for matching the substring we want to replace.
2. A function that takes a match as an argument and returns a replacement. Notice how we used _ in place of a parameter name inside the anonymous function. Since our function simply returns a new substring ("go-getter") and doesn’t use the matched substring ("quitter") in any form, we are not interested in the argument. By using _, we are essentially ignoring the argument.
3. The original string that contains the substring we want to replace.
Splitting a String
In the Splitting a String section earlier, we learned how to split a string by using the String.split function. That function is limited to splitting a string based only on a fixed separator. What if we need to split a string using a complex pattern? For example, the following string contains two instances of time: 09:32 and 10:56.
Let’s say we want to split it right where the times appear into three substrings like this:
Not sure why we would arbitrarily split a string like that, but if we do we will need something more flexible than String.split. As it so happens, the Regex module also provides a function called split which uses a regular expression pattern to break a string into a list of substrings. Let’s try it out.
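Here is a sketch of Regex.split in action. The sentence in events is written for this sketch (it only needs to contain the two times), and timeRegex and pieces are names chosen here.

```elm
import Regex

-- Matches a time written as two digits, a colon, and two digits.
timeRegex =
    Maybe.withDefault Regex.never (Regex.fromString "\\d\\d:\\d\\d")

-- Stand-in text (an assumption made for this sketch).
events =
    "Apollo 11 lifted off at 09:32 and Armstrong stepped onto the Moon at 10:56, four days later."

-- The matched times act as separators and are left out of the result.
pieces =
    Regex.split timeRegex events

-- pieces is
-- [ "Apollo 11 lifted off at "
-- , " and Armstrong stepped onto the Moon at "
-- , ", four days later."
-- ]
```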
Here’s the output after some formatting:
Removing Case Sensitivity
Regular expressions are case-sensitive by default. If we want to make them case-insensitive, we need to use the Regex.fromStringWith function to create a regex. Let’s try it out.
Regex.fromStringWith takes an options record that lets us specify not only case sensitivity but also the multiline status.
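A minimal sketch follows. The pattern and the test strings are illustrative, and the names options, relaxedRegex, and strictRegex are made up for this example.

```elm
import Regex

-- Ignore case; leave multiline matching off.
options =
    { caseInsensitive = True, multiline = False }

relaxedRegex =
    Maybe.withDefault Regex.never (Regex.fromStringWith options "quitter")

strictRegex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")

-- True: case is ignored.
relaxedMatch =
    Regex.contains relaxedRegex "I am a QUITTER."

-- False: the default regex is case-sensitive.
strictMatch =
    Regex.contains strictRegex "I am a QUITTER."
```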
We covered almost everything in the Regex module, but you can still check out its full documentation to learn more.