3.15

Regular Expression

So far we have been able to extract a substring by specifying its start and end locations within a string. What if we need something more sophisticated than this? Let’s say we want to extract the launch time of Apollo 11 (09:32 a.m.) hiding inside the following text.

On July 16, 1969, the massive Saturn V rocket lifted
off from NASA's Kennedy Space Center at 09:32 a.m. EDT.
Four days later, on July 20, Neil Armstrong and Buzz Aldrin
landed on the Moon.

It’s possible to accomplish this by using string functions, but it’s also tedious. With a regular expression, it’s a piece of cake once we know how to define a pattern.

> import Regex
> pattern = Regex.regex "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"
{}

> string = """ \
| On July 16, 1969, the massive Saturn V rocket \
| lifted off from NASA's Kennedy Space Center at \
| 09:32 a.m. EDT. Four days later, on July 20, Neil \
| Armstrong and Buzz Aldrin landed on the Moon. \
| """

> Regex.contains pattern string
True

A regular expression (regex) is a pattern for matching character combinations in a string. It is an incredibly powerful tool for working with strings, yet the basic concepts behind it aren’t that complex. Before we attempt to understand the code above, let’s get familiar with the basics of regex first. After that we will come back to using them in Elm.

If you already know how regular expressions work, feel free to skip the Regular Expression Basics section below.

Regular Expression Basics

Matching a single character

In a regex, letters and numbers match themselves. For example, A will match A and 1 will match 1. Regex is case-sensitive. So A won’t match a.

Matching multiple characters

To match multiple characters, we just repeat the character. AAA will match three A’s in a row and 123 will match the number 123.

Dot character

Instead of repeating a character, we can use the dot character to match multiple characters, like this: A.. (an A followed by two dots). The dot character matches any single character other than a line break. That character can be a letter, a number, or any special character. So A.. will match AAA, ABB, ACC, ABC, and so on. It will also match A12, A$3, A^%, and so on. Special characters such as . allow us to create patterns instead of explicitly specifying all characters in a substring.

Some punctuation characters have special meanings in regex. So if we want to match a literal dot, we must precede it with a backslash, which turns off the dot’s special meaning and lets regex treat it as a regular character. Here are some examples that match a literal dot.

  • \. will match a literal dot.

  • Dr\. Strange will match Dr. Strange.

  • Kevin Malone, Esq\. will match Kevin Malone, Esq.

  • 7\.67 will match 7.67

Matching a set

So far we have focused on matching a specific character. What if we want to match characters of the same kind, for example only digits? We can use a set for that. Sets are created by wrapping the characters in square brackets. Here are some examples:

  • [0123456789] will match a single digit.

  • [aeiou] will match a single lowercase vowel.

  • [AEIOU] will match a single uppercase vowel.

It’s important to note that a set will match only a single character. For example [Pp] will match a single character (either P or p), not PP, pp, or Pp. If we want to match more than one character, we can just repeat the set like this: [Pp][Pp] and it will match all of these combinations: PP, Pp, pp, pP.

One of the reasons why regexes are so powerful is because we can build more and more complex patterns by progressively combining different kinds of expressions.

Using ranges to make sets more succinct

Sets don’t scale well. For example, if we want to match two lowercase letters, we will have to use this: [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]. Yikes! This is where a range comes in handy. Instead of typing each element in a set, we just define a range. We can rewrite that super long regex with ranges like this: [a-z][a-z]. All we need to do is specify the beginning and end of a sequence of characters separated by a dash.

Similarly, [0-9] will match any digit. And [A-Za-z0-9_] will match any word character (letter, number, or underscore). Since certain character sets are used often, regex provides a series of shorthands for representing them. Here are some examples:

  • \d will match any digit. It is short for [0-9].

  • \w will match any word character. It is short for [A-Za-z0-9_].

  • \s will match any whitespace character. It is roughly short for [ \t\r\n\f\v], where the first character inside the brackets is a space.

  • \s\d will match a whitespace character followed by a digit.

  • [\da-f] will match a hexadecimal digit. It is short for [0-9a-f]. Here we combined a shorthand with a range. We can combine expressions however we want.
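
The sets, ranges, and shorthands above can be tried out in Elm with a small sketch like the one below. It assumes the same Regex module used at the start of this section; the names hexDigit and twoLowercase are made up for illustration. The doubled backslash in \\d is explained later in this section.

```elm
import Regex


-- Any single hexadecimal digit: the \d shorthand combined with a range.
-- Inside an Elm string the shorthand has to be written as \\d.
hexDigit : Regex.Regex
hexDigit =
    Regex.regex "[\\da-f]"


-- Two lowercase letters in a row, written with ranges.
twoLowercase : Regex.Regex
twoLowercase =
    Regex.regex "[a-z][a-z]"


-- Regex.contains hexDigit "xyz" == False       (x, y, and z are not in the set)
-- Regex.contains hexDigit "x7z" == True        (7 is a digit)
-- Regex.contains twoLowercase "A1b2" == False  (no two lowercase letters are adjacent)
-- Regex.contains twoLowercase "Elm" == True    (matches "lm")
```

Keep in mind that Regex.contains only asks whether some substring matches, so a set finds its character anywhere in the string.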

Matching alternatives

Sometimes we need to match one expression or another. We can use a vertical bar for that. Here are some examples:

  • X|Y will match either X or Y

  • EST|PST will match either EST or PST

  • Jim|Pam will match the name of one of the two biggest pranksters, Jim or Pam

  • am|a\.m\.|pm|p\.m\. will match am or a.m., or pm or p.m.

  • [0-9]|[a-zA-Z] will match a digit or a letter
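
Here is a minimal sketch of alternatives in Elm, assuming the Regex module used at the start of this section. The names timeZone and meridiem are hypothetical. The last comment shows a trap worth knowing about: a short alternative such as am can match inside a longer word.

```elm
import Regex


-- Either EST or PST.
timeZone : Regex.Regex
timeZone =
    Regex.regex "EST|PST"


-- am, a.m., pm, or p.m. The dots are escaped so they match literally.
meridiem : Regex.Regex
meridiem =
    Regex.regex "am|a\\.m\\.|pm|p\\.m\\."


-- Regex.contains timeZone "The call starts at noon PST" == True
-- Regex.contains timeZone "lifted off at 09:32 a.m. EDT" == False
-- Regex.contains meridiem "09:32 a.m." == True
-- Regex.contains meridiem "09:32" == False
-- Regex.contains meridiem "Pam" == True   (the "am" inside "Pam" matches)
```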

Matching zero or more characters with asterisk

Asterisk (*) is probably the most powerful character in regex. Like the dot character, it has a special meaning. It matches zero or more of the thing that came just before it. Let’s see some examples.

  • a* will match a. It will also match aa or aaa or aaaa or aaaaa and so on and so forth. Here “a” is the thing that came before *. So the pattern will match zero or more a’s. Because of that it will also match an empty string.

  • ab* will match a, ab, abb, abbb, abbbb, etc.

The asterisk doesn’t have to be at the end. We can put it anywhere we want.

  • a*b will match b, ab, aab, aaab, aaaab, etc.

  • Cree*d will match Creed, Creeed, Creeeed, Creeeeed, etc.

We can’t put it at the front, though, because then there won’t be anything to repeat. Since regex doesn’t put any limitations on how we combine expressions, we can use * with sets or ranges like this:

  • [0-9]* will match any number of digits

  • [a-z]* will match any number of lowercase letters

  • [a-zA-Z0-9_]* will match any number of word characters (letter, number, or underscore)

We can even combine * with special shorthands like this:

  • \d* will match any number of digits

  • \w* will match any number of word characters

  • \s* will match any number of whitespace characters
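
The "zero or more" behavior is easy to misread, so here is a small sketch in Elm, assuming the Regex module used at the start of this section. The names creed and anyAs are made up for illustration.

```elm
import Regex


-- C, r, e, then zero or more e's, then d.
creed : Regex.Regex
creed =
    Regex.regex "Cree*d"


-- Zero or more a's. This pattern can match the empty string.
anyAs : Regex.Regex
anyAs =
    Regex.regex "a*"


-- Regex.contains creed "Creeeed" == True
-- Regex.contains creed "Cred" == True    (e* matched zero e's)
-- Regex.contains creed "Crd" == False    (the first e is not optional)
-- Regex.contains anyAs "xyz" == True     (the empty match counts)
```

The last line is the surprising one: because a* is satisfied by zero a's, it finds a match in every string, even one with no a in it.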

Mother of all regexes

Remember, earlier I said * is probably the most powerful character in regex. There is a reason for that. If we combine it with the dot character, we can create the mother of all regexes: .*. It will match any number of any characters (other than line breaks). Basically, it will match anything. That is because the dot matches a single character (it doesn’t matter which one) and the asterisk repeats the dot any number of times. If it matches everything, how can it be useful? Let’s see some examples:

  • .*Kramer will match the name of anyone whose last name is Kramer.

  • Cosmo.* will match the name of anyone whose first name is Cosmo.

  • .*Sacamano.* will match the name that has Sacamano somewhere in it.
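
As a rough sketch of how .* is typically used between two fixed pieces of text, here is an Elm example that assumes the Regex module used at the start of this section. The name cosmoKramer is hypothetical.

```elm
import Regex


-- Cosmo, then anything at all (or nothing), then Kramer.
cosmoKramer : Regex.Regex
cosmoKramer =
    Regex.regex "Cosmo.*Kramer"


-- Regex.contains cosmoKramer "Cosmo Kramer" == True
-- Regex.contains cosmoKramer "Cosmo K. Kramer" == True
-- Regex.contains cosmoKramer "Kramer, Cosmo" == False   (the order matters)
-- Regex.contains cosmoKramer "Cosmo\nKramer" == False   (the dot does not match a line break)
```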

That should be enough basics about regular expressions for you to feel comfortable with the examples shown below. There is so much more to regular expressions; we have barely scratched the surface. Multiple books have been written on the subject. There are also plenty of online tools for building and testing complex regexes.

Regular Expressions in Elm

Now that we have the basics of regular expressions down, it’s time to understand that glorious pattern we wrote for extracting the launch time of Apollo 11 at the beginning of this section. Here is that code again:

> import Regex
> pattern = Regex.regex "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"
{}

> string = """ \
| On July 16, 1969, the massive Saturn V rocket \
| lifted off from NASA's Kennedy Space Center at \
| 09:32 a.m. EDT. Four days later, on July 20, Neil \
| Armstrong and Buzz Aldrin landed on the Moon. \
| """

> Regex.contains pattern string
True

We start by importing the Regex module. Like the String module, it too comes preloaded with the Elm Platform, but doesn’t get loaded into the repl automatically. For the rest of the examples in this section, we will assume that the Regex module is already imported.

Next we use the regex function to define a pattern that will match the time (09:32 a.m.) we’re looking for. The string where the time is hidden is defined with a multi-line syntax. Finally, we use the contains function from the Regex module (not the one from the String module) to check if any substring in the string matches our pattern.

This is the second time we encountered two functions with the same name from different modules. This is quite common in Elm. In addition to grouping similar functions, modules also act as namespaces. Therefore, String.contains and Regex.contains are two completely separate functions.

The pattern we used as an argument to the regex function looks confusing, doesn’t it? When we were learning the basics of regular expressions in the previous section, we used only one backslash to either use a special shorthand or escape a dot, like this: \d and \.. Why then do we need two backslashes in Elm code?

That’s because \ has a special purpose in Elm: to escape other characters. When you place it before another character, it removes the second character’s special Elm meaning and lets it be a regular character instead. If you recall from the String section, when we used a double quote inside a single-line string, we had to escape it like this so the string wouldn’t end early: "Michael Scott's Rabies Awareness \"Fun Run\" Race for the Cure". If you want to use a literal \ it needs to be escaped just like double quotes do. Put two backslashes in a row (\\) and Elm will understand that you want to use the literal \ character. If you use just one \ Elm will think you are telling it to remove the next character’s special Elm meaning.
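
One way to convince ourselves that the doubled backslash is just Elm’s spelling of a single one is to count characters. The sketch below assumes the Regex module used in this section; the names shorthandLength and timePattern are made up for illustration.

```elm
import Regex
import String


-- "\\d" is an Elm string of two characters: a backslash followed by d.
shorthandLength : Int
shorthandLength =
    String.length "\\d"


-- The regex engine receives \d\d:\d\d, which is nine characters long.
timePattern : Regex.Regex
timePattern =
    Regex.regex "\\d\\d:\\d\\d"


-- shorthandLength == 2
-- String.length "\\d\\d:\\d\\d" == 9
-- Regex.contains timePattern "09:32" == True
-- Regex.contains timePattern "9:32" == False   (only one digit before the colon)
```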

Extracting a Substring

We set out to extract the substring 09:32 a.m., but all we have done so far is verify that it exists. To extract it we will use the Regex.find function which is much more powerful than the slice function from the String module. find takes three parameters:

  1. The number of occurrences of the substring we want to find. For example, All will find all occurrences, whereas AtMost 2 will find at most two occurrences.

  2. Regular expression pattern representing the substring.

  3. Original string where the substring is hiding.

Let’s use the find function to match 09:32 a.m.

> launchTimes = Regex.find (Regex.AtMost 1) pattern string
[{ match = "09:32 a.m.", submatches = [Just "a.m."], index = 97, number = 1 }]

The output is a little hard to read. So let me format it to make it look nicer.

[
  { match = "09:32 a.m.",
    submatches = [Just "a.m."],
    index = 97,
    number = 1
  }
]

find doesn’t return just the substring we’re looking for. It returns a list of records containing information about the matches.

For now, you can think of a record as a collection of key value pairs. We will cover it in detail in the Record section.

As the output above shows, find returns four pieces of information about each match it found:

  1. The substring we are looking for.

  2. Submatches - find also looks for substrings that match any subpattern included inside the original pattern. In our case, the substring “a.m.” matches the subpattern (a\\.m\\.|p\\.m\\.). All submatches are wrapped inside Just. Don’t worry about Just for now; we’ll cover it later. The examples in this section don’t take advantage of submatches. So it’s safe to ignore them too for now.

  3. The index of the matched substring in the original string.

  4. If find matches multiple substrings, it labels each match with a number starting at one. The first match is labeled 1, the second is labeled 2, and so on and so forth. This number will be important when replacing all occurrences of a substring later.

We are one more step away from extracting our sneaky substring friend. We just need to figure out a way to get to that match key inside the record. We will use a function called List.map for that.

> List.map (\launchTime -> launchTime.match) launchTimes
["09:32 a.m."]

Finally, we have our substring. Well, it’s still wrapped in a list, but we will have to wait until the next section to find out how to free it from the clutches of List.

List.map

List.map creates a new list with the results of applying a provided function to every element in the original list. Like the String.filter function we saw in the previous section, it takes a function as its first argument. Here we give it an anonymous function that reads the value stored in the match key.
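
The two steps above (find, then List.map) can be wrapped into one helper. This is only a sketch: matchedText is a hypothetical name, and .match is Elm’s record accessor shorthand for the anonymous function \launchTime -> launchTime.match.

```elm
import Regex


-- Return just the matched substrings, dropping the rest of each record.
matchedText : Regex.Regex -> String -> List String
matchedText pattern text =
    List.map .match (Regex.find Regex.All pattern text)


-- matchedText (Regex.regex "\\d\\d:\\d\\d") "at 09:32 a.m. and 10:56 p.m."
--     == [ "09:32", "10:56" ]
```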

Finding multiple occurrences of a substring

Let’s see an example of a string that has multiple occurrences of a substring we’re looking for.

> pattern = Regex.regex "quitter"
{}

> string = """ \
| I'm a great quitter. It's one of the few things \
| I do well. I come from a long line of quitters. \
| My father was a quitter, my grandfather was a \
| quitter... I was raised to give up. \
| """

> Regex.find Regex.All pattern string
...

Here’s the output after some formatting:

[
  { match = "quitter",
    submatches = [],
    index = 14,
    number = 1
  },
  { match = "quitter",
    submatches = [],
    index = 89,
    number = 2
  },
  { match = "quitter",
    submatches = [],
    index = 116,
    number = 3
  },
  { match = "quitter",
    submatches = [],
    index = 147,
    number = 4
  }
]

This time we provided Regex.All as the first argument to find because we are looking for all occurrences of the substring “quitter”. It found four of those. Notice how the value of the number key is incremented for each match.
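
For comparison, here is a small sketch of what happens when we cap the number of matches with AtMost instead of using All. The helper name firstTwoIndexes is made up for illustration, and the shorter input string keeps the indexes easy to check by hand.

```elm
import Regex


-- Positions of at most the first two occurrences of "quitter".
firstTwoIndexes : String -> List Int
firstTwoIndexes text =
    List.map .index (Regex.find (Regex.AtMost 2) (Regex.regex "quitter") text)


-- firstTwoIndexes "quitter, quitter, quitter" == [ 0, 9 ]
-- firstTwoIndexes "no match here" == []
```

The third occurrence is ignored because find stops as soon as it has collected two matches.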

Replacing a Substring

Although George likes to brag about how good of a quitter he is, let’s make him a go-getter for once. We’ll replace all occurrences of the substring “quitter” with “go-getter” in the original string. To do that we will have to use the replace function.

> Regex.replace Regex.All pattern (\_ -> "go-getter") string
...

Here’s how the output looks after some formatting:

"I'm a great go-getter. It's one of the few things
I do well. I come from a long line of go-getters.
My father was a go-getter, my grandfather was a
go-getter... I was raised to give up."

I removed the \n characters and extra spaces to make the output look nicer. You will also see an extra space and \n character at the beginning of the resulting string. That’s because when we defined the string constant with multi-line syntax, we used a space and \n to make the code look nicer.

> string = """ \
| I'm a great quitter. It's one of the few things \
...

We could just as well have begun our string on the first line like this:

> string = """I'm a great quitter. It's one of the few things \
...

The replace function takes four arguments:

  1. How many matches to replace. We passed Regex.All because we want to replace all occurrences of “quitter”. If we want to replace just the first occurrence, we can pass (Regex.AtMost 1).

  2. Pattern for matching the substring we want to replace.

  3. A function that takes the match (the same kind of record that find returns) as an argument and returns a replacement string. Notice how we used _ in place of a parameter name in our anonymous function. Since our function simply returns a completely new substring (“go-getter”) and doesn’t attempt to modify the original substring (“quitter”) in any form, we are not interested in what’s passed to our function. By using _, we are essentially ignoring the argument.

  4. The original string that contains the substring we want to replace.
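
When the replacement does depend on what was matched, we can name the argument instead of ignoring it. The sketch below assumes the same Regex module and the toString function available in the Elm version used here; shout and numbered are hypothetical names.

```elm
import Regex
import String


-- Use the matched text itself: replace each match with its uppercase form.
shout : String -> String
shout text =
    Regex.replace Regex.All
        (Regex.regex "quitter")
        (\found -> String.toUpper found.match)
        text


-- Use the number key to label each occurrence.
numbered : String -> String
numbered text =
    Regex.replace Regex.All
        (Regex.regex "quitter")
        (\found -> "quitter #" ++ toString found.number)
        text


-- shout "a quitter" == "a QUITTER"
-- numbered "quitter, quitter" == "quitter #1, quitter #2"
```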

Splitting a String

We already know how to split a string using the String.split function. But that function is limited to splitting a string based only on a fixed separator. What if we need to split a string using a complex pattern? For example, the following string contains two instances of time: 09:32 and 10:56.

> string = """On July 16, 1969, the massive Saturn \
| V rocket lifted off from NASA's Kennedy Space Center \
| at 09:32 a.m. EDT. Four days later, on July 20 at \
| 10:56 p.m. EDT, Neil Armstrong and Buzz Aldrin landed \
| on the Moon. \
| """

Let’s say we want to split it right where the times appear into three substrings like this:

"On July 16, 1969, the massive Saturn V rocket lifted
off from NASA's Kennedy Space Center at ",

" a.m. EDT. Four days later, on July 20 at",

" p.m. EDT, Neil Armstrong and Buzz Aldrin landed on the Moon."

Not sure why we would arbitrarily split a string like that, but if we do we will need something more flexible than String.split. As it so happens, the Regex module also provides a function called split which uses a pattern to break a string into a list of substrings.

> pattern = Regex.regex "\\d\\d:\\d\\d"
{}

> Regex.split Regex.All pattern string
...

Here’s the output after some formatting:

[
  "On July 16, 1969, the massive Saturn \nV rocket lifted off from NASA's Kennedy Space Center \nat ",

  " a.m. EDT. Four days later, on July 20 at \n",

  " p.m. EDT, Neil Armstrong and Buzz Aldrin landed \non the Moon. \n"
]
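
A more everyday use of Regex.split is a separator that varies, which String.split can’t express. Here is a minimal sketch with two hypothetical helpers: names splits on a comma followed by any amount of whitespace, and splitOnce shows how AtMost limits the number of splits.

```elm
import Regex


-- Split on a comma followed by zero or more whitespace characters.
names : String -> List String
names text =
    Regex.split Regex.All (Regex.regex ",\\s*") text


-- Split only at the first time that appears in the text.
splitOnce : String -> List String
splitOnce text =
    Regex.split (Regex.AtMost 1) (Regex.regex "\\d\\d:\\d\\d") text


-- names "Jim,Pam,  Dwight" == [ "Jim", "Pam", "Dwight" ]
-- splitOnce "at 09:32 a.m. and 10:56 p.m."
--     == [ "at ", " a.m. and 10:56 p.m." ]
```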

Removing case sensitivity

Regular expressions are case-sensitive by default. Here’s an example:

> pattern = Regex.regex "phoenix"
{}

> Regex.contains pattern "I'm like the Phoenix, rising from Arizona."
False

We can turn this behavior off by running our pattern through the caseInsensitive function.

> pattern = Regex.caseInsensitive (Regex.regex "phoenix")
{}

> Regex.contains pattern "I'm like the Phoenix, rising from Arizona."
True
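
A case-insensitive pattern works with the other Regex functions too, not just contains. The sketch below uses a hypothetical helper, phoenixes, to collect every spelling that find comes across.

```elm
import Regex


-- Collect every occurrence of "phoenix", whatever its capitalization.
phoenixes : String -> List String
phoenixes text =
    List.map .match
        (Regex.find Regex.All (Regex.caseInsensitive (Regex.regex "phoenix")) text)


-- phoenixes "Phoenix, phoenix, PHOENIX" == [ "Phoenix", "phoenix", "PHOENIX" ]
```

Note that the matches keep their original capitalization; only the comparison ignores case.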

We covered almost everything in the Regex module. The module’s official documentation covers the rest.
