3.15

Regular Expression

So far we have been able to extract a substring by specifying its start and end locations within a string like this:

> String.slice 0 5 "Bears. Beets. Battlestar Galactica."
"Bears"

What if we need something more sophisticated than this? Let’s say we want to extract the launch time of Apollo 11 (09:32 a.m.) hiding inside the following text.

On July 16, 1969, the massive Saturn V rocket lifted
off from NASA's Kennedy Space Center at 09:32 a.m. EDT.
Four days later, on July 20, Neil Armstrong and Buzz Aldrin
landed on the Moon.

It’s possible to accomplish this by using string functions, but it’s also tedious. With regular expressions it’s a piece of cake once we know how to define a pattern. Let’s install the elm/regex package, which contains functions and values for working with regular expressions. Run the following command from the beginning-elm directory in the terminal.

$ elm install elm/regex

Answer Y and start an elm repl session. Now import the Regex module which resides in the elm/regex package.

> import Regex

Next, we’ll define a pattern for 09:32 a.m. and use various functions in the Regex module to check whether that substring exists in the text above. Don’t worry about understanding the code below yet. It’s there to give you an idea of how regular expressions work in Elm. We’ll go over it in detail after we understand the basics of regular expressions.

> pattern = "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"
"\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

> maybeRegex = Regex.fromString pattern
Just {}

> regex = Maybe.withDefault Regex.never maybeRegex
{}

> apollo11 = """ \
|   On July 16, 1969, the massive Saturn V rocket \
|   lifted off from NASA's Kennedy Space Center at \
|   09:32 a.m. EDT. Four days later, on July 20, Neil \
|   Armstrong and Buzz Aldrin landed on the Moon. \
|   """

> Regex.contains regex apollo11
True

Regular Expression Basics

A regular expression (regex) is a pattern for matching character combinations in a string. It is an incredibly powerful tool for working with strings, yet the basic concepts behind it aren’t that complex. Before we attempt to understand the code above, let’s get familiar with the basics of regular expressions first. After that we will learn how to use them in Elm.

Note: If you already know how regular expressions work, feel free to skip to the Regular Expressions in Elm section below.

Matching a single character

In a regular expression, letters and numbers match themselves. For example, A will match A and 1 will match 1. Regular expressions are case-sensitive by default, so A won’t match a.

Matching multiple characters

To match multiple characters, we just need to repeat the character. AAA will match three As in a row and 123 will match the number 123.

Dot character

Instead of spelling out every character, we can use the dot character as a placeholder, like this: A.. (an A followed by two dots). Each dot matches exactly one character, which can be a letter, a number, or any special character (a line break is the usual exception). So A.. will match AAA, ABB, ACC, ABC, and so on. It will also match A12, A$3, and A^%. Punctuation characters such as . allow us to create patterns instead of explicitly specifying all characters in a substring.

Some punctuation characters, the dot included, have special meanings in a regular expression. So if we want to match a literal dot, we must precede it with a backslash, which turns off the dot’s special meaning and lets regex treat it as a regular character. Here are some examples that match a literal dot.

  • \. will match a literal dot.

  • Dr\. Strange will match Dr. Strange.

  • Kevin Malone, Esq\. will match Kevin Malone, Esq.

  • 7\.67 will match 7.67
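To see the difference between a plain dot and an escaped one, here is a minimal Elm sketch. The matches helper is hypothetical (it is not part of the Regex module); it simply wraps Regex.fromString and Regex.contains, which are explained later in this section, as is the doubled backslash that Elm strings need.

```elm
module DotExamples exposing (..)

import Regex


-- Hypothetical helper: True when `pattern` matches somewhere inside `text`.
-- An invalid pattern falls back to Regex.never, which matches nothing.
matches : String -> String -> Bool
matches pattern text =
    Regex.contains
        (Maybe.withDefault Regex.never (Regex.fromString pattern))
        text


-- A dot stands for any single character, so A.. accepts A followed by any two.
dotMatchesAnything : Bool
dotMatchesAnything =
    matches "A.." "A$3"


-- Unescaped, the dot in 7.67 also accepts other characters in that position.
unescapedDotIsLoose : Bool
unescapedDotIsLoose =
    matches "7.67" "7x67"


-- Escaped, the pattern only accepts a literal dot there.
escapedDotIsStrict : Bool
escapedDotIsStrict =
    matches "7\\.67" "7.67" && not (matches "7\\.67" "7x67")
```

All three values should evaluate to True.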

Matching a set

So far we have focused on matching a specific character. What if we want to match characters of the same kind, for example only numbers? We can use a set for that. Sets are created by wrapping the characters in square brackets. Here are some examples:

  • [0123456789] will match a single digit.

  • [aeiou] will match a single lowercase vowel.

  • [AEIOU] will match a single uppercase vowel.

It’s important to note that a set will match only a single character. For example [Pp] will match a single character (either P or p), not PP, pp, or Pp. If we want to match more than one character, we can just repeat the set like this: [Pp][Pp] and it will match all of these combinations: PP, Pp, pp, pP.

One of the reasons why regular expressions are so powerful is because we can build more and more complex patterns by progressively combining different kinds of expressions.

Using ranges to make sets more succinct

Sets don’t scale well. For example, if we want to match two lowercase letters, we will have to use this: [abcdefghijklmnopqrstuvwxyz][abcdefghijklmnopqrstuvwxyz]. Yikes! This is where a range comes in handy. Instead of typing each element in a set, we just define a range. We can rewrite that super long regex with ranges like this: [a-z][a-z]. All we need to do is specify the beginning and end of a sequence of characters separated by a dash.

Similarly, [0-9] will match any digit. And [A-Za-z0-9_] will match any word character (letter, number, or underscore). Since certain character sets are used often, regex provides a series of shorthands for representing them. Here are some examples:

  • \d will match any digit. It is short for [0-9].

  • \w will match any word character. It is short for [A-Za-z0-9_].

  • \s will match any whitespace character. It is roughly short for [ \t\r\n\f] (note the space right after the opening bracket).

  • \s\d will match a whitespace character followed by a digit.

  • [\da-f] will match a hexadecimal digit. It is short for [0-9a-f]. Here we combined a shorthand with a range. We can combine expressions however we want.
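The sets, ranges, and shorthands above can be tried out with a small sketch like the one below. As before, matches is a hypothetical helper built on Regex.fromString and Regex.contains from the elm/regex package, and the sample strings are made up for illustration.

```elm
module SetExamples exposing (..)

import Regex


-- Hypothetical helper: True when `pattern` matches somewhere inside `text`.
matches : String -> String -> Bool
matches pattern text =
    Regex.contains
        (Maybe.withDefault Regex.never (Regex.fromString pattern))
        text


-- "sky" has no lowercase vowel, so the set finds nothing.
noVowel : Bool
noVowel =
    not (matches "[aeiou]" "sky")


-- Two ranges in a row need two lowercase letters side by side.
twoLowercase : Bool
twoLowercase =
    matches "[a-z][a-z]" "ok" && not (matches "[a-z][a-z]" "A1b")


-- \s\d: a whitespace character followed by a digit, as in "room 7".
spaceThenDigit : Bool
spaceThenDigit =
    matches "\\s\\d" "room 7"


-- A shorthand combined with a range: one hexadecimal digit.
hexDigit : Bool
hexDigit =
    matches "[\\da-f]" "c" && not (matches "[\\da-f]" "XYZ")
```

Each of these values should come out as True.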

Matching alternatives

Sometimes we need to match this character or that character. We can use a vertical bar for that. Here are some examples:

  • X|Y will match either X or Y.

  • EST|PST will match either EST or PST.

  • Jim|Pam will match the name of one of the two biggest pranksters in Dunder Mifflin’s history: Jim or Pam.

  • am|a\.m\.|pm|p\.m\. will match am or a.m., or pm or p.m.

  • [0-9]|[a-zA-Z] will match a digit or a letter.
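One detail worth checking is how far the vertical bar reaches: it separates everything on its left from everything on its right. That is why the Apollo 11 pattern wraps its alternatives in parentheses. The sketch below compares the two spellings; matches is a hypothetical helper, and the sample sentence is made up.

```elm
module AlternativeExamples exposing (..)

import Regex


-- Hypothetical helper: True when `pattern` matches somewhere inside `text`.
matches : String -> String -> Bool
matches pattern text =
    Regex.contains
        (Maybe.withDefault Regex.never (Regex.fromString pattern))
        text


-- Parentheses limit the choice to "a.m." or "p.m." after the time.
withParens : String
withParens =
    "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"


-- Without them, the choice is between "a time followed by a.m." and a bare "p.m.".
withoutParens : String
withoutParens =
    "\\d\\d:\\d\\d a\\.m\\.|p\\.m\\."


-- No time appears in this sentence, so only the loose pattern matches.
looseMatchesBarePm : Bool
looseMatchesBarePm =
    matches withoutParens "see you this p.m."


strictNeedsATime : Bool
strictNeedsATime =
    not (matches withParens "see you this p.m.")
```

Both values should be True, which shows that the parentheses change what the pattern accepts.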

Matching zero or more characters with asterisk

Asterisk (*) is probably the most powerful character in a regular expression. Like the dot character, it has a special meaning. It matches zero or more of the thing that came just before it. Let’s see some examples.

  • a* will match a. It will also match aa or aaa or aaaa or aaaaa and so on. Here a is the thing that came before *. So the pattern will match zero or more as. Because of that it will also match an empty string.

  • ab* will match a, ab, abb, abbb, abbbb, etc.

The asterisk doesn’t have to be at the end. We can put it anywhere we want.

  • a*b will match b, ab, aab, aaab, aaaab, etc.

  • Cree*d will match Creed, Creeed, Creeeed, Creeeeed, etc.

We can’t put it in the front though because then there won’t be anything to repeat. Since regular expressions don’t put any limitations on how we combine expressions, we can use * with sets or ranges too.

  • [0-9]* will match any number of digits.

  • [a-z]* will match any number of lowercase letters.

  • [a-zA-Z0-9_]* will match any number of word characters (letter, number, or underscore).

We can even combine * with special shorthands like this:

  • \d* will match any number of digits.

  • \w* will match any number of word characters.

  • \s* will match any number of white spaces.
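The "zero or more" rule has a couple of consequences that are easy to miss, so here is a small sketch of them. The matches helper is hypothetical, and the sample strings are made up for illustration.

```elm
module AsteriskExamples exposing (..)

import Regex


-- Hypothetical helper: True when `pattern` matches somewhere inside `text`.
matches : String -> String -> Bool
matches pattern text =
    Regex.contains
        (Maybe.withDefault Regex.never (Regex.fromString pattern))
        text


-- Zero b's is fine: ab* matches a plain "a".
zeroRepeats : Bool
zeroRepeats =
    matches "ab*" "a"


-- Only the second e is optional in Cree*d, so "Cred" matches but "Crd" does not.
onlyLastEIsOptional : Bool
onlyLastEIsOptional =
    matches "Cree*d" "Cred" && not (matches "Cree*d" "Crd")


-- a* can match the empty string, so it "matches" text with no a in it at all.
matchesEmpty : Bool
matchesEmpty =
    matches "a*" "xyz"
```

All three values should be True. The last one is a reminder that a pattern made only of starred pieces will succeed on any input.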

Mother of all regular expressions

Remember, earlier I said * is probably the most powerful character in a regular expression. There is a reason for that. If we combine it with the dot character, we can create the mother of all regular expressions: .*. It will match any number of any characters. Basically, it will match anything. That is because the dot matches a single character (doesn’t matter what character it is) and the asterisk matches any number of characters represented by the dot. If it matches everything, how can it be useful? Let’s see some examples:

  • .*Kramer will match the name of anyone whose last name is Kramer.

  • Cosmo.* will match the name of anyone whose first name is Cosmo.

  • .*Sacamano.* will match the name that has Sacamano somewhere in it.

That should be enough basics about regular expressions for you to feel comfortable with the examples shown below.

Note: There is so much more to regular expressions. We barely scratched the surface here. If you are interested in exploring them further, there are plenty of resources online including tools for building and testing complex regular expressions.

Regular Expressions in Elm

Now that we have the basics of regular expressions down, it’s time to understand that glorious pattern we wrote earlier for extracting the launch time of Apollo 11. Here is that code again:

> import Regex

> pattern = "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"
"\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

> maybeRegex = Regex.fromString pattern
Just {}

> regex = Maybe.withDefault Regex.never maybeRegex
{}

> apollo11 = """ \
|   On July 16, 1969, the massive Saturn V rocket \
|   lifted off from NASA's Kennedy Space Center at \
|   09:32 a.m. EDT. Four days later, on July 20, Neil \
|   Armstrong and Buzz Aldrin landed on the Moon. \
|   """

> Regex.contains regex apollo11
True

Let’s go through it step-by-step.

Step 1: Start by importing the Regex module.

> import Regex

Unlike the String module, Regex doesn’t come preloaded with the Elm Platform. That’s why we had to install the elm/regex package separately. For the rest of the examples in this section, we will assume that Regex is already imported.

Step 2: Define a pattern that matches the time (09:32 a.m.) we’re looking for.

> pattern = "\\d\\d:\\d\\d (a\\.m\\.|p\\.m\\.)"

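Reading the pattern from left to right shows how it matches 09:32 a.m.: the first \\d\\d matches 09, the colon matches itself, the second \\d\\d matches 32, the space matches the space, and (a\\.m\\.|p\\.m\\.) matches either a.m. or p.m., which in our text is a.m.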

\ vs \\
When we were learning the basics of regular expressions earlier, we used only one \ to either use a special shorthand or escape a dot like this: \d and \.. Why then do we need two \s in Elm code? That’s because \ has a special purpose in Elm — escape other characters. When you place it before another character, it removes the second character’s special Elm meaning and lets it be a regular character instead.

If you recall from the String section earlier, when we used a double quote inside a single-line string, we had to escape it like this so the string wouldn’t end early: "Michael Scott's Rabies Awareness \"Fun Run\" Race for the Cure".

If you want to use a literal \, it needs to be escaped just like double quotes do. Put two \\ in a row and Elm will understand that you want to use the literal \ character. If you use just one \ Elm will think you are telling it to remove the next character’s special Elm meaning.
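A quick way to convince ourselves that "\\d" really holds a single backslash is to count its characters. The sketch below uses only String.length and String.toList from Elm's core library; the value names are made up for illustration.

```elm
module BackslashExamples exposing (..)

-- What we type in Elm source: two backslashes and a d.
digitShorthand : String
digitShorthand =
    "\\d"


-- What the string actually contains: one backslash and a d, so the length is 2.
digitShorthandLength : Int
digitShorthandLength =
    String.length digitShorthand


-- The same thing spelled out character by character: [ '\\', 'd' ]
digitShorthandChars : List Char
digitShorthandChars =
    String.toList digitShorthand


-- The time pattern typed as "\\d\\d:\\d\\d" is 9 characters long, not 13.
timePatternLength : Int
timePatternLength =
    String.length "\\d\\d:\\d\\d"
```

So the regex engine receives \d\d:\d\d, exactly the pattern we learned in the basics.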

Step 3: Create a regular expression by passing pattern to the Regex.fromString function.

> maybeRegex = Regex.fromString pattern
Just {} : Maybe Regex

Not all strings are valid regex patterns. That’s why the fromString function returns a Maybe value.

Maybe
When Elm cannot guarantee a value, it returns a data structure called Maybe. If the value is present, it’s wrapped inside Just, otherwise Nothing is returned. Just and Nothing are members of the Maybe type. This simple concept is at the heart of writing incredibly robust applications in Elm. The next chapter will cover Maybe in much more detail. Until then you can think of Maybe as a container that can hold at most one value.
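To see both outcomes of Regex.fromString, here is a minimal sketch. isValidPattern is a hypothetical helper, and it uses a case expression, which this book has not covered in detail yet; read it as "if it is Just something, answer True, otherwise answer False".

```elm
module MaybeRegexExamples exposing (..)

import Regex


-- Hypothetical helper: reports whether a string is a usable regex pattern.
isValidPattern : String -> Bool
isValidPattern pattern =
    case Regex.fromString pattern of
        Just _ ->
            True

        Nothing ->
            False


-- A well-formed pattern gives us Just a regex.
timePatternIsValid : Bool
timePatternIsValid =
    isValidPattern "\\d\\d:\\d\\d"


-- The opening parenthesis is never closed, so fromString returns Nothing.
unclosedGroupIsValid : Bool
unclosedGroupIsValid =
    isValidPattern "(a\\.m\\."
```

timePatternIsValid should be True and unclosedGroupIsValid should be False.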

Step 4: Extract the regular expression inside the Maybe container using the Maybe.withDefault function.

> regex = Maybe.withDefault Regex.never maybeRegex
{}

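Maybe.withDefault takes a default value and a Maybe. If the Maybe is a Just, it returns the value inside; if it is Nothing, it returns the default instead. Our default is Regex.never, a regular expression that never matches any string.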
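Here is a small sketch of Maybe.withDefault in both situations, first with plain numbers and then with a deliberately broken pattern. The value names and sample strings are made up for illustration.

```elm
module WithDefaultExamples exposing (..)

import Regex


-- Just 5 holds a value, so we get 5 back.
unwrapped : Int
unwrapped =
    Maybe.withDefault 0 (Just 5)


-- Nothing holds no value, so we get the default, 0.
fallback : Int
fallback =
    Maybe.withDefault 0 Nothing


-- "(" is not a valid pattern, so fromString returns Nothing
-- and we end up with Regex.never.
brokenRegex : Regex.Regex
brokenRegex =
    Maybe.withDefault Regex.never (Regex.fromString "(")


-- Regex.never matches no string at all, so this is False.
brokenRegexMatches : Bool
brokenRegexMatches =
    Regex.contains brokenRegex "(anything at all)"
```

Falling back to Regex.never keeps the code running, at the cost of silently matching nothing when the pattern has a typo in it.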

Step 5: Define the string that contains the launch time of Apollo 11 (09:32 a.m.).

> apollo11 = """ \
|   On July 16, 1969, the massive Saturn V rocket \
|   lifted off from NASA's Kennedy Space Center at \
|   09:32 a.m. EDT. Four days later, on July 20, Neil \
|   Armstrong and Buzz Aldrin landed on the Moon. \
|   """

Step 6: Use the Regex.contains function to check if apollo11 has any substrings that match the pattern defined in step 2.

> Regex.contains regex apollo11
True

Note: The String module also has a function named contains which works differently than Regex.contains. This is the second time we encountered two functions with the same name from different modules. This is quite common in Elm. In addition to grouping similar functions, a module also acts as a namespace. Therefore, String.contains and Regex.contains are two completely separate functions.

Extracting a Substring

We set out to extract the substring 09:32 a.m., but all we have done so far is verify that it exists. To extract it, we will use the Regex.find function which is much more powerful than String.slice explained in the Substrings section. Regex.find and Regex.contains both take the exact same arguments:

1. A regular expression pattern that represents a substring.

2. Original string where the substring is hiding.

> launchTimes = Regex.find regex apollo11

The output is easier to read if it’s formatted like this:

[
    { index = 103
    , match = "09:32 a.m."
    , number = 1
    , submatches = [Just "a.m."] 
    }
]

Regex.find doesn’t return just the substring we’re looking for. It returns a list of records each containing information about the matches.

Record
For now, you can think of a record as a collection of key value pairs. We will cover it in detail later in the Record section.

As the output above shows, Regex.find returns four pieces of information about each match it found:

  • index: The index of the matched substring in the original string.

  • match: The substring we are looking for.

  • number: If multiple substrings are found, Regex.find labels each match with a number starting at one. The first match is labeled 1, the second is labeled 2, and so on. This number will be important when replacing all occurrences of a substring later.

  • submatches: Regex.find also looks for substrings that match any subpattern included inside the original pattern. In our case, the substring “a.m.” matches the subpattern (a\\.m\\.|p\\.m\\.). All submatches are wrapped inside Just. The examples in this section don’t take advantage of submatches, so it’s safe to ignore them for now.
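Although the examples in this section ignore submatches, a short sketch may help show what they hold. The patterns and sample strings below are made up for illustration; each pair of parentheses in a pattern produces one entry in submatches.

```elm
module SubmatchExamples exposing (..)

import Regex


-- Two groups: one for the hour and one for the minutes.
timeRegex : Regex.Regex
timeRegex =
    Maybe.withDefault Regex.never (Regex.fromString "(\\d\\d):(\\d\\d)")


-- One match, with one submatch per group: [ [ Just "09", Just "32" ] ]
timeParts : List (List (Maybe String))
timeParts =
    List.map (\m -> m.submatches) (Regex.find timeRegex "lifted off at 09:32 a.m.")


-- Two alternative groups: only one of them can take part in a match.
eitherRegex : Regex.Regex
eitherRegex =
    Maybe.withDefault Regex.never (Regex.fromString "(a)|(b)")


-- The group that did not take part shows up as Nothing: [ [ Nothing, Just "b" ] ]
eitherParts : List (List (Maybe String))
eitherParts =
    List.map (\m -> m.submatches) (Regex.find eitherRegex "b")
```

The second example shows why each submatch is a Maybe: a group is not guaranteed to have matched anything.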

We are one more step away from extracting our sneaky substring friend. We just need to figure out a way to get to that match key inside the record. We will use a function called List.map for that.

> List.map (\launchTime -> launchTime.match) launchTimes
["09:32 a.m."]

Finally, we have our substring. Well, it’s still wrapped in a list, but we will have to wait until the next section to find out how to free it from the clutches of List.

List.map
List.map creates a new list with the results of applying a provided function to every element in the original list. Just like the String.filter function we saw in the previous section, we give it an anonymous function, which in this case reads the value stored in the match key. We’ll cover it in detail in the Mapping a List section.
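Here is List.map on its own, away from regular expressions. The records below are hand-written stand-ins for what Regex.find returns (with only two of the four keys), so the values in them are illustrative rather than real output.

```elm
module ListMapExamples exposing (..)

-- Apply a function to every number: [ 2, 4, 6 ]
doubled : List Int
doubled =
    List.map (\n -> n * 2) [ 1, 2, 3 ]


-- Hand-written records shaped a little like matches.
sampleMatches : List { index : Int, match : String }
sampleMatches =
    [ { index = 103, match = "09:32 a.m." }
    , { index = 150, match = "10:56 p.m." }
    ]


-- Read the match key out of every record: [ "09:32 a.m.", "10:56 p.m." ]
matchedText : List String
matchedText =
    List.map (\launchTime -> launchTime.match) sampleMatches


-- Elm also accepts .match as a shorthand for the anonymous function above.
matchedTextShorthand : List String
matchedTextShorthand =
    List.map .match sampleMatches
```

matchedText and matchedTextShorthand should hold the same list.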

Finding Multiple Occurrences of a Substring

Let’s see an example of a string that has multiple occurrences of a substring we’re looking for.

> pattern = "quitter"
"quitter"

> maybeRegex = Regex.fromString pattern
Just {}

> regex = Maybe.withDefault Regex.never maybeRegex
{}

> string = """ \
|   I'm a great quitter. It's one of the few things \
|   I do well. I come from a long line of quitters. \
|   My father was a quitter, my grandfather was a \
|   quitter... I was raised to give up. \
|   """

> Regex.find regex string
...

Here’s the output after some formatting:

[
  { index = 16, 
    match = "quitter", 
    number = 1, 
    submatches = [] 
  },
  { index = 93, 
    match = "quitter", 
    number = 2, 
    submatches = [] 
  },
  { index = 122, 
    match = "quitter", 
    number = 3, 
    submatches = [] 
  },
  { index = 155, 
    match = "quitter", 
    number = 4, 
    submatches = [] 
  }
]

Regex.find found four occurrences of the substring "quitter" (the one inside "quitters" counts too). Notice how the value for the number key is incremented for each match.
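The index and number keys are easier to follow on a shorter string. The sketch below uses a made-up string with two occurrences, and also tries Regex.findAtMost, a sibling of Regex.find in the elm/regex package that stops after a given number of matches.

```elm
module MultipleMatchExamples exposing (..)

import Regex


quitterRegex : Regex.Regex
quitterRegex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")


-- A short stand-in string with two occurrences.
found : List Regex.Match
found =
    Regex.find quitterRegex "quitter, quitters"


-- Where each match starts, counting from zero: [ 0, 9 ]
indexes : List Int
indexes =
    List.map (\m -> m.index) found


-- The label given to each match, counting from one: [ 1, 2 ]
numbers : List Int
numbers =
    List.map (\m -> m.number) found


-- Regex.findAtMost stops after the given number of matches: [ "quitter" ]
firstOnly : List String
firstOnly =
    List.map (\m -> m.match) (Regex.findAtMost 1 quitterRegex "quitter, quitters")
```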

Replacing a Substring

Although George likes to brag about how good of a quitter he is, let’s make him a go-getter for once. We’ll replace all occurrences of the substring "quitter" with "go-getter" in the original string. To do that we will need to use the Regex.replace function.

> Regex.replace regex (\_ -> "go-getter") string
...

Here’s how the output looks after some formatting:

"I'm a great go-getter. It's one of the few things
I do well. I come from a long line of go-getters.
My father was a go-getter, my grandfather was a
go-getter... I was raised to give up."

\n characters and extra spaces have been removed to make the output look nicer. In the original output, you’ll see an extra space and \n in the beginning. That’s because when we defined the string constant with multi-line syntax, we used a space and \n to make the code look nicer.

> string = """ \
|   I'm a great quitter. It's one of the few things \
...

We could have just begun our string in the first line like this:

> string = """I'm a great quitter. It's one of the few things \
...

The Regex.replace function takes three arguments:

1. A regular expression that contains the pattern for matching the substring we want to replace.

2. A function that takes a match (a record like the ones Regex.find returns) as an argument and returns a replacement string. Notice how we used _ in place of a parameter name inside the anonymous function. Since our function simply returns a new substring ("go-getter") and doesn’t use the matched substring ("quitter") in any way, we are not interested in the argument. By using _, we are essentially ignoring the argument.

3. The original string that contains the substring we want to replace.
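When the replacement does depend on what was matched, the function can read the keys of the match it receives instead of ignoring it. The sketch below is one way to do that; the sample strings are made up, and Regex.replaceAtMost is a sibling function in elm/regex that limits how many matches get replaced.

```elm
module ReplaceExamples exposing (..)

import Regex


quitterRegex : Regex.Regex
quitterRegex =
    Maybe.withDefault Regex.never (Regex.fromString "quitter")


-- Use the matched text itself: "a QUITTER from a line of QUITTERs"
shouted : String
shouted =
    Regex.replace quitterRegex
        (\m -> String.toUpper m.match)
        "a quitter from a line of quitters"


-- Use the number key: "quitter #1, quitter #2"
numbered : String
numbered =
    Regex.replace quitterRegex
        (\m -> m.match ++ " #" ++ String.fromInt m.number)
        "quitter, quitter"


-- Replace only the first occurrence: "go-getter, quitter"
firstOnly : String
firstOnly =
    Regex.replaceAtMost 1 quitterRegex (\_ -> "go-getter") "quitter, quitter"
```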

Splitting a String

In the Splitting a String section earlier, we learned how to split a string by using the String.split function. That function is limited to splitting a string based only on a fixed separator. What if we need to split a string using a complex pattern? For example, the following string contains two instances of time: 09:32 and 10:56.

> string = """On July 16, 1969, the massive Saturn \
|   V rocket lifted off from NASA's Kennedy Space Center \
|   at 09:32 a.m. EDT. Four days later, on July 20 at \
|   10:56 p.m. EDT, Neil Armstrong and Buzz Aldrin landed \
|   on the Moon. \
|   """

Let’s say we want to split it right where the times appear into three substrings like this:

"On July 16, 1969, the massive Saturn V rocket lifted
off from NASA's Kennedy Space Center at ",

" a.m. EDT. Four days later, on July 20 at ",

" p.m. EDT, Neil Armstrong and Buzz Aldrin landed on the Moon."

Not sure why we would arbitrarily split a string like that, but if we do we will need something more flexible than String.split. As it so happens, the Regex module also provides a function called split which uses a regular expression pattern to break a string into a list of substrings. Let’s try it out.

> pattern = "\\d\\d:\\d\\d"
"\\d\\d:\\d\\d"

> maybeRegex = Regex.fromString pattern
Just {}

> regex = Maybe.withDefault Regex.never maybeRegex
{}

> Regex.split regex string
...

Here’s the output after some formatting:

[
  "On July 16, 1969, the massive Saturn \nV rocket lifted off from NASA's Kennedy Space Center \nat ",

  " a.m. EDT. Four days later, on July 20 at \n",

  " p.m. EDT, Neil Armstrong and Buzz Aldrin landed \non the Moon. \n"
]
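The same split is easier to inspect on a one-line string. The sketch below uses a made-up sentence and puts String.split next to Regex.split for comparison.

```elm
module SplitExamples exposing (..)

import Regex


timeRegex : Regex.Regex
timeRegex =
    Maybe.withDefault Regex.never (Regex.fromString "\\d\\d:\\d\\d")


-- The matched times are dropped and the text around them is kept:
-- [ "at ", " a.m. and ", " p.m." ]
aroundTheTimes : List String
aroundTheTimes =
    Regex.split timeRegex "at 09:32 a.m. and 10:56 p.m."


-- String.split can only cut at one fixed separator: [ "09", "32" ]
hourAndMinutes : List String
hourAndMinutes =
    String.split ":" "09:32"
```

Notice that the spaces next to each time stay attached to the neighboring pieces, just as in the longer output above.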

Removing Case Sensitivity

Regular expressions are case-sensitive by default. If we want to make them case-insensitive, we need to use the Regex.fromStringWith function to create a regex. Let’s try it out.

> pattern = "phoenix"
"phoenix"

> options = { caseInsensitive = True, multiline = False }
{ caseInsensitive = True, multiline = False }

> maybeRegex = Regex.fromStringWith options pattern
Just {}

> regex = Maybe.withDefault Regex.never maybeRegex
{}

> Regex.contains regex "I'm like the Phoenix, rising from Arizona."
True

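Regex.fromStringWith takes an options record with two fields. caseInsensitive controls whether letter case is ignored. multiline controls how the pattern treats line breaks in a multi-line string: it changes where the start-of-line and end-of-line anchors (^ and $) are allowed to match.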
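The sketch below compares each option switched on and off. It relies on the ^ anchor (start of a line), which this section has not introduced, so treat that part as an assumption about standard regex behavior; regexWith is a hypothetical helper, and the second sample string is made up.

```elm
module OptionExamples exposing (..)

import Regex


-- Hypothetical helper: build a regex with the given options,
-- falling back to Regex.never if the pattern is invalid.
regexWith : Regex.Options -> String -> Regex.Regex
regexWith options pattern =
    Maybe.withDefault Regex.never (Regex.fromStringWith options pattern)


phoenixText : String
phoenixText =
    "I'm like the Phoenix, rising from Arizona."


-- Case-sensitive: lowercase "phoenix" does not match "Phoenix".
sensitive : Bool
sensitive =
    Regex.contains
        (regexWith { caseInsensitive = False, multiline = False } "phoenix")
        phoenixText


-- Case-insensitive: now it does.
insensitive : Bool
insensitive =
    Regex.contains
        (regexWith { caseInsensitive = True, multiline = False } "phoenix")
        phoenixText


-- Without multiline, ^ only matches at the very start of the string.
singleLine : Bool
singleLine =
    Regex.contains
        (regexWith { caseInsensitive = False, multiline = False } "^beets")
        "bears\nbeets"


-- With multiline, ^ also matches right after each line break.
multiLine : Bool
multiLine =
    Regex.contains
        (regexWith { caseInsensitive = False, multiline = True } "^beets")
        "bears\nbeets"
```

sensitive and singleLine should be False, while insensitive and multiLine should be True.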

We covered almost everything in the Regex module, but you can still check out its full documentation to learn more.
