Using conditionals inside Ruby regular expressions

In this post, we'll dive into regex conditionals and discuss how to work around the limitations in Ruby's implementation of them

Of the many new features that Ruby 2.0 shipped back in 2013, the one I paid least attention to was the new regular expression engine, Onigmo. After all, regular expressions are regular expressions - why should I care how Ruby implements them?

As it turns out, the Onigmo regex engine has a few neat tricks up its sleeve including the ability to use conditionals inside of your regular expressions.

In this post we'll dive in to regex conditionals, learn about the quirks of Ruby's implementation of them, and discuss a few tricks to work around Ruby's limitations. Let's get started!

Groups & Capturing

To understand conditionals in regular expressions, you first need to understand grouping and capturing.

Imagine that you have a list of US cites:

Fayetteville, AR
Seattle, WA

You'd like to separate the city name from the state abbreviation. One way to do this is to perform multiple matches:

PLACE = "Fayetteville, AR"

# City: Match any char that's not a comma
PLACE.match(/[^,]+/) 
# => #<MatchData "Fayetteville">

# Separator: Match a comma and optional spaces
PLACE.match(/, */)
# => #<MatchData ", ">

# State: Match a 2-letter code at the end of the string. 
PLACE.match(/[A-Z]{2}$/) 
# => #<MatchData "AR">

This works, but it's too verbose. By using groups, you can capture both the city and state with only one regular expression.

So let's combine the regular expressions above, and surround each section with parentheses. Parens are how you group things in regular expressions.

PLACE = "Fayetteville, AR"
m = PLACE.match(/([^,]+)(, *)([A-Z]{2})/) 
# => #<MatchData "Fayetteville, AR" 1:"Fayetteville" 2:", " 3:"AR">

As you can see, the expression above captures both city and state. You access them by treating MatchData like an array:

m[1]
# => "Fayetteville"
m[3]
# => "AR"

The problem with grouping, as it's done above, is that the captured data is put into an array. If its position in the array changes, you have to update your code or you've just introduced a bug.

For example, we might decide that it's silly to capture the ", " characters. So we remove the parens around that part of the regular expression:

m = PLACE.match(/([^,]+), *([A-Z]{2})/) 
# => #<MatchData "Fayetteville, AR" 1:"Fayetteville" 2:"AR">

m[3]
# => nil

But now m[3] no longer contains the state - bug city.

Named groups

You can make regular expression groups a lot more semantic by naming them. The syntax is pretty similar to what we just used. We surround the regex in parens, and specify the name like so:

/(?<groupname>regex)/

If we apply this to our city/state regular expression, we get:

m = PLACE.match(/(?<city>[^,]+), *(?<state>[A-Z]{2})/)
# => #<MatchData "Fayetteville, AR" city:"Fayetteville" state:"AR">

And we can access the captured data by treating MatchData like a hash:

m[:city] 
# => "Fayetteville"

Conditionals

Conditionals in regular expressions take the form /(?(A)X|Y)/. Here are a few valid ways to use them:

# If A is true, then evaluate the expression X, else evaluate Y
/(?(A)X|Y)/

# If A is true, then X
/(?(A)X)/

# If A is false, then Y
/(?(A)|Y)/

Two of the most common options for your condition, A are:

  • Has a named or numbered group been captured?
  • Does a look-around evaluate to true?

Let's look at how to use them:

Has a group been captured?

To check for the presence of a group, use the ?(n) syntax, where n is an integer, or a group name surrounded by <> or ''.

# Has group number 1 been captured?
/(?(1)foo|bar)/

# Has a group named "mygroup" been captured?
/(?(<mygroup>)foo|bar)/

Example

Imagine you're parsing US telephone numbers. These numbers have a three-digit area code that is optional unless the number starts with one.

1-800-555-1212 # Valid
800-555-1212 # Valid
555-1212 # Valid

1-555-1212 # INVALID!!

We can use a conditional to make the area code a requirement only if the number starts with 1.

# This regular expression looks complex, but it's made of simple pieces
# `^(1-)?` Does the string start with "1-"? If so, capture it as group 1
# `(?(1)` Was anything captured in group one?
# `\d{3}-` if so, do a required match of three digits and a dash (the area code)
# `|(\d{3}-)?` if not, do an optional match of three digits and a dash (area code)
# `\d{3}-\d{4}` match the rest of the phone number, which is always required.

re = /^(1-)?(?(1)\d{3}-|(\d{3}-)?)\d{3}-\d{4}/

"1-800-555-1212".match(re)
#=> #<MatchData "1-800-555-1212" 1:"1-" 2:nil>

"800-555-1212".match(re)
#=> #<MatchData "800-555-1212" 1:nil 2:"800-">

"555-1212".match(re)
#=> #<MatchData "555-1212" 1:nil 2:nil>

"1-555-1212".match(re)
=> nil

Limitations

One problem with using group-based conditionals is that matching a group "consumes" those characters in the string. Those characters can't be used by the conditional, then.

For example, the following code tries and fails to match 100 if the text "USD" is present:

"100USD".match(/(USD)(?(1)\d+)/) # nil

In Perl and some other languages, you can add a look-ahead statement to your conditional. This lets you trigger the conditional based on text anywhere in the string. But Ruby doesn't have this, so we have to get a little creative.

Look-around

Fortunately, we can work around the limitations in Ruby's regex conditionals by abusing look-around expressions.

What is a look-around?

Normally, the regular expression parser steps through your string from the beginning to the end looking for matches. It's like moving the cursor from left to right in a word processor.

Look-ahead and look-behind expressions work a little differently. They let you inspect the string without consuming any characters. When they're done, the cursor is left in the exact same spot it was at the beginning.

For a great introduction to look-arounds, check out Rexegg's guide to mastering look ahead and look behind

The syntax looks like so:

Type Syntax Example
Look Ahead (?=query) \d+(?= dollars) matches 100 in "100 dollars"
Negative Look Ahead (?!query) \d+(?! dollars) matches 100 if it is NOT followed by the word "dollars"
Look Behind (?<=query) (?<=lucky )\d matches 7 in "lucky 7"
Negative Look Behind (?<!query) (?<!furious )\d matches 7 in "lucky 7"

Abusing look-arounds to enhance conditionals

In our conditional, we can only query the existence of groups that have already been set. Normally, this means that content of the group has been consumed and isn't available to the conditional.

But you can use a look-ahead to set a group without consuming any characters! Is your mind blown yet?

Remember this code that didn't work?

"100USD".match(/(USD)(?(1)\d+)/) # nil

If we modify it to capture the group in a look-ahead, it suddenly works fine:

"100USD".match(/(?=.*(USD))(?(1)\d+)/)
=> #<MatchData "100" 1:"USD">

Let's break down that query and see what's going on:

  • (?=.*(USD)) Using a look-ahead, scan the text for "USD" and capture it in group 1
  • (?(1) If group 1 exists
  • \d+ Then match one or more numbers

Pretty neat, huh?

What to do next:
  1. Try Honeybadger for FREE
    Honeybadger helps you find and fix errors before your users can even report them. Get set up in minutes and check monitoring off your to-do list.
    Start free trial
    Easy 5-minute setup — No credit card required
  2. Get the Honeybadger newsletter
    Each month we share news, best practices, and stories from the DevOps & monitoring community—exclusively for developers like you.
    author photo

    Starr Horne

    Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.

    More articles by Starr Horne
    Stop wasting time manually checking logs for errors!

    Try the only application health monitoring tool that allows you to track application errors, uptime, and cron jobs in one simple platform.

    • Know when critical errors occur, and which customers are affected.
    • Respond instantly when your systems go down.
    • Improve the health of your systems over time.
    • Fix problems before your customers can report them!

    As developers ourselves, we hated wasting time tracking down errors—so we built the system we always wanted.

    Honeybadger tracks everything you need and nothing you don't, creating one simple solution to keep your application running and error free so you can do what you do best—release new code. Try it free and see for yourself.

    Start free trial
    Simple 5-minute setup — No credit card required

    Learn more

    "We've looked at a lot of error management systems. Honeybadger is head and shoulders above the rest and somehow gets better with every new release."
    — Michael Smith, Cofounder & CTO of YvesBlue

    Honeybadger is trusted by top companies like:

    “Everyone is in love with Honeybadger ... the UI is spot on.”
    Molly Struve, Sr. Site Reliability Engineer, Netflix
    Start free trial
    Are you using Sentry, Rollbar, Bugsnag, or Airbrake for your monitoring? Honeybadger includes error tracking with a whole suite of amazing monitoring tools — all for probably less than you're paying now. Discover why so many companies are switching to Honeybadger here.
    Start free trial