Confession time: I don't particularly like working with regular expressions. While I use them all the time, anything more complex than a /^foo.*$/ requires me to stop and think. I'm sure there are people who can decipher expressions like \A(?=\w{6,10}\z)(?=[^a-z]*[a-z])(?=(?:[^A-Z]*[A-Z]){3}) at a glance, but doing so takes me several minutes of googling and makes me unhappy. It's quite a difference from reading Ruby.

If you're curious, the example above is taken from this article on regex lookaheads.

The Situation

At Honeybadger I'm currently working on improving our search UI. Like many search systems, ours uses a simple query language. Before my changes, if you wanted to search for a custom date range, you had to manually type in a query like so:

occurred:[2017-06-12T16:10:00Z TO 2017-06-12T17:10:00Z]

Ouch!

In the new search UI, we want to detect when you start typing a date-related query and pop up a helpful datepicker. And of course, the datepicker is just the beginning. Eventually we'll expand the context-sensitive hinting to cover more kinds of search terms. Here are a few examples:

assigned:jane@email.com context.user.id=100
resolved:false ignored:false occurred:[
params.article.title:"Starr's parser post"       foo:'ba

I need to tokenize these strings in such a way that:

  • Whitespace separates tokens, except when surrounded by '', "" or []
  • Unquoted whitespace is its own token
  • I can run tokens.join("") to exactly recreate the input string

For example:

tokenize(%[params.article.title:"Starr's parser post"       foo:'ba])
=> ["params.article.title:\"Starr's parser post\"", "       ", "foo:'ba"]

Using A Regular Expression

My first thought was to use a capturing regular expression to define what a valid token should look like, then use String#split to split the string into tokens. It's a pretty cool trick, actually:

# The parens in the regexp mean that the separator is added to the array
"foo  bar  baz".split(/(foo|bar|baz)/)
=> ["", "foo", "  ", "bar", "  ", "baz"]
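Those empty strings come from split emitting whatever precedes the first separator (here, nothing). If they ever get in the way, they're easy to filter out:

```ruby
# split emits the (empty) text before the first separator;
# reject drops those empty strings while keeping the separators.
"foo  bar  baz".split(/(foo|bar|baz)/).reject(&:empty?)
# => ["foo", "  ", "bar", "  ", "baz"]
```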

This looked promising initially, despite the weird empty-strings. But my real-world regular expression was much more complex. My first draft looked like this:

/
  (                          # Capture group is so split will include matching and non-matching strings
    (?:                      # The first character of the key, which is
      (?!\s)[^:\s"'\[]{1}    # ..any valid "key" char not preceded by whitespace
      |^[^:\s"'\[]{0,1}      # ..or any valid "key" char at beginning of line
    )
    [^:\s"'\[]*              # The rest of the "key" chars
    :                        # a colon
    (?:                      # The "value" chars, which are
      '[^']+'                # ..anything surrounded by single quotes
      | "[^"]+"              # ..or anything surrounded by double quotes
      | \[\S+\sTO\s\S+\]     # ..or anything like [x TO y]
      | [^\s"'\[]+           # ..or any string not containing whitespace or special chars
    )
  )
/xi 

Working with this gave me a sinking feeling. Every time I found an edge case I'd have to amend the regular expression, making it even more complex. In addition, it needed to work in JavaScript as well as Ruby, so certain features, like negative lookbehind, weren't available.

...It was about this time that the absurdity of it all struck me. The regular expression I was wrestling with was more complicated than a simple parser written from scratch would be.

Anatomy of a Parser

I'm no expert, but simple parsers are simple. All they do is:

  • Step through a string, character by character
  • Append each character to a buffer
  • When a token-separating condition is encountered, save the buffer to an array and empty it

Knowing this, we can set up a simple parser that splits strings by whitespace. It's roughly equivalent to "foo bar".split(/(\s+)/).

class Parser

  WHITESPACE = /\s/
  NON_WHITESPACE = /\S/

  def initialize
    @buffer = []
    @output = []
  end

  def parse(text) 
    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE)
        @buffer << c
      else
        flush if previous.match(WHITESPACE)
        @buffer << c
      end
    end

    flush
    @output
  end

  protected

  def flush
    if @buffer.any?
      @output << @buffer.join("")
      @buffer = []
    end
  end

  def previous
    @buffer.last || ""
  end

end


puts Parser.new.parse("foo bar baz").inspect

# Outputs ["foo", " ", "bar", " ", "baz"]

This is a step in the direction of what I want, but it's missing support for quotes and brackets. Fortunately, adding that only takes a few lines of code:

  def parse(text) 

    surround = nil

    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE) && !surround
        @buffer << c
      when '"', "'"
        @buffer << c
        if !surround
          surround = c
        elsif surround == c
          flush
          surround = nil
        end
      when "["
        @buffer << c
        surround = c if !surround
      when "]"
        @buffer << c
        if surround == "["
          flush
          surround = nil
        end
      else
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
      end
    end

    flush
    @output
  end

This code is only a bit longer than my regular-expression-based approach, but it's far more straightforward.
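Putting the pieces together, here's the full class with the extended parse method, checked against the tokenize example from earlier, including the requirement that tokens.join("") exactly recreates the input:

```ruby
class Parser

  WHITESPACE = /\s/
  NON_WHITESPACE = /\S/

  def initialize
    @buffer = []
    @output = []
  end

  def parse(text)
    surround = nil # tracks which quote/bracket we're inside, if any

    text.each_char do |c|
      case c
      when WHITESPACE
        flush if previous.match(NON_WHITESPACE) && !surround
        @buffer << c
      when '"', "'"
        @buffer << c
        if !surround
          surround = c
        elsif surround == c
          flush
          surround = nil
        end
      when "["
        @buffer << c
        surround = c if !surround
      when "]"
        @buffer << c
        if surround == "["
          flush
          surround = nil
        end
      else
        flush if previous.match(WHITESPACE) && !surround
        @buffer << c
      end
    end

    flush
    @output
  end

  protected

  def flush
    if @buffer.any?
      @output << @buffer.join("")
      @buffer = []
    end
  end

  def previous
    @buffer.last || ""
  end

end

input  = %[params.article.title:"Starr's parser post"       foo:'ba]
tokens = Parser.new.parse(input)

puts tokens.inspect
# => ["params.article.title:\"Starr's parser post\"", "       ", "foo:'ba"]

puts tokens.join("") == input
# => true
```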

Parting thoughts

There's probably a regular expression out there that would work fine for my use case. If history is any guide, it's probably simple enough to make me look like a fool. :)

But I really enjoyed the chance to write this little parser. It broke me out of the rut I was in with the regex approach. As a nice bonus, I'm a lot more confident in the resulting code than I ever am with code that is based around complicated regular expressions.