Testing Ruby's Unicode Support

Among the new features shipped with Ruby 2.4 is improved Unicode support. Specifically, methods like upcase and downcase now work as expected, turning "ä" into "Ä" and back. This made me curious: what other Unicode improvements have been made since 2013, when I read André Arko's blog post "Strings in Ruby are UTF-8 now… right?"

I tested all of Ruby's string methods, not looking for technical errors but for violations of the "principle of least surprise." Specifically, my assumptions were that:

Unique characters are unique: "e" and "ë" are different, just like "e" and "E" are.
Single characters count as single characters, no matter how they're represented in Unicode. This means that "e" and "ë" are each a single character, even when the latter is represented by two code points ("e" followed by a combining diaeresis).
Characters are immutable. Reversing a string of characters shouldn't alter the individual characters.
Whitespace is treated as whitespace, even those tricky Unicode whitespace characters.
Digits are treated as digits. The number 2 is always the number 2 no matter how it's written.
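
These assumptions can be written as small checks. The sketch below is only an illustration, not the test suite used for this article: the method name `assumption_checks` is hypothetical, and U+2003 (EM SPACE) is an assumed stand-in for "tricky" Unicode whitespace. Each check is true when Ruby behaves the way the assumption expects.

```ruby
# Hypothetical helper: evaluates one small check per assumption.
def assumption_checks
  decomposed_e = "e\u0308"  # "e" + COMBINING DIAERESIS, renders as "ë"
  em_space     = "\u2003"   # assumed example of a Unicode whitespace character
  arabic_one   = "\u0661"   # ARABIC-INDIC DIGIT ONE

  {
    "unique characters are unique"      => !decomposed_e.include?("e"),
    "one character counts as one"       => decomposed_e.length == 1,
    "reversing keeps characters intact" => "n#{decomposed_e}".reverse == "#{decomposed_e}n",
    "Unicode whitespace is whitespace"  => "line#{em_space}".rstrip == "line",
    "digits are digits"                 => arabic_one.to_i == 1
  }
end

assumption_checks.each do |name, holds|
  puts "#{name}: #{holds ? 'OK' : 'Beware!'}"
end
```

With a decomposed string like this one, every check reports "Beware!".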

Unfortunately, most of Ruby's string manipulation methods fail these tests. If you're working with Unicode strings, you therefore have to be extremely careful which ones you use.

NOTE: After publication, some readers pointed out that many of the failures I mentioned wouldn't have happened if I had normalized the Unicode test strings. This is true. However, strings aren't automatically normalized by Ruby or Rails (in any of the apps I tested). These tests were always meant to illustrate the worst case, and I think they're still useful in that regard.
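
For reference, here is a minimal sketch of what normalizing would look like, using String#unicode_normalize (available since Ruby 2.2). The helper name `normalize_input` is hypothetical; where an application would call it (for example, when accepting user input) is an assumption, not something Ruby or Rails does for you.

```ruby
# Hypothetical helper: compose characters into NFC form where Unicode
# defines a precomposed code point.
def normalize_input(str)
  str.unicode_normalize(:nfc)
end

decomposed = "a\u0308"                  # "a" + COMBINING DIAERESIS
composed   = normalize_input(decomposed)

decomposed.length        # => 2
composed.length          # => 1
composed.codepoints      # => [228]
composed.include?("a")   # => false

# Normalization is not a cure-all: "b" + diaeresis has no precomposed
# form, so it is still two code points after NFC.
normalize_input("b\u0308").length  # => 2
```

Normalization removes many of the surprises in the table, but sequences without a precomposed form still behave like the worst case.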

Unicode tests with Ruby 2.4.0

Method	Test	Expected	Result	Verdict
#%	`"%s" % "noël"`	`"noël"`	`"noël"`	OK
#*	`"noël" * 2`	`"noëlnoël"`	`"noëlnoël"`	OK
#<<	`"noël" << "ë"`	`"noëlë"`	`"noëlë"`	OK
#<=>	`"ä" <=> "z"`	`-1`	`-1`	OK
#==	`"ä" == "ä"`	`true`	`true`	OK
#=~	`"ä" =~ /a./`	`nil`	`0`	Beware!
#[]	`"ä"[0]`	`"ä"`	`"a"`	Beware!
#[]=	`"ä"[0] = "u"`	`"u"`	`"u"`	OK
#b	`"ä".b.encoding.to_s`	`"ASCII-8BIT"`	`"ASCII-8BIT"`	OK
#bytes	`"ä".bytes`	`[97, 204, 136]`	`[97, 204, 136]`	OK
#bytesize	`"ä".bytesize`	`3`	`3`	OK
#byteslice	`"ä".byteslice(1)`	`"\xCC"`	`"\xCC"`	OK
#capitalize	`"ä".capitalize`	`"Ä"`	`"Ä"`	OK
#casecmp	`"äa".casecmp("äz")`	`-1`	`-1`	OK
#center	`"ä".center(3)`	`" ä "`	`"ä "`	Beware!
#chars	`"ä".chars`	`["ä"]`	`["a", "̈"]`	Beware!
#chomp	`"ä ".chomp`	`"ä"`	`"ä"`	OK
#chop	`"ä".chop`	`""`	`"a"`	Beware!
#chr	`"ä".chr`	`"ä"`	`"a"`	Beware!
#clear	`"ä".clear`	`""`	`""`	OK
#codepoints	`"ä".codepoints`	`[97, 776]`	`[97, 776]`	OK
#concat	`"ä".concat("x")`	`"äx"`	`"äx"`	OK
#count	`"ä".count("a")`	`0`	`1`	Beware!
#crypt	`"123".crypt("ää") == "123".crypt("aa")`	`false`	`false`	OK
#delete	`"ä".delete("a")`	`"ä"`	`"̈"`	Beware!
#downcase	`"Ä".downcase`	`"ä"`	`"ä"`	OK
#dump	`"ä".dump`	`"\"a\\u0308\""`	`"\"a\\u0308\""`	OK
#each_byte	`"ä".each_byte.to_a`	`[97, 204, 136]`	`[97, 204, 136]`	OK
#each_char	`"ä".each_char.to_a`	`["ä"]`	`["a", "̈"]`	Beware!
#each_codepoint	`"ä".each_codepoint.to_a`	`[97, 776]`	`[97, 776]`	OK
#each_line	`"ä".each_line.to_a`	`["ä"]`	`["ä"]`	OK
#empty?	`"ä".empty?`	`false`	`false`	OK
#encode	`"ä".encode("ASCII", undef: :replace)`	`"a?"`	`"a?"`	OK
#encoding	`"ä".encoding.to_s`	`"UTF-8"`	`"UTF-8"`	OK
#end_with?	`"ä".end_with?("ä")`	`true`	`true`	OK
#eql?	`"ä".eql?("a")`	`false`	`false`	OK
#force_encoding	`"ä".force_encoding("ASCII")`	`"a\xCC\x88"`	`"a\xCC\x88"`	OK
#getbyte	`"ä".getbyte(2)`	`136`	`136`	OK
#gsub	`"ä".gsub("a", "x")`	`"ä"`	`"ẍ"`	Beware!
#hash	`"ä".hash == "a".hash`	`false`	`false`	OK
#include?	`"ä".include?("a")`	`false`	`true`	Beware!
#index	`"ä".index("a")`	`nil`	`0`	Beware!
#replace	`"ä".replace("u")`	`"u"`	`"u"`	OK
#insert	`"ä".insert(1, "u")`	`"äu"`	`"aü"`	Beware!
#inspect	`"ä".inspect`	`"\"ä\""`	`"\"ä\""`	OK
#intern	`"ä".intern`	`:ä`	`:ä`	OK
#length	`"ä".length`	`1`	`2`	Beware!
#ljust	`"ä".ljust(3, "_")`	`"ä__"`	`"ä_"`	Beware!
#lstrip	`" ä".lstrip`	`"ä"`	`"ä"`	OK
#match	`"ä".match("a")`	`nil`	`#<MatchData "a">`	Beware!
#next	`"ä".next`	`"ä"`	`"b̈"`	Beware!
#ord	`"ä".ord`	`97`	`97`	OK
#partition	`"händ".partition("a")`	`["händ"]`	`["h", "a", "̈nd"]`	Beware!
#prepend	`"ä".prepend("ä")`	`"ää"`	`"ää"`	OK
#replace	`"ä".replace("ẍ")`	`"ẍ"`	`"ẍ"`	OK
#reverse	`"händ".reverse`	`"dnäh"`	`"dn̈ah"`	Beware!
#rpartition	`"händ".rpartition("a")`	`["händ"]`	`["h", "a", "̈nd"]`	Beware!
#rstrip	`"line ".rstrip`	`"line"`	`"line "`	Beware!
#scrub	`"ä".scrub`	`"ä"`	`"ä"`	OK
#setbyte	`s = "ä"; s.setbyte(0, "x".ord); s`	`"ẍ"`	`"ẍ"`	OK
#size	`"ä".size`	`1`	`2`	Beware!
#slice	`"ä".slice(0)`	`"ä"`	`"a"`	Beware!
#split	`"ä".split("a")`	`["ä"]`	`["", "̈"]`	Beware!
#squeeze	`"ää".squeeze("ä")`	`"ä"`	`"ää"`	Beware!
#start_with?	`"ä".start_with?("a")`	`false`	`true`	Beware!
#strip	`" line ".strip`	`"line"`	`" line "`	Beware!
#sub	`"ä".sub("a", "x")`	`"ä"`	`"ẍ"`	Beware!
#succ	`"ä".succ`	`"b̈"`	`"b̈"`	OK
#swapcase	`"ä".swapcase`	`"Ä"`	`"Ä"`	OK
#to_c	`"١".to_c`	`(1+0i)`	`(0+0i)`	Beware!
#to_f	`"١".to_f`	`1.0`	`0.0`	Beware!
#to_i	`"١".to_i`	`1`	`0`	Beware!
#to_r	`"١".to_r`	`(1/1)`	`(0/1)`	Beware!
#to_sym	`"ä".to_sym`	`:ä`	`:ä`	OK
#tr	`"ä".tr("a", "b")`	`"ä"`	`"b̈"`	Beware!
#unpack	`"ä".unpack("CCC")`	`[97, 204, 136]`	`[97, 204, 136]`	OK
#upto	`"ä".upto("c̈").to_a`	`["ä", "b̈", "c̈"]`	`["ä", "b̈", "c̈"]`	OK
#valid_encoding?	`"ä".valid_encoding?`	`true`	`true`	OK
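
For the methods marked "Beware!", one possible workaround is to operate on grapheme clusters rather than code points, and to use Unicode-aware character classes. The following is a rough sketch under those assumptions, not a drop-in replacement for the String API; all helper names are hypothetical, and the digit helper only handles Arabic-Indic digits (U+0660 to U+0669).

```ruby
# Split on extended grapheme clusters (\X) instead of code points.
def graphemes(str)
  str.scan(/\X/)
end

# Count user-perceived characters.
def grapheme_length(str)
  graphemes(str).length
end

# Reverse without separating combining marks from their base letters.
def grapheme_reverse(str)
  graphemes(str).reverse.join
end

# Strip leading and trailing whitespace using the Unicode-aware
# POSIX bracket class instead of String#strip.
def unicode_strip(str)
  str.gsub(/\A[[:space:]]+|[[:space:]]+\z/, "")
end

# Map Arabic-Indic digits to ASCII digits before converting.
def arabic_indic_to_i(str)
  str.tr("\u0660-\u0669", "0-9").to_i
end

grapheme_length("a\u0308")             # => 1
grapheme_reverse("ha\u0308nd")         # => "dna\u0308h" (renders as "dnäh")
unicode_strip("\u2003line\u2003")      # => "line"
arabic_indic_to_i("\u0661")            # => 1
```

Unlike normalization, the grapheme-based helpers also cope with sequences such as "b" plus a combining diaeresis, which have no precomposed form.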

Starr Horne

Starr Horne is a Rubyist and Chief JavaScripter at Honeybadger.io. When she's not neck-deep in other people's bugs, she enjoys making furniture with traditional hand-tools, reading history and brewing beer in her garage in Seattle.
