From Oniguruma to POSIX: The Regex Rift Between Ruby and C

Introduction

In the world of Kafka and its applications, utilizing regular expressions for topic subscriptions is a common strategy. This approach is particularly beneficial for dynamically managing data. For example it can be used to handle information from various zones without necessitating application redeployment for each new topic.

For instance, businesses operating across multiple zones in the USA might manage topics named:

  • us01.operational_events,
  • us02.operational_events,
  • us03.operational_events


and so on.

Karafka (Ruby and Rails Apache Kafka framework) facilitates such operations with its routing patterns feature, which leverages regular expressions for topic detection.

Simplifying Topic Subscriptions with Karafka

Sponsored

Handling pattern-based topics could be done by simply iterating through them explicitly:

class KarafkaApp < Karafka::App
  setup do |config|
    # ...
  end

  routes.draw do
    5.times do |i|
      topic "us0#{i}.operational_events" do
        consumer EventsConsumer
      end
    end
  end
end

However, Karafka's support for routing patterns provides a more elegant solution, allowing for the subscription to current and future topics matching a specified pattern:

class KarafkaApp < Karafka::App
  setup do |config|
    # ...
  end

  routes.draw do
    pattern(/usdd.operational_events/) do
      consumer EventsConsumer
    end
  end
end

This approach simplifies subscription management and enhances the system's flexibility and scalability, as it does not require restarts to detect and start consuming new topics matching the used patterns.

The Regex Conundrum

The simplicity of the above Ruby code belies the complexity beneath, especially the one related to operations between Ruby and C layers. An issue arose with a Karafka user where specific topic patterns were recognized in the Web UI interface but not by the consumer server. This problem was intriguing, as the regex used appeared straightforward:

/(usdd.)?operational_events/

Karafka uses this regex in Ruby for UI and routing logic and in C within librdkafka for topics subscription. I suspected a discrepancy in regex handling between Ruby and C. My investigation confirmed that the issue was not with the regex conversion between those languages but possibly with the regex engines' compatibility.

Deep Dive into libc Regex Definitions in librdkafka

Karafka's integration with librdkafka meant diving into C code to understand the regex engine usage. Librdkafka's build system hinted at the use of an external libc regex engine unless overridden:

# src/Makefile
ifneq ($(HAVE_REGEX), y)
SRCS_y += regexp.c
endif

My further exploration revealed that librdkafka defaults to using the libc regex engine, based on the POSIX standard, contrasting with Ruby's Oniguruma engine:

Sponsored
mkl_toggle_option "Feature" ENABLE_REGEX_EXT "--enable-regex-ext" "Enable external (libc) regex (else use builtin)" "y"

Ruby's Oniguruma engine is known for its advanced features and flexibility, which accommodate a wide range of regex patterns. In contrast, libc's POSIX regex engine is more basic. This difference became apparent when comparing the behavior of similar regex patterns in Ruby and libc, as demonstrated with the grep command in Linux, which uses the libc regex engine:

/(usdd.)?operational_events/.match?('us01.operational_events')
# Matches: us01.operational_events

/(us[0-9]{2}.)?operational_events/.match?('us01.operational_events')
# Matches: us01.operational_events

vs.

echo "us01.operational_events" | grep -E '(usdd.)?operational_events'
# No match

echo "us01.operational_events" | grep -E '(us[0-9]{2}.)?operational_events'
# Matches: us01.operational_events

Ruby correctly matched both versions of this regular expression, but grep could only detect one.

Addressing the Discrepancy for Karafka Users

Since librdkafka provides a compilation flag to replace the regexp engine, I could have just changed it. However, I decided against forcing all users to switch to the built-in regex implementation of librdkafka to avoid breaking changes. On top of that, while the engine change in librdkafka could make it's and Ruby regex match similarly, it doesn't mean there are no other edge cases. This is why, instead of doing this, I emphasized documenting this behavior, guiding users on how to create compatible regex patterns, and offering a testing methodology using both Ruby and POSIX regex standards:

def ruby_posix_regexp_same?(test_string, ruby_regex)
  posix_regex = ruby_regex.source
  ruby_match = !!(test_string =~ ruby_regex)
  grep_command = "echo '#{test_string}' | grep -E '#{posix_regex}' > /dev/null"
  posix_match = system(grep_command)
  comparison_result = ruby_match == posix_match

  puts "Ruby match: #{ruby_match}, POSIX match: #{posix_match}, Comparison: #{comparison_result}"
  comparison_result
end

This method facilitates easy verification of regex pattern compatibility across both engines:

ruby_posix_regexp_same?('test12.production', /dd/)
# Ruby match: true
# POSIX match: false
# Comparison: false

ruby_posix_regexp_same?('test12.production', /[0-9]{2}/)
# Ruby match: true
# POSIX match: true
# Comparison: true

ruby_posix_regexp_same?('test12.production', /[0-9]{10}/)
# Ruby match: false
# POSIX match: false
# Comparison: true

Conclusion

This issue reminded me that something beneath our code might work differently than we think. I realized that different software parts can have their own rules, like handling regular expressions that do not match our assumptions.

It's a valuable lesson. When testing software, it's essential to consider all possible scenarios, even the crazy ones. Before this, I might not have thought to check how different systems interpret the same regex patterns. Now, I know better. It's all about expecting the unexpected and ensuring our tests cover as much ground as possible.

And it isn't just about regex. It's a reminder always to dig deeper and question our assumptions. I'll throw even the wildest ideas into my testing mix. It's best to catch those sneaky differences before they catch us.

The post From Oniguruma to POSIX: The Regex Rift Between Ruby and C first appeared on Closer to Code.

Ubuntu Server Admin

Recent Posts

How we used Flask and 12-factor charms to simplify Canonical.com development

Our latest Canonical website rebrand did not just bring the new Vanilla-based frontend, it also…

55 minutes ago

Web Engineering: Hack Week 2024

At Canonical, the work of our teams is strongly embedded in the open source principles…

1 day ago

Ubuntu Weekly Newsletter Issue 873

Welcome to the Ubuntu Weekly Newsletter, Issue 873 for the week of December 29, 2024…

3 days ago

How to resolve WiFi Issues on Ubuntu 24.04

Have WiFi troubles on your Ubuntu 24.04 system? Don’t worry, you’re not alone. WiFi problems…

3 days ago

Remembering and thanking Steve Langasek

The following is a post from Mark Shuttleworth on the Ubuntu Discourse instance. For more…

3 days ago

How to Change Your Prompt in Bash Shell in Ubuntu

I don’t like my prompt, i want to change it. it has my username and…

3 days ago