Posts about ruby

Packing bits with Rust & Ruby

The missing C of CHD

One element of the CHD (compress-hash-displace) algorithm that I didn't implement in my previous post was the "compress" part.

This algorithm generates an auxiliary table of seeds that are used to prevent hash collisions in the data set. These seeds need to be encoded somehow and transmitted along with the rest of the data in order to perform lookups later on. The number of seeds (called r in the algorithm) is usually proportional to the number of elements in the input. A larger r makes it easier to compute seeds that avoid collisions, and therefore faster to compute the perfect hash. Reducing r results in a more compact data structure at the expense of more compute up-front.

Packing seeds

Seeds are generally tried starting from 0, and typically don't end up being very large. Encoding these values as a basic array of 8/16/32-bit integers is a waste of space.

[figure: seeds stored as fixed-width integers; lots of zeros]

I wanted to improve on my implementation of efficient encoding of hashes by doing some simple bit packing of the seeds.

The basic idea: for a set of integers, find the maximum value, and therefore the maximum number of bits (b) needed to represent that value. We can then encode every integer using b bits instead of a full 8, 16, or 32 bits.

[figure: the same seeds packed at b bits each; fewer zeros]
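To make the idea concrete, here's a minimal sketch in plain Ruby (illustrative only; the crate below does the same job far more efficiently):

values = 128.times.map { |i| i % 8 }
b = values.max.bit_length                    # bits needed for the largest value (3 here)
bits = values.map { |v| v.to_s(2).rjust(b, "0") }.join
packed = [bits].pack("B*")                   # pack the bit string into bytes
packed.bytesize                              # => 48, down from 512 as 32-bit integers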

There's a Rust crate bitpacking that does exactly this! And it runs super duper fast, assuming that you can arrange your data into groups of 32/128/256 integers. The API is really simple to use as well:

use bitpacking::{BitPacker, BitPacker4x};

fn main() {
    let data: Vec<u32> = (0..128).map(|i| i % 8).collect();
    let packer = BitPacker4x::new();
    let num_bits = packer.num_bits(&data);
    let mut compressed = vec![0u8; 4 * BitPacker4x::BLOCK_LEN];
    let len = packer.compress(&data, compressed.as_mut_slice(), num_bits);
    compressed.truncate(len);

    println!("Compressed data: {:?}", compressed);
}

Bridging the gap between Rust & Ruby

I wanted to use this from Ruby code though...time to bust out magnus!

Magnus is a crate which makes it really easy to write Ruby extensions using Rust. It takes care of most of the heavy lifting of converting to/from Ruby & Rust types.

use bitpacking::BitPacker;
use magnus::{RString, Ruby};

#[magnus::wrap(class = "BitPacking::BitPacker4x")]
struct BitPacker4x(bitpacking::BitPacker4x);

impl BitPacker4x {
    const BLOCK_LEN: usize = bitpacking::BitPacker4x::BLOCK_LEN;

    // ...

    fn compress(
        ruby: &Ruby,
        rb_self: &Self,
        decompressed: Vec<u32>,
        num_bits: u8,
    ) -> RString {
        let mut compressed = vec![0u8; 4 * Self::BLOCK_LEN];
        let len = rb_self
            .0 // the underlying bitpacking::BitPacker4x struct
            .compress(&decompressed, compressed.as_mut_slice(), num_bits);
        compressed.truncate(len);
        ruby.str_from_slice(compressed.as_slice())
    }
}

This lets me write Ruby code like this:

data = 128.times.map { |i| i % 8 }
packer = BitPacking::BitPacker4x.new
num_bits = packer.num_bits(data)
compressed = packer.compress(data, num_bits)

Here we have these 128 integers represented in 48 bytes, or 3 bits per integer.

BitPacking gem

I've packaged this up into the bitpacking gem.

I hope you find this useful!

Efficient Hash Lookups: Adventures with Benchmarks!

Efficient data retrieval is crucial for high-performance applications. Whether you're configuring a complex system or handling large datasets, the speed of hash lookups can significantly impact performance. In this post, we'll explore various methods to optimize hash lookups and benchmark their performance.

My benchmarking code is available here.

The Challenge

Let's say that you have a hash (aka dictionary aka map aka associative array; something with key/value pairs) representing configuration for your system.

At runtime, let's assume that your application only needs access to a small number of these configuration items to handle a given request.

Let's also say that this hash can get pretty large; some users have very complex configurations.

{
  "setting_1":     "value 1",
  "setting_2":     "value 2",
  /* ... */
  "setting_10000": "value 10000"
}

A common way of storing this configuration is in a JSON file. JSON is really simple to use for this type of problem. Just use something like JSON.parse to parse your file, and do a lookup for whatever key you want:

config_data = File.read("config.json")
config = JSON.parse(config_data)
puts config["setting_500"]
# => "value 500"

Awesome, what's the problem?

Assuming that we just want to fetch a small number of items from this configuration, there are a few problems here:

  1. We have to load the entire file into memory
  2. We then have to parse the entire file to get our hash object. This involves inserting every item of the configuration into a new hash object, which is an O(n) operation.

If our configuration is quite large, then the parsing & hash construction time will be far greater than our lookup time.

$ ruby benchmark.rb -f json -s hash
Calculating -------------------------------------
                load    829.185 (±30.0%) i/s    (1.21 ms/i)
               parse     24.779 (± 8.1%) i/s   (40.36 ms/i)
             get_key     4.242M (± 3.7%) i/s  (235.72 ns/i)
  load/parse/get_key     22.561 (± 8.9%) i/s   (44.32 ms/i)

(I've removed some superfluous information from the output snippets here in the interests of clarity; the number I'm paying attention to is the time per iteration represented in the last column.)

Lookups into the hash, once created, are very fast: about 240ns per lookup.

However, for 100,000 entries, just parsing the data and creating the hash takes about 40ms on my system.

In fact, the hash construction time is the fundamental issue to overcome when trying to speed up our overall time here. Simply creating a hash from the data takes about 26ms on my system:

$ ruby benchmark-to_h.rb < output/array.json
Calculating -------------------------------------
                to_h     37.764 (±18.5%) i/s   (26.48 ms/i)

It makes sense when you think about it. Before we can get our hash object containing our configuration, we need to iterate through each key/value pair and insert it into the hash object. Only then can we do a lookup for the key we're interested in.
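For reference, the measured step in benchmark-to_h.rb presumably boils down to something like this (a sketch; it reads an array of key/value pairs on stdin):

require "json"

pairs = JSON.parse($stdin.read)  # an array of [key, value] pairs
config = pairs.to_h              # the O(n) hash construction being measured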

How can we make this better?

A bunch of different ideas come to mind:

We can try different ways to encode the data. For example: using MessagePack, protobufs, or capnproto. Each of these uses a more efficient way to serialize/de-serialize the data, speeding up the parsing time.

Another idea is to encode our data as a sorted list of key/value pairs instead of as a hash, and use a binary search to look up the key that we want.

Or we can try perfect hashes: encode our data as a list of key/value pairs as above, but augmented with additional information that makes it possible to treat the array as a pre-computed hash table.

Other ideas: SQLite, or DBM/LMDB. These benefit from encoding the index within the file itself, so we don't need to reconstruct a hash table/index when loading the data. Storing more structured data in these is more challenging though, and lookups in these types of data stores are typically O(log n) instead of O(1).
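As a taste of that last idea, here's what a lookup against a SQLite file could look like, assuming a hypothetical config(key, value) table (not part of the benchmarks below):

require "sqlite3"

db = SQLite3::Database.new("config.db", readonly: true)
value = db.get_first_value("SELECT value FROM config WHERE key = ?", "setting_500")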

MessagePack

MessagePack is an easy thing to try first since it's almost a drop-in replacement for JSON in our use case.
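The read path is nearly identical to the JSON version (assuming the config file was serialized with MessagePack.pack):

require "msgpack"

config_data = File.binread("config.msgpack")
config = MessagePack.unpack(config_data)
config["setting_500"]
# => "value 500"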

$ ruby benchmark.rb -f msgpack -s hash
Calculating -------------------------------------
                load    679.466 (± 5.3%) i/s    (1.47 ms/i)
               parse     27.958 (± 3.6%) i/s   (35.77 ms/i)
             get_key     4.692M (± 3.4%) i/s  (213.13 ns/i)
  load/parse/get_key     27.031 (± 3.7%) i/s   (36.99 ms/i)

Well, we've improved the parsing time by about 5ms, which is a nice improvement, but still pretty slow overall. Again, we have that large hash construction time to overcome, which MessagePack doesn't help us with.

Protobufs

To experiment with Protobufs, I've created a simple definition using the native map type:

syntax = "proto3";

message PBHash {
  map<string, string> data = 1;
}
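On the Ruby side, decoding and lookup go through the class generated by protoc (the require name below is an assumption about the generated file):

require "pb_hash_pb"  # hypothetical name of the protoc-generated file

config = PBHash.decode(File.binread("config.pb"))
config.data["setting_500"]
# => "value 500"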

The results here are surprising:

$ ruby benchmark.rb -f pb -s hash
Calculating -------------------------------------
                load      1.311k (±35.9%) i/s  (762.93 μs/i)
               parse      21.496 (± 4.7%) i/s   (46.52 ms/i) # !!
             get_key      3.098M (± 3.2%) i/s  (322.75 ns/i)
  load/parse/get_key      20.707 (± 4.8%) i/s   (48.29 ms/i)

It looks like parsing a protobuf map is significantly slower than parsing the equivalent JSON or MessagePack object.

Encode as an array

Let's see what happens if we encode our data as an array.

This may improve parsing time, since we don't need to construct a large hash object in memory. The trade-off is that lookup times become O(log n) instead of O(1), assuming we can sort our input. As a first approximation, we can compare times to load & parse the array.
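A sketch of that lookup, assuming the pairs are sorted by key:

pairs = data.sort_by(&:first)  # data is an array of [key, value] pairs
hit = pairs.bsearch { |(k, _)| k >= "setting_500" }
value = hit && hit.first == "setting_500" ? hit.last : nil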

Using JSON I see marginal improvement to load & parse the data (~30ms), but using Protobufs we can load & parse the data in about 6ms. Unfortunately I couldn't get good measurements for the full load/parse/get_key with protobufs; it seems like doing a binary search for elements in the protobuf arrays is relatively slow in Ruby. Fetching individual items is fast.

$ ruby -Iproto benchmark.rb -f pb -s array
Calculating -------------------------------------
                load     1.641k (±10.7%) i/s  (609.32 μs/i)
               parse    163.886 (± 3.1%) i/s    (6.10 ms/i)
             get_key     27.325 (±11.0%) i/s   (36.60 ms/i)
  load/parse/get_key      3.000 (±33.3%) i/s  (333.32 ms/i) # ???

Perfect hashes

Maybe we can get the best of both worlds by using a perfect hash to order our data in a particular way to make it cheap to find?

There are several ways of constructing a perfect hash; the CHD algorithm is fairly easy to implement (without the "compress" part of compress-hash-displace), and can compute a minimal perfect hash function in O(n) time.

The way it works is that we compute a secondary array of seeds that are used to prevent collisions between keys in the data set. To find the index of an element by its key, we compute:

# Find index for `key`
seed = seeds[hashfunc(seed: 0, data: key) % seeds.size]
index = hashfunc(seed: seed, data: key) % data.size
value = data[index]
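As a runnable toy version (the hashfunc here is an illustrative stand-in; the real benchmarks below use cityhash):

def hashfunc(seed:, data:)
  data.each_byte.reduce(seed) { |h, b| (h * 31 + b) & 0xffff_ffff }
end

def chd_lookup(seeds, data, key)
  seed = seeds[hashfunc(seed: 0, data: key) % seeds.size]
  index = hashfunc(seed: seed, data: key) % data.size
  data[index]  # a [key, value] pair
end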

The results of using CHD with Protobufs are very good (using cityhash as the hashing function):

$ ruby benchmark.rb -f pb -s chd
Calculating -------------------------------------
                load     1.920k (± 7.0%) i/s  (520.85 μs/i)
               parse    173.076 (± 4.0%) i/s    (5.78 ms/i)
             get_key   491.510k (± 1.9%) i/s    (2.03 μs/i)
  load/parse/get_key    154.828 (± 1.9%) i/s    (6.46 ms/i)

Summary

Wow, that was a lot! What did we learn?

Benchmarks are hard! I wouldn't take any results here as absolute proof that X is better or worse than Y. Your data and use cases are most likely very different.

Using a perfect hash, or a binary search over an array of key/value pairs, looks much faster in general than simple lookups into a hash object. The time to construct the hash object plays such a large part in the overall timings.

I'm surprised at the performance of protobufs in general. They're much slower than I expected in everything except the perfect hash (CHD) case. It makes me wonder if something is wrong with my implementation, or with Ruby's protobuf gem.

$ ruby benchmark-final.rb
Comparison:
load/parse/get_key pb chd:        124.3 i/s
load/parse/get_key msgpack array: 50.2 i/s - 2.48x  slower
load/parse/get_key msgpack chd:   41.1 i/s - 3.02x  slower
load/parse/get_key json array:    37.3 i/s - 3.33x  slower
load/parse/get_key json chd:      27.9 i/s - 4.46x  slower
load/parse/get_key msgpack hash:  23.8 i/s - 5.22x  slower
load/parse/get_key json hash:     20.7 i/s - 6.00x  slower
load/parse/get_key pb hash:       19.4 i/s - 6.40x  slower
load/parse/get_key pb array:      3.1 i/s - 40.07x  slower

To follow up, I'd like to explore using capnproto, and running similar benchmarks in other languages to see if the protobuf performance improves.

Learning Ruby as an experienced Python developer

As I mentioned in my previous post, in my new role at Shopify, I've been doing backend development in Ruby. Previously I had been working nearly exclusively with Python for over ten years, so I was a bit nervous about the move.

In addition to my regular work, I also tried to solve other types of problems using Ruby. Advent of code was a really great way to learn a new language.

After nearly two years in the new role, I'd like to share some of my experiences, hopefully as a way to help and encourage others who would like to learn a new language, or are worried about moving into a new role, but don't know some specific technology.

Early pains

The first few weeks with Ruby were pretty tough. Luckily, Ruby shares some similarities with Python that make it a bit more approachable at first glance:

  • Dynamically typed, interpreted language
  • Class / method syntax is similar

Method calls don't need ()

One of the first things that tripped me up was not understanding that using () to call a method is not required in Ruby.

def greet
  puts "Hello, world!"
end

greet
# => "Hello, world!"

In fact, () are optional when passing arguments as well!

def greet(name)
  puts "Hello, #{name}"
end

greet "Chris"
# => "Hello, Chris"

This can be used to build some very nice DSL features, resulting in code that is much more readable, e.g. Rails' delegate method:

require "rails" # to get delegate
class Message
  def greet
    puts "Hello!"
  end
end

class Greeter
  delegate :greet, to: :message
  attr_accessor :message
  def initialize
    @message = Message.new
  end
end

Greeter.new.greet
# => "Hello!"

Implicit return

As in Rust, the result of the last expression in a Ruby method is used as its return value.

def add_one(x)
  x + 1
end

add_one(2)
# => 3

Strings & Symbols

Ruby has a concept called symbols, which are a kind of identifier using the : prefix. They're often used to reference method names, since you can't get a reference to a method just by accessing it by name like in Python. E.g. obj.foo will call the foo method on obj in Ruby, whereas it will give you a reference to the foo method in Python. The equivalent in Ruby would be obj.method(:foo)
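For example (Dog here is a made-up class for illustration):

class Dog
  def speak
    "Woof!"
  end
end

dog = Dog.new
dog.speak               # calls the method: => "Woof!"
m = dog.method(:speak)  # a Method object, like Python's bound method
m.call                  # => "Woof!"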

Symbols are also used for named parameters in method calls. E.g.

def greet(message:)
  puts "Hello #{message}"
end

greet(message: "world!")
# => "Hello world!"

However, where symbols presented the steepest learning curve for me was in how they're handled in Hash (i.e. dict) literals.

a = "key"
h = {
  a: 1,
  a => 2,
}
# => { :a=>1, "key"=>2 }

It's extremely easy to get the two ways of defining a value mixed up. I've wasted an embarrassing number of hours on bugs caused by mixing up strings and symbols as the keys in hashes.

Range expressions

Ruby has nice built-in syntax for ranges:

(0..5).to_a
# => [0, 1, 2, 3, 4, 5]
(0...5).to_a
# => [0, 1, 2, 3, 4]

These are super convenient, but I almost always forget which form is inclusive and which is exclusive.

A-ha! moments

Blocks

Before really learning Ruby, I remember trying to read up on what blocks were...and not really getting it.

The best way I can come up with to explain them now is that they're a kind of anonymous function / closure.

Part of my confusion was not understanding that there are a few ways of defining and calling blocks.

These are equivalent:

mymap(0...5) do |x|
  x ** 2
end

# is the same as

mymap(0...5) { |x| x ** 2 }

The block's arguments are passed via the identifiers between the pipe (|) symbols.

In both cases, the mymap method is being passed a block, which in our case gets executed once per element (but that's completely up to the implementation of mymap). The block can be named as an explicit function parameter, and this could be written as:

def mymap(obj, &block)
  result = []
  for i in obj
    result.push(block.call(i))
  end
  result
end

The block can also be passed implicitly (check using the block_given? method) and called via yield:

def mymap(obj)
  result = []
  for i in obj
    result.push(yield i)
  end
  result
end

Once I wrapped my head around blocks, I found them to be very useful!

"&:" idiom

Ruby has this really neat shorthand for creating a block that calls a method on an object: "&:". It's used like this:

["hello", "world"].map(&:upcase)
# => ["HELLO", "WORLD"]

There are a few things going on here:

  • :upcase is a Symbol referring to the upcase method on the object
  • & tries to convert its argument to a kind of closure using the argument's .to_proc method. Symbol#to_proc returns a closure that calls the given method on the passed in object.

The net result is something equivalent to

["hello", "world"].map { |s| s.send(:upcase) }
# OR
["hello", "world"].map { |s| s.upcase }

Brian Storti explains this in much more detail in his blog post.

Enumeration

Ruby has fantastic enumeration primitives built in; just check out the Enumerable module. Most of the basic container types in Ruby support Enumerable; when combined with blocks, this makes filtering and transforming data in Ruby a joy.

It also fits my brain better than Python's generator / comprehension expressions. When writing Python code to transform data, I often found that I was writing some code, then backing the cursor up to the beginning of the line.

In Ruby the data and logic flow from left to right, which makes it easy to chain things together.

(0...5).map { |x| x ** 2 }

In Python the logic is on the left but the data is on the right.

[x**2 for x in range(5)]

If I want to get only the square numbers that are even, I would simply add this in Ruby:

(0...5).map { |x| x ** 2 }.filter(&:even?)

Whereas in Python I would need to introduce more code before/after the expression I already have to achieve the same result:

[y for y in [x**2 for x in range(5)] if y % 2 == 0]

tl;dr

Ruby is a really nice language. I'm glad I've had the opportunity to learn it!

Python has a philosophy of "There should be one-- and preferably only one --obvious way to do it.". I think this helps with Python's readability at the cost of expressiveness, elegance, and conciseness in some cases.

In Ruby there are often multiple ways of achieving the same thing. It has a richer vocabulary for expressing ideas in code. This richness allows for more elegant code in many cases, although this perhaps requires more effort on the part of the reader to understand the code.

GCS Performance Experiments

Recently I've been digging into some performance optimizations in code that transfers dozens of assets into a bucket in Google Cloud Storage (GCS).

The code in question is using Google's Ruby gem, and uploads a set of files in sequence, something like this:

require "google/cloud/storage"
storage = Google::Cloud::Storage.new(...)
bucket = storage.bucket(...)

files.each do |filename|
  bucket.create_file(filename, filename)
end

For this naive implementation, it takes about 45 seconds to upload 75 files, totalling about 350k of data. That works out to about 600ms per file! This seemed awfully slow, so I wanted to understand what was going on.

(note that all the timings here were collected on my laptop running on my home wifi)

Jump down to the TL;DR for spoilers, or keep reading for way too much detail.

Simplify the dataset

First, I wanted to see if the size of the files had any impact on the speed here, so I created a test with just 50 empty text files:

$ touch assets/{1..50}.txt
$ time ruby gcs.rb !$
ruby gcs.rb assets/{1..50}.txt  0.61s user 0.20s system 2% cpu 27.951 total

That's still about 28 seconds, or about 550ms per asset, which seems super slow for a set of empty files.

Start digging

Looking at the HTTP traffic generated by running this code, it appears as though Google's gem is creating two(!!) new TCP/HTTPS connections per asset. The default behaviour of the gem is to use a "resumable upload", where one request starts the upload and a second request transfers the data and finalizes the upload, each on its own connection. It doesn't appear as though any connection pooling is happening either.

It looks like there's room for some improvement.

A few ideas came to mind:

  • Use a single request per asset
  • Use connection pooling
  • Run requests in parallel (using threading or async)
  • Use HTTP/2

Single request per asset

The GCS API is pretty simple, so it's straightforward to implement a basic upload function in Ruby.

require "net/http"

def upload_object(bucket, filename)
  Net::HTTP.post(
    URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}"),
    File.read(filename),
    {"Authorization" => "Bearer #{ENV["AUTH"]}"}
  )
end

This uses one connection per asset, and brings the time down to about 11 seconds for our 50 empty assets, which is less than half of the naive version.

Re-using the same connection

Net::HTTP supports re-using the same connection; we just need to restructure the code a little bit:

def upload_object(client, bucket, filename)
  uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{ENV["AUTH"]}"
  req.body = File.read(filename)
  client.request(req)
end

Net::HTTP.start("storage.googleapis.com", 443, use_ssl: true) do |http|
  files.each do |filename|
    upload_object(http, bucket, filename)
  end
end

This runs a bit faster now, in 8 seconds total.

Parallelization

Ruby has a few concurrency models we can play with. I try to avoid threading wherever possible, and use async libraries. Luckily, Ruby's async gem handles wrapping Net::HTTP quite well:

require "async"
require "async/barrier"

Async do
  barrier = Async::Barrier.new
  files.each do |filename|
    barrier.async do
      resp = upload_object(bucket, filename)
    end
  end
  barrier.wait
end

Now we can finish all 50 uploads in about 1.3s.

HTTP/2

There are various options for making HTTP/2 requests in Ruby. One we can use is async-http, which integrates well with the async gem used above. Another gem that's worked well for me is httpx.

require "async"
require "async/barrier"
require "async/http/internet"

Async do
  barrier = Async::Barrier.new
  internet = Async::HTTP::Internet.new

  files.each do |filename|
    barrier.async do
      resp = internet.post(
        "https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}",
        {"Authorization" => "Bearer #{ENV["AUTH"]}"},
        File.read(filename))
    end
  end
  barrier.wait
end

This finishes all 50 requests in 0.4s!

Looking back at our initial data set which took 45 seconds to run, we can now do in 0.9 seconds.

TL;DR

To summarize, here are the times for uploading 50 empty files to a bucket:

Method                     Time
naive (google gem)         28s
single request per asset   11s
single http connection     8s
async                      1.3s
HTTP/2                     0.4s

I'm really shocked at how much faster HTTP/2 is here.

Consumers of Google Cloud APIs should take a look at the libraries that they're using, and see if they can switch over to ones that support HTTP/2. I'm curious if other client libraries for the Google Cloud APIs support HTTP/2.

Can the Ruby gem support HTTP/2? There's an open issue on the github repo to switch the underlying client implementation to Faraday, which would allow one to use an HTTP/2 aware client under the hood. I've started working on a draft PR to see what would be involved in switching, but there are some major barriers at this point.