GCS Performance Experiments
Recently I've been digging into performance optimizations for some code that transfers dozens of assets into a bucket in Google Cloud Storage (GCS).
The code in question is using Google's Ruby gem, and uploads a set of files in sequence, something like this:
    require "google/cloud/storage"

    storage = Google::Cloud::Storage.new(...)
    bucket = storage.bucket(...)

    files.each do |filename|
      bucket.create_file(filename, filename)
    end
For this naive implementation, it takes about 45 seconds to upload 75 files, totalling about 350 KB of data. That works out to about 600ms per file! This seemed awfully slow, so I wanted to understand what was going on.
(Note that all the timings here were collected on my laptop, running on my home wifi.)
Jump down to the TL;DR for spoilers, or keep reading for way too much detail.
Simplify the dataset
First, I wanted to see if the size of the files had any impact on the speed here, so I created a test with just 50 empty text files:
    $ touch assets/{1..50}.txt
    $ time ruby gcs.rb !$
    ruby gcs.rb assets/{1..50}.txt  0.61s user 0.20s system 2% cpu 27.951 total
That's still about 28 seconds, or about 560ms per asset, which seems super slow for a set of empty files.
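The per-file figures quoted here are just the total wall-clock time divided by the number of files. A minimal sketch of that measurement is below; `time_per_file` and `per_file_ms` are hypothetical helper names, not part of the script being tested.

```ruby
# Time a block that handles each file, then report the per-file average.
# Uses a monotonic clock so wall-clock adjustments don't skew the result.
def time_per_file(files)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  files.each { |filename| yield filename }
  total = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
  { total_s: total, per_file_ms: total * 1000.0 / files.size }
end

# Per-file cost from an already-measured total (seconds -> milliseconds).
def per_file_ms(total_seconds, file_count)
  total_seconds * 1000.0 / file_count
end

puts per_file_ms(45, 75)     # the 75-file run
puts per_file_ms(27.951, 50) # the 50 empty files
```

Wrapping the upload loop in `time_per_file(files) { |f| ... }` would give the same kind of number as `time` does, without counting Ruby's startup cost.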
Start digging
Looking at the HTTP traffic generated by running this code, it appears as though Google's gem is creating two(!!) new TCP / HTTPS connections per asset. The default behaviour of the gem is to use a "resumable upload", where one connection is opened to start the upload, and then a second connection is opened to transfer the data and finalize the upload. It doesn't appear as though any connection pooling is happening either.
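To make the two round trips concrete, here is a rough sketch of the two requests a resumable upload involves, as I understand the GCS JSON API: a POST with `uploadType=resumable` that returns a session URI in its `Location` header, followed by a PUT of the bytes to that session URI. This is not the gem's code; the helper names are made up, and the requests are only built, not sent.

```ruby
require "net/http"
require "uri"
require "erb"

UPLOAD_HOST = "https://storage.googleapis.com"

# Step 1: a POST that opens a resumable session.
# The response's Location header carries the session URI.
def resumable_start_request(bucket, object_name, token)
  uri = URI("#{UPLOAD_HOST}/upload/storage/v1/b/#{bucket}/o" \
            "?uploadType=resumable&name=#{ERB::Util.url_encode(object_name)}")
  req = Net::HTTP::Post.new(uri)
  req["Authorization"] = "Bearer #{token}"
  req
end

# Step 2: a PUT that sends the bytes to the session URI and finalizes the upload.
def resumable_finish_request(session_uri, data)
  req = Net::HTTP::Put.new(URI(session_uri))
  req.body = data
  req
end

start = resumable_start_request("my-bucket", "assets/1.txt", "example-token")
puts "#{start.method} #{start.path}"
```

If each of those requests goes out on a fresh connection, every asset pays for two TCP and TLS handshakes before any data moves.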
It looks like there's room for some improvement.
A few ideas came to mind:
- Use a single request per asset
- Use connection pooling
- Run requests in parallel (using threading or async)
- Use HTTP/2
Single request per asset
The GCS API is pretty simple, so it's straightforward to implement a basic upload function in Ruby.
    require "net/http"

    # Upload one file with a single POST request (one new connection per call).
    def upload_object(bucket, filename)
      Net::HTTP.post(
        URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}"),
        File.read(filename),
        {"Authorization" => "Bearer #{ENV["AUTH"]}"}
      )
    end
This uses one connection per asset, and brings the time down to about 11 seconds for our 50 empty assets, which is less than half of the naive version.
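One caveat with the function above: the filename is interpolated into the query string as-is, which is fine for names like `1.txt` but would produce a malformed URL for names containing spaces or `&`. A small sketch of a safer URL builder follows. `upload_uri` is a hypothetical helper, and passing `uploadType=media` explicitly is my reading of how the GCS docs describe single-request uploads, not something the original code does.

```ruby
require "uri"
require "erb"

# Build the single-request upload URL, percent-encoding the object name
# so spaces, slashes and ampersands can't break the query string.
def upload_uri(bucket, object_name)
  encoded = ERB::Util.url_encode(object_name)
  URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o" \
      "?uploadType=media&name=#{encoded}")
end

puts upload_uri("my-bucket", "assets/a b&c.txt")
```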
Re-using the same connection
Net::HTTP supports re-using the same connection; we just need to restructure the code a little bit:
    require "net/http"

    # Build the upload request and send it over an already-open connection.
    def upload_object(client, bucket, filename)
      uri = URI("https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}")
      req = Net::HTTP::Post.new(uri)
      req["Authorization"] = "Bearer #{ENV["AUTH"]}"
      req.body = File.read(filename)
      client.request(req)
    end

    # Open one TLS connection and reuse it for every file.
    Net::HTTP.start("storage.googleapis.com", 443, use_ssl: true) do |http|
      files.each do |filename|
        upload_object(http, bucket, filename)
      end
    end
This runs a bit faster now, in 8 seconds total.
Parallelization
Ruby has a few concurrency models we can play with. I try to avoid threading wherever possible and use async libraries instead. Luckily, Ruby's async gem handles wrapping Net::HTTP quite well:
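    require "async"
    require "async/barrier"

    Async do
      barrier = Async::Barrier.new
      files.each do |filename|
        # Each upload runs in its own task, using the two-argument
        # (single request per asset) upload_object from earlier.
        barrier.async do
          resp = upload_object(bucket, filename)
        end
      end
      barrier.wait # block until every upload has finished
    end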
Now we can finish all 50 uploads in about 1.3s.
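For comparison, the threading option from the list of ideas could look something like the sketch below. I didn't measure this variant, so it comes with no timing; `parallel_each` and the worker count of 8 are assumptions made for illustration.

```ruby
# Run the given block once per item, using at most `workers` threads.
# Items must be non-nil, since nil marks the end of the queue.
def parallel_each(items, workers: 8)
  queue = Queue.new
  items.each { |item| queue << item }
  queue.close # pop returns nil once the closed queue is drained

  threads = Array.new(workers) do
    Thread.new do
      while (item = queue.pop)
        yield item
      end
    end
  end
  threads.each(&:join)
end

# Hypothetical usage with the single-request upload function:
#   parallel_each(files) { |filename| upload_object(bucket, filename) }
```

Since the uploads spend nearly all their time waiting on the network, a handful of threads should overlap that waiting in much the same way the async version does.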
HTTP/2
There are various options for making HTTP/2 requests in Ruby. One we can use is async-http, which integrates well with the async gem used above. Another gem that's worked well for me is httpx. Here's the async-http version:
    require "async"
    require "async/barrier"
    require "async/http/internet"

    Async do
      barrier = Async::Barrier.new
      internet = Async::HTTP::Internet.new
      files.each do |filename|
        barrier.async do
          resp = internet.post(
            "https://storage.googleapis.com/upload/storage/v1/b/#{bucket}/o?name=#{filename}",
            {"Authorization" => "Bearer #{ENV["AUTH"]}"},
            File.read(filename)
          )
        end
      end
      barrier.wait # block until every upload has finished
    end
This finishes all 50 requests in 0.4s!
Looking back at our initial data set, which took 45 seconds to upload, we can now do it in 0.9 seconds.
TL;DR
To summarize, here are the times for uploading 50 empty files to a bucket:
Method | Time |
---|---|
naive (google gem) | 28s |
single request per asset | 11s |
single http connection | 8s |
async | 1.3s |
HTTP/2 | 0.4s |
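To put those numbers in relative terms, here is a quick sketch that computes each method's speedup over the naive version. The timings are copied from the table; `speedups` is a hypothetical helper.

```ruby
# Seconds to upload 50 empty files, copied from the table above.
TIMINGS = {
  "naive (google gem)"       => 28.0,
  "single request per asset" => 11.0,
  "single http connection"   => 8.0,
  "async"                    => 1.3,
  "HTTP/2"                   => 0.4
}.freeze

# Speedup of every entry relative to the named baseline, rounded to one decimal.
def speedups(timings, baseline:)
  base = timings.fetch(baseline)
  timings.transform_values { |seconds| (base / seconds).round(1) }
end

speedups(TIMINGS, baseline: "naive (google gem)").each do |name, factor|
  puts format("%-26s %5.1fx", name, factor)
end
```

By that measure the HTTP/2 version is roughly 70 times faster than where we started.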
I'm really shocked at how much faster HTTP/2 is here.
Consumers of Google Cloud APIs should take a look at the libraries that they're using, and see if they can switch over to ones that support HTTP/2. I'm curious if other client libraries for the Google Cloud APIs support HTTP/2.
Can the Ruby gem support HTTP/2? There's an open issue on the GitHub repo to switch the underlying client implementation to Faraday, which would allow one to use an HTTP/2-aware client under the hood. I've started working on a draft PR to see what would be involved in switching, but there are some major barriers at this point.