- concurrency is very hard
- .. but object storage "solves" most of that for you, handing you a set of semantics which work reliably
- single file throughput sucks hilariously badly
- .. because 1Gb is ridiculously large for an atomic unit
- (this whole thing resembles a project I did a decade ago for transactional consistency on TFAT on Flash, except that somehow managed faster commit times despite running on a 400Mhz MIPS CPU. Edit: maybe I should try to remember how that worked and write it up for HN)
- therefore, all of the actual work is shifted to the broker. The broker is just periodically committing its state in case it crashes
- it's not clear whether the broker ACKs requests before they're in durable storage? Is it possible to lose requests in flight anyway?
- there's a great design for a message queue system between multiple nodes that aims for at least once delivery, and has existed for decades, while maintaining high throughput: SMTP. Actually, there's a whole bunch of message queue systems?
I read this blog post and to help wrap my head around it I put together a simple TCP-based KV store with group commit, helped make it click for me.
Works great for Parquet.
Still, it's nbd. You can cache a billion Parquet header/footers on disk/ memory and get 90% of the performance (or better tbh).
You can achieve stupidly fast read/write operations if you do this right with a system that is shocking simple to reason about.
> Step 4: queue.json with an HA brokered group commit > The broker is stateless, so it's easy and inexpensive to move. And if we end up with more than one broker at a time? That's fine: CAS ensures correctness even with two brokers.
TBH this is the part that I think is tricky. Just resolving this in a way that doesn't end up with tons of clients wasting time talking to a broker that buffers their writes, pushes them, then always fails. I solved this at one point with token fencing and then decided it wasn't worth it and I just use a single instance to manage all writes. I'd again point to DeltaLake for the "good" design here, which is to have multiple manifests and only serialize compaction, which also unlocks parallel writers.
The other hard part is data deletion. For the queue it looks deadly simple since it's one file, but if you want to ramp up your scale and get multiple writers or manage indexes (also in S3) then deletion becomes something you have to slip into compaction. Again, I had it at one point and backed it out because it was painful.
But I have 40k writes per second working just fine for my setup, so I'm not worrying. I'd suggest others basically punt as hard as possible on this. If you need more writes, start up a separate index with its own partition for its own separate set of data, or do naive sharding.
[1] https://docs.cloud.google.com/storage/docs/samples/storage-m... [2] https://docs.aws.amazon.com/AmazonS3/latest/API/API_RenameOb... [3] https://fractalbits.com/blog/why-we-built-another-object-sto...
> RenameObject is only supported for objects stored in the S3 Express One Zone storage class.
Ah interesting, I don't use this but I bet in a year+ AWS will have this everywhere lol S3 is just too good.
That said, I suspect that you can probably beat SQS for a number of use cases, and definitely if you want to hold onto the data long term or search over it then S3 has huge options there.
Performance will be extremely solid unless you need your worst case latency for "push -> pop" to be very tight in your p90.
If you want consistently really low latency/ can't tolerate a 50ms spike, don't retain tons of data, have <10K/s writes, and need complex indexing that might change over time, Postgres is probably what you want (or some other thing). If you know how your data should be indexed ahead of time, you need to store a massive amount, you care more about throughput than a latency spike here or there, or really a bunch of other use cases probably, S3 is just an insanely powerful primitive.
Insane storage also unlocks new capabilities. Immutable logs unlock "time travel" where you can ask questions like "what did the system look like at this point?" since no information is lost (unless you want to lose it, up to you).
Everything about a system like this comes down to reducing the cost of a GET. Bloom filters are your best friend, metadata is your best friend, prefetching is a reluctant friend, etc.
I'm not sure what I'm building. I had this idea years ago before S3 CAS was a thing and I was building a graph database on S3 with the fundamental primitive being an immutable event log (at the time using CRDTs for merge semantics, but I've abandoned that for now) and then maintaining an external index in Scylla with S3 Select for projections. Years later, I have fun poking at it sometimes and redesigning it. S3 CAS unlocked a lot of ways to completely move the system to S3.
Could you expand on that? Even if it wasn't the approach you stuck with, I'm curious.
Writes were not visible until compaction in this system. At compaction time, tokens would be checked and writes for older tokens would be rejected, so even if two nodes thought that they owned a 'place' in the ring, only writes for the higher value would be accepted. Soooomething like that. I ended up disliking this because it had undesirable failure modes like lots of stale/ wasted writes, and the code sucked.
In any organisation its good to make choices for simplicity rather than small optimisations - you're optimising maintenance, incident resolution, and development.
Typically I have a small pg server for these things. It'll work out slightly more expensive than this setup for one action, yet it will cope with so much more - extending to all kinds of other queues and config management - with simple management, off the shelf diagnostics etc.
While the object store is neat, there is a confluence of factors which make it great and simple for this workload, that may not extend to others. 200ms latency is a lot for other workloads, 5GB/s doesn't leave a lot of headroom, etc. And I don't want to be asked to diagnose transient issues with this.
So I'm torn. It's simple to deploy and configure from a fresh deployment PoV. Yet it wouldn't be accepted into any deployment I have worked on.
Similar idea but you have the power of S3 scale (if you really need it). For context, I do not work at WS. My company switched to it recently and we've seen great improvements over traditional Kafka.
This is not to say it's a bad system, but it's very precisely tailored for their needs. If you look at the original Kafka implementation, for instance, it was also very simple and targeted. As you bolt on more use cases and features you lose the simplicity to try and become all things to all people.
> It seems like this is an approach that trades off scale and performance for operational simplicity.
Yes, this is exactly it. Given that turbopuffer itself is built on the idea of object storage + stateless cache, we're all very comfortable dealing with it operationally. This design is enough for our needs and is much easier to be oncall for than adding an entirely new dependency would have been.
Conceptually that makes sense. How complicated is it to implement this failover logic in a safe way? If there are two processes, competing for CAS wins, is there not a risk that both will think they're non-leaders and terminate themselves?
1. Start
2. Load the queue.json from the object store
3. Receive request(s)
3. Edit in memory JSON with batch data
4. Save data with CAS
5. On failure not due to CAS, recover (or fail)
6. On success, succeed requests and go to 3
7. On failure due to CAS, fail active requests and terminate
The client should have a retry mechanism against the broker (which may include looking up the address again).
From the brokers PoV, it will never fail a CAS until another broker wins a CAS, at which point that other broker is the leader. If it does fail a CAS the client will retry with another broker, which will probably be the leader. The key insight is that the broker reads the file once, it doesn't compete to become leader by re-reading the data and this is OK because of the nature of the data. You could also say that brokers are set up to consider themselves "maybe the leader" until they find out they are not, and losing leadership doesn't lose data.
The mechanism to start brokers is only vaguely discussed, but if a host-unreachable also triggers a new broker there is a neat from-zero scaling property.
Further, this system (as described) scales best when writes are colocated (since it maximizes throughput via buffering). So even just by having a second writer you cut your throughput in ~half if one of them is basically dead.
If you split things up you can just do "merge manifests on conflict" since different writers would be writing to different files and the manifest is just an index, or you can do multiple manifests + compaction. DeltaLake does the latter, so you end up with a bunch of `0000.json`, `0001.json` and to reconstruct the full index you read all of them. You still have conflicts on allocating the json file but that's it, no wasted flushing. And then you can merge as you please. This all gets very complex at this stage I think, compaction becomes the "one writer only" bit, but you can serve reads and writes without compaction.
https://doi.org/10.14778/3415478.3415560
Note that since this paper was published we have gotten S3 CAS.
Alternatively, I guess just do what Kafka does or something like that?
I'd love to see a full sample implementation based on s3 + ecs - just to study how it works.
We don't have a relational database, otherwise that would work great for a queue! You can imagine us continuing to iterate here to Step 5, Step 6, ... Step N over time. The tradeoff of each step is complexity, and complexity has to be deserved. This is working exceptionally well currently.
Love this approach
What actually _needs_ to be in the database? I've never gone as far as building a job queue on top of object storage, but have been involved in building surprisingly consistent and reliable systems with object storage.
Among other problems I knew the next thing we were going to have to do was autoscaling and the system we had for call and response was a mess from that respect. Unanswered questions were: How do you know when all agents have succeeded, how do you avoid overwriting your peers’ data, and what do you do with agents that existed yesterday and don’t today?
I ended up rewriting all of the state management data so that each field had one writer and one or more readers. It also allowed me to move the last live service call for another service and decommission it. Instead of having a admin service you just called one of the peers at random and elected it leader for the duration of that operation. I also arranged the data so the leader could watch the parent key for the roll call and avoid needing to poll.
Each time a task was created the leader would do a service discovery call to get a headcount and then wait for everyone to set a suggests or failure state. Some of these state transitions were idempotent, so if you reissued a task you didn’t need to delete the old results. Everyone who already completed it would noop, and the ones that failed or the new servers that joined the cluster would finish up. If there was a delete operation later then the data would be purged from the data set and the agents, a subsequent call would be considered new.
Long story short, your CS program should have distributed computing classes because this shit is hard to work out from first principles when you don’t know what the principles even are.
Right. You can issue a write that will only be accepted if a condition is matched, like the etag of the object matching your expectation. If it doesn't match, your object was invalidated.
For instance backing off 53 ms is a lot of latency. Backing off 1 μs may still result in a collision, requiring two or three backoffs before it resolves, where a longer pause on the first backoff might have resulted in quicker recovery.
Optimistic locking is a statistical model just like bloom filters are, and scaling something 20 or 100 times higher is a hell of a lot of runway. Especially if the failure mode is a person sharing someone else’s account or using a laptop and tablet at exactly the same time.