This is a good talk. Really gets into the details of how things differ from the classical SaaS or consumer product.
I've been doing reliability for most of my career, and have always been able to hide behind, "We're not a bank, if we lose a few requests it doesn't matter". They can't do that. :)
One advantage that they have is that the market closes, so they can do maintenance that takes the whole system down, but when you're running a global consumer product, it's a lot harder to do that without pushback.
So for most of us, our stress is around zero downtime maintenance, and theirs is around never dropping a request when the system is live.
Yeah, I work on systems with reliability requirements like this at a large bank.
There are multiple layers of controls and manual interventions and things, which while absolutely painful, slow, expensive and shitstorm-conjuring -- are ultimately the final authority on some failures.
For e.g, in payments -- every single settlement or clearing anomaly is looked at by a real human, and rectified/rebooked manually.
So, yeah, the stakes can be really high when you have a couple billion in memory on your server, but -- it's just a system.
there’s a move now towards 24/7 trading. I guess we’ll see how the rigors of the trading environment mesh with zero down time. I’m sure the rollout will be slow and steady.
I've seen that. I suspect the exchanges will never go for it for this exact reason -- they need downtime for maintenance. But if does go through, it will be a fun challenge to get 100% uptime!
I've always said that with infinite money we could get 100% uptime, but no one has infinite money. Trading firms are about as close as I can imagine to infinite money though.
An amusing, moderately expensive solution that might actually work would be to have a weekday system and a weekend system. Think of it as a spare D/R system that you intentionally swap twice a week :)
If done right, it would be a complete separate system. Separate IP addresses and all.
I hated my time as an SRE. But … can’t it be done with some combination of canaries and blue green deployments and extensive testing? Where when things look good you just swap all the traffic to the good stuff keeping the rollback hot etc etc?
That's how we got 99.99% at Netflix. And it cost a lot of money. But a canary implies that something may go wrong and you have to roll back. The canary is still production traffic, so some transactions would fail, which isn't allowed for this kind of workload.
I image you'd have to use shadow execution, where you roll out a full second copy, run every transaction through both, and compare the results. And then, only after a certain time, switch traffic to the new infra and tear down the old.
But you would need a ton of extra hardware (more than double) and a lot of ways to keep data in sync. And of course if you put an LLM or other non-deterministic system in there, that's a whole other can of worms.
Folks that keep the lights on 24/7 aka SREs are super heroes that wear capes. Thank you for your service.
I couldn’t do it. I like infra and all but it’s just not my cup of tea. Def true that in a trading pov the trade must be executed. It must settle. It must work. Or capital flight will be huge.
Isn't the plan more like 23/5 like is already the case for several markets?
I can't see the standard sessions moving more 9:30am/4pm weekdays to 24/7. I take it they'd still let, at least, one hour off for technical reasons.
If I'm not mistaken it's the reason several markets are 23/5 and not 24/5: that one hour of downtime is basically for servers/maintenance right? (maybe someone can chime in)
P.S: I take it technically there's 24/7 trading already seen that cryptocurrencies exchanges are opened 24/7 (I'm not sure: but I think that's the case) but I don't think those do anywhere near the volume of, say, options trading on equities during standard sessions (40 Gbit/s with peak over 70 Gbit/s for the full options feed).
This is a good talk. Really gets into the details of how things differ from the classical SaaS or consumer product.
I've been doing reliability for most of my career, and have always been able to hide behind, "We're not a bank, if we lose a few requests it doesn't matter". They can't do that. :)
One advantage that they have is that the market closes, so they can do maintenance that takes the whole system down, but when you're running a global consumer product, it's a lot harder to do that without pushback.
So for most of us, our stress is around zero downtime maintenance, and theirs is around never dropping a request when the system is live.
Yeah, I work on systems with reliability requirements like this at a large bank.
There are multiple layers of controls and manual interventions and things, which while absolutely painful, slow, expensive and shitstorm-conjuring -- are ultimately the final authority on some failures.
For e.g, in payments -- every single settlement or clearing anomaly is looked at by a real human, and rectified/rebooked manually.
So, yeah, the stakes can be really high when you have a couple billion in memory on your server, but -- it's just a system.
And it will fail, and we plan for it to do so.
there’s a move now towards 24/7 trading. I guess we’ll see how the rigors of the trading environment mesh with zero down time. I’m sure the rollout will be slow and steady.
I've seen that. I suspect the exchanges will never go for it for this exact reason -- they need downtime for maintenance. But if does go through, it will be a fun challenge to get 100% uptime!
I've always said that with infinite money we could get 100% uptime, but no one has infinite money. Trading firms are about as close as I can imagine to infinite money though.
An amusing, moderately expensive solution that might actually work would be to have a weekday system and a weekend system. Think of it as a spare D/R system that you intentionally swap twice a week :)
If done right, it would be a complete separate system. Separate IP addresses and all.
I hated my time as an SRE. But … can’t it be done with some combination of canaries and blue green deployments and extensive testing? Where when things look good you just swap all the traffic to the good stuff keeping the rollback hot etc etc?
That's how we got 99.99% at Netflix. And it cost a lot of money. But a canary implies that something may go wrong and you have to roll back. The canary is still production traffic, so some transactions would fail, which isn't allowed for this kind of workload.
I image you'd have to use shadow execution, where you roll out a full second copy, run every transaction through both, and compare the results. And then, only after a certain time, switch traffic to the new infra and tear down the old.
But you would need a ton of extra hardware (more than double) and a lot of ways to keep data in sync. And of course if you put an LLM or other non-deterministic system in there, that's a whole other can of worms.
Like I said, a fun problem to solve. :)
Folks that keep the lights on 24/7 aka SREs are super heroes that wear capes. Thank you for your service.
I couldn’t do it. I like infra and all but it’s just not my cup of tea. Def true that in a trading pov the trade must be executed. It must settle. It must work. Or capital flight will be huge.
> there’s a move now towards 24/7 trading.
Isn't the plan more like 23/5 like is already the case for several markets?
I can't see the standard sessions moving more 9:30am/4pm weekdays to 24/7. I take it they'd still let, at least, one hour off for technical reasons.
If I'm not mistaken it's the reason several markets are 23/5 and not 24/5: that one hour of downtime is basically for servers/maintenance right? (maybe someone can chime in)
P.S: I take it technically there's 24/7 trading already seen that cryptocurrencies exchanges are opened 24/7 (I'm not sure: but I think that's the case) but I don't think those do anywhere near the volume of, say, options trading on equities during standard sessions (40 Gbit/s with peak over 70 Gbit/s for the full options feed).