GitHub had a blog post about this recently. They reported a significant uptick in volume (repos created, PRs, etc.), which they attribute to AI usage and tooling.
It could be many things. Microsoft mismanaging stuff. Azure. Vibe-coded Github. So much AI slop being committed it adds an extra burden on the servers, etc.
Looks like a terrible source. Like someone ran Claude on the codebase, didn't analyse the results, then vibe coded a blog post. And the dustri.org link doesn't work for me.
I've been against self hosting internal tools for a long time, mainly because of the devops and other overhead. But AI based devops makes it so easy to spin up whatever you want that I'm reconsidering that. I use a lot of ansible for several of our deployments. At this point, most of that is managed via codex.
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SAAS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
Are there any GitHub Actions-compatible CI services out there that don't rely on their infrastructure? I know of depot's but no others; are these resilient to these outages or do they still lose functionality? I imagine the latter but I don't know.
Founder of Depot here. To my knowledge, we are the first engine to support different syntaxes in this compatible way via Depot CI [0]. Great time to try it out and let us know your thoughts! We’ve built a lot of cool stuff into it like parallel steps, custom images, and a full CLI/API interface so you can do literally everything without going into the web app.
Is there a tier for open source organizations? Do I have to admin any of AWS that runs behind the scenes or can I pay a fixed price to depot and get it to solve everything out of my way?
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
As someone who partially uses depot but was still affected by this github issue, we obviously haven't moved over enough. We use your runners but github is still blocking us.
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Yes, triggering Depot CI via the CLI is the sure fire way to avoid all dependencies on GitHub.
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream Actions control plane. But it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
Are you able to bring your own runners? Our org is heavily invested in self-hosted runners at this point and have gotten a pretty tremendous value from it. I think we'd be wise to get away from GitHub's control plane but keep running jobs in our own infra.
We currently use external runners (Blacksmith.sh), but that didn't shield us from this as GitHub actions is still the control plane for triggering and monitoring them.
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
GitHub Actions themselves can be self hosted. It's quite nice actually to be able to keep the same patterns as cloud hosted actions and, with a one line change to the yaml, have it running on your own hardware. I do this for actions that take 6-7 hours so I am not burning through the 3000 minutes that come free with my account.
Github measures/reports the SLA of the individual services.
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality the majority of people only use 3 or 4 of the core services the majority of the time but since there's no "core services" SLA/uptime the usability of github for the majority of people is slightly obfuscated.
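For what it's worth, a "core services" number is easy enough to compute yourself if you have the incident windows. A rough Python sketch, where the service names and the `(start, end)` minute intervals are made up for illustration and are not GitHub's actual data or method: it merges overlapping incidents and counts a minute as "down" if any of the chosen services was degraded.

```python
def merged_downtime(intervals):
    """Total minutes covered by possibly overlapping (start, end) intervals."""
    total, current_end = 0, None
    for start, end in sorted(intervals):
        if current_end is None or start > current_end:
            # Disjoint from everything seen so far: count it in full.
            total += end - start
            current_end = end
        elif end > current_end:
            # Overlaps the previous interval: count only the new tail.
            total += end - current_end
            current_end = end
    return total


def uptime(incidents, services, period_minutes):
    """Fraction of the period during which every listed service was up."""
    intervals = [iv for name in services for iv in incidents.get(name, [])]
    return 1.0 - merged_downtime(intervals) / period_minutes


# Hypothetical incident data, in minutes since the start of the period.
incidents = {
    "git": [(0, 30)],
    "actions": [(20, 60)],
    "pages": [(500, 530)],
}
core = ["git", "actions"]
print(uptime(incidents, ["git"], 1000))        # per-service view
print(uptime(incidents, core, 1000))           # "core services" view
print(uptime(incidents, list(incidents), 1000))  # "anything degraded" view
```

The three numbers differ for the same incidents, which is the whole disagreement between the official and unofficial charts.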
This is great because I finally set up Actions yesterday for a new project of mine and of course it’s failing today and I’m thinking I screwed up the yaml.
`github-actions[bot]` was disabled for some time, if that's the actor which does the checkout in your setup it could be related. FWIW it's back to working now.
If you don't want to self-host Gitea/Forgejo, I recommend SourceHut for private repos and Codeberg for public ones. Happy to answer any questions you might have for either based on my experience!
free service is down again, let's have everyone who uses the service for free complain again!!! (sorry for the sarcastic comment but i find it crazy how entitled people feel when it's free)
We use TeamCity for CI builds, before that Jenkins. Only accessible from the inside of the network.
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
Shout out to all my SF 5am crew checking if their overnight prs passed CI. Real 597 “member of technical staff” energy. I guess we should expect this, it is a Tuesday!
The future of SRE will be the company putting some amount of money on a prediction market against the site going down and you get to take home the winnings as long as the site stays up.
oh man spent so much time trying to debug what's going on. I have a complex setup with GitHub Actions and self hosted runners so I thought it's something broken in my CI setup
Another outage at GitHub with actions and pages not working thanks to the AI agents Copilot and Tay.ai creating more issues. Last time this happened was 6 days ago. [0]
This time today it was caused by friendly fire by the automatic suspension of the GitHub Actions bot which is now a "Ghost" user. Since there is no CEO of GitHub to contact it we are just going to see more [1] of this again.
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self hosted as I said 6 years ago...[2]
My action failed with "Unexpected error fetching GitHub release for tag refs/heads/master: HttpError: Sorry. Your account was suspended"
Which certainly made me shit myself, briefly.
A brownout redefined.
Same. It's weird how I always find out that GitHub is down before GitHub does. Took 15 minutes before it appeared on githubstatus.com
More likely that 'update the Status site' lives a long way down their incident response plan, and they have alarms going off well before that
yeah I mean a company the size of GitHub certainly can’t be expected to have enough staff to walk and chew gum at the same time
All these monitoring rules are of the format "when 500 errors > baseline for x minutes". Otherwise you'd have monitoring alerts every second. So it is normal for users to already see errors before github officially counts it as an outage.
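To make the delay concrete, here's a minimal Python sketch of that kind of rule. The baseline, the window length, and the per-minute counts are all made-up numbers, not anything GitHub actually uses.

```python
from dataclasses import dataclass


@dataclass
class SustainedThresholdAlert:
    """Fire only when the error rate stays above baseline for `window` minutes in a row."""
    baseline: float  # hypothetical: tolerated fraction of 5xx responses per minute
    window: int      # hypothetical: consecutive bad minutes required before alerting
    _streak: int = 0

    def observe(self, errors: int, requests: int) -> bool:
        """Feed one minute of counts; return True if the alert should fire."""
        rate = errors / requests if requests else 0.0
        # A single minute at or below baseline resets the streak.
        self._streak = self._streak + 1 if rate > self.baseline else 0
        return self._streak >= self.window


def first_alert_minute(samples, baseline, window):
    """Return the 1-based minute at which the alert first fires, or None."""
    alert = SustainedThresholdAlert(baseline=baseline, window=window)
    for minute, (errors, requests) in enumerate(samples, start=1):
        if alert.observe(errors, requests):
            return minute
    return None


# (errors, requests) per minute: healthy first minute, then a sustained failure.
samples = [(1, 1000), (50, 1000), (60, 1000), (70, 1000), (80, 1000), (90, 1000)]
print(first_alert_minute(samples, baseline=0.01, window=5))
```

With these numbers users start seeing errors in minute 2, but the rule needs five bad minutes in a row before it fires, so a short blip never alerts and a real outage is always noticed by users first.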
You'd expect them to be monitoring more than just the HTTP response codes from user requests for precisely this reason.
If the first they hear of an outage is when user requests start to fail, then that's a failure in their monitoring as well.
But effective monitoring is harder than people assume.
In a high performance service with good maintenance and upkeep, you page for all 500s. A noisy pager forces the team to fix the 500s.
Maybe the Github Actions infrastructure isn't run like that.
I'm curious about this, because in my experience (working on smaller services though), a small number of errors is always there, as a "baseline".
Recently there was this: https://news.ycombinator.com/item?id=47252971 "10% of Firefox crashes are caused by bitflips"
Which makes me think a small amount of random issues, which happen even though nothing is broken, is normal everywhere. Especially once you move things around on a network, there's potential for a lot more random errors.
Bitflips are something that can happen in consumer-grade RAM, so that tracks (and it's comforting that wayward cosmic rays are a substantial reason for an application's crashes!), but on enterprise servers, they will run ECC RAM that is very resistant to bit flips.
This is why data hoarders who have NASes with lots of space insist on running their servers with ECC RAM despite it being significantly more expensive. Because bit flips, for all intents and purposes, cannot happen. The RAM itself detects and corrects for them.
I wouldn't expect bit flips to be a significant contributor to enterprise problems.
You've completely missed the point - It's not about bitflips it's about errors that are outside the scope of what's fixable.
Bitflips specifically may not be; things like network issues, noisy neighbors, row/rack/host maintenance (leading to a downed and migrated host) absolutely are things that happen at high frequency at scale and cause your background level of errors to be more than 0.
Do you know of a single service at a single company that actually does that?
I know all of Gmail, every GCE service I can think of, every AWS service I can think of, Amazon.com, Netflix, and Github all do not page on just a single 500.
I know none of those are particularly "high performance" though. Curious where your experience is coming from.
I've been oncall for a different G service that nearly paged on every error. It used the standard error budget tooling, but on hundreds of user buckets because the engineering around locality-specific configuration was... suspect. Many of these buckets had single-digit user counts. If a user was on their phone and lost signal, I was paged. Very poor oncall experience.
I worked at a large fintech moving billions of dollars in volume a day.
I had a fairly long tenure, where I maintained multiple key services in critical online payments flow. Authentication, authorization, core business and risk data, as well as some cross-cutting control plane stuff, etc. You needed one or more of our services to take a payment, serve any request from the employee dashboard - pretty much everything hit our services. The entire company ground to a halt without my team.
We paged for every single 500. In instances where a particular class of 500 was spurious or not worth fixing, we would leave it acked or mark it as noise. But typically we'd just put in a fix as soon as possible so we didn't page.
Our graceful shutdown and traffic shaping stack was great, but occasionally we'd get a few pages during deploys or failovers.
Oncall was typically not bad, but when it did get bad it was terrible. I've been involved in huge outages that cost hundreds of millions of dollars. Usually it was the fault of multiple teams having compounding runaway failures rather than one service or bug in particular.
It's inexcusable to have a customer's payments not go through. We engineered around resilience. We had strict five nines SLAs and p99 targets and evaluated our adherence with even the smallest partial outage. Hundreds of other services depended on ours, and downstream impacts were huge, so we had to keep a tight ship.
> We paged for every single 500.
Assuming the existence of some kind of network (with zero guarantee of 100% reliability), how does this work in practice? Is each 500 treated as an event that needs investigation, even if the result of that would end up as 'a router dropped something from an internal buffer but the transaction as a whole was re-tried by a parent so the service itself recovered'?
that is absolutely not the case for any system of size and scale. that would just burn out the on-call team and not result in improvements. Error rates/budgets are used instead.
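Roughly what the error budget approach looks like, as a Python sketch. The SLO values and the burn-rate threshold of 10 are assumptions picked for illustration, not any particular company's policy.

```python
def error_budget(slo: float, total_requests: int) -> int:
    """Number of failed requests the SLO tolerates over the period."""
    return int(round((1.0 - slo) * total_requests))


def burn_rate(slo: float, errors: int, requests: int) -> float:
    """How fast the budget is being consumed; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0
    return (errors / requests) / (1.0 - slo)


def should_page(slo: float, errors: int, requests: int, threshold: float = 10.0) -> bool:
    """Page only when errors burn the budget much faster than planned.

    The threshold is a hypothetical tuning knob, not a standard value.
    """
    return burn_rate(slo, errors, requests) >= threshold


# A 99.9% SLO over a million requests leaves room for about a thousand failures.
print(error_budget(0.999, 1_000_000))
# One stray 500 in 100k requests is nowhere near page-worthy...
print(should_page(0.999, errors=1, requests=100_000))
# ...but 2% of requests failing burns the budget about 20x too fast.
print(should_page(0.999, errors=2000, requests=100_000))
```

The point is that a single 500 consumes a tiny, known slice of the budget, so the pager stays quiet until the rate says something is actually wrong.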
It depends what you're monitoring. If it's response codes from user generated queries, then I'd agree with you.
But if it is synthetic queries sent from the monitoring platform, then you control the user agent, payload, and endpoints. So any failed requests are a symptom of a misconfiguration and/or failure that should be investigated. Albeit not necessarily as a P1 priority.
forget it, Jake; it’s Azure
> It's weird how I always find out that GitHub is down before GitHub does
No, it's not. Official updates = potential SLA penalties. Always requires approval.
Yes, Thais can be really frustrating when you’re trying to get work done. There needs to be more competition and better alternatives and the LLMs need to offer easier connection to these alternatives.
What do the Thai people have to do with this? :(
Pretty sure that they wanted to write "this", typed something different by accident, and auto-correct struck.
Reminded me of the "Thai Fighter" joke from family guy's star wars spoof lol
Wasn’t my fault this time! I haven’t started work yet.
https://news.ycombinator.com/item?id=47237377
Hah, I know the feeling. I installed Ubuntu on a PC recently, it obviously happened to be one of the days they got DDOSed and apt repos were unreachable. I had other things to take care of, so I put it aside for the next week or so. It didn't help very much, cause after picking it back up, halfway through, Snapcraft went down.
Yeah but you thought about it, didn’t you?
I did....maybe my powers are growing.
Sorry guys it might be me.
I vibe coded a script that interacts with both Gitlab and Github via their APIs and I've been using it pretty heavily since this morning. I crossed the streams! Goodness, I didn't know it would be _this_ bad!
Next thing you're gonna tell us you're SRE at GitHub.
Uh oh. That means there's at least one more like you out there that we don't know about.
I always wanted superpowers, but I never dreamed it'd be like this.
- So many super-heroes/super-villains
Was about to send my bill to you.
... You're off the hook this time./s
Wasn't my fault! People flipped out a couple of weeks ago and thought I was gonna bring down Actions when I posted Ghostbox CLI - a tool to quickly configure and spin up runner machines for your dev work.
The thread was insane - people totally misunderstanding and just snowballing in misrepresentative panic - what happened then was commenters lost it entirely when I posted what https://ghost.charity actually was, they couldn't accept they'd been wrong - they still believed it would "bring down GH actions", and projected that ghost was reselling Actions free minutes, and doing DDoS on Actions, when in reality it was just configuring workflows on yours to make your own hybrid agent/human dev work clean and fast.
The panicked commenters were sure they were saving GitHub/MS and flagged the repo dozens of times until GitHub auto-disabled it, and it looks like GitHub/Microsoft still hasn't actually looked at it - so it's still autodisabled. Anyone work at GitHub?
So Actions goin' down - ain't my fault - despite HN's surety ghost was bad - it was, and is good! Embrace the agentic future!
Insane, we have to come up with contingency plans now for long-duration GitHub outages because we can't safely do deployments. For a service we're paying thousands of $ per year for even though we host runners ourselves...
Same thoughts - we use an action to ship to production: it builds an image and pushes it to ECS, which triggers a deployment.
We can't be blocked here. Seems silly that we settled on this, but GitHub had been reliable enough for many years. Things are sliding down the pan as of late.
Sounds like a very easy process to rewrite in bash/python and have it on hand if needed.
It is a control pain
./deploy.sh
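Something like this would do as the break-glass version, in Python since that was mentioned. It's only a sketch under assumptions: the image tag, cluster, and service names are placeholders, it assumes you're already logged in to your registry, and it assumes the task definition points at a mutable tag so that forcing a new deployment picks up the fresh image.

```python
import subprocess


def deploy_commands(image: str, cluster: str, service: str, context: str = "."):
    """Build the command list for a manual build, push, and redeploy."""
    return [
        ["docker", "build", "-t", image, context],
        ["docker", "push", image],
        # Ask ECS to restart the service's tasks so they pull the new image.
        ["aws", "ecs", "update-service", "--cluster", cluster,
         "--service", service, "--force-new-deployment"],
    ]


def deploy(image: str, cluster: str, service: str, dry_run: bool = True) -> None:
    """Print each step and, unless dry_run is set, execute it."""
    for cmd in deploy_commands(image, cluster, service):
        print(" ".join(cmd))
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing step


if __name__ == "__main__":
    # Placeholder names; dry_run=True only prints the commands.
    deploy("registry.example/app:latest", "prod-cluster", "web", dry_run=True)
```

Keeping the command list separate from the execution makes it easy to check the fallback in a dry run before the day you actually need it.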
Same here. You’d think they could at least separate out the GitHub-hosted and self-hosted runners, so you’re still able to dispatch jobs if the self-hosted runners are down.
If the job queue is down, that wouldn't help, would it?
On my repo the jobs do not get scheduled on the PRs at all, so I assume that separation wouldn't help for today's issue.
Depending on how many thousands of $ per year, it would probably be cheaper and more reliable to self-host GitLab. It's better in terms of organisational structure (you can have one, including access and secret inheritance), and (personal view) Gitlab-CI is better than GitHub Actions because it doesn't push you towards a JavaScript/NPM style dependency hell. And it's actually fairly easy to self-host, with options from a single machine with an omnibus package that handles everything to a full blown autoscaling Kubernetes deployment.
Sounds good until you see their cvedetails page
Hide it behind VPN, so it's not accessible from outside.
I mean, the GitHub Actions supply chain risks and attacks definitely compensate for any GitLab security vulnerabilities you can think of.
> For a service we're paying thousands of $ per year for even though we host runners ourselves...
Wait until they charge you for self-hosting runners.
Oh wait. They already tried.
Whilst you're waiting for it to come back, try out AGENT-CI (which is a project I built.), which runs GitHub Actions on your machine: https://agent-ci.dev. (Open source, etc.)
No, it's not like "act," because it uses the standard Github runner, the difference is that the control plane is an emulation of api.github.com, because of this we can do all kinds of nice things:
Caching in ~0 ms. Pause on failure, so you can let your AI agent fix it and retry without pushing.
You're affiliated with the project. You should definitely be upfront about that when shilling.
You're right, figured it was implied, but now fixed.
"It's not like act, because we can add AI"
Is what it boils down to.
> codex "Fix this pipeline, use `act` to verify your changes"
I did not say that, what I said was: It's not like `act` because it's not a rewrite of the runner. It's the standard runner... So the one that actually runs GitHub Actions.
I have tried to use act many times, and many times I've failed.
P.S. pause on failure is also helpful for humans, but I'm trying to be realistic about where the future of programming is going...
I had extremely bad experience trying to setup act on my Macbook. If this is something that actually works (and doesn't steal my credentials), I'm willing to try it despite AI non-features.
What I don’t get about this is how you run OS specific tasks (Windows, macOS, Linux)..
I started playing with proxmox VMs and containers in them (docker and tart) to see if I can build some local infrastructure to properly solve this…
We support macOS via tartlet, but basically it's always linux. If you need windows then it's gonna be an issue.
The jobs run in containers.
How to kill a business 101. The brand damage to business and owner is incalculable.
Incredible how reliable the heuristic of "something seems off - probably github being down" has gotten these days
It's big enough that every time it goes down, it surely stops somebody from pushing fix for what they currently have broken, so I wonder if status page services see some kind of ripple from github outages.
About an hour ago I was having trouble browsing repo files in the browser and I thought "A disturbance in the force, is Github down?" Refreshed HN and loaded up their status site. Nada.
(Ofc, in a sensible universe, we just brush that off to a JS/Firefox glitch or my ISP.)
And yet, here I am. My code is not compiling, my AI isn't vibing, nonetheless I can't work! Two more hours before I can get off!
https://www.dayswithoutgithubincident.com
Why do they go down so often? Is it true that the reason is that they've incorporated too much AI without human review?
The instability started well before vibecoding, in around 2018-2019, shortly after the Microsoft acquisition.
https://damrnelson.github.io/github-historical-uptime/
https://news.ycombinator.com/item?id=47591928
This gets posted every time GitHub is down. This chart is not accurate. It is based on data scraped from GitHub's status page and that data is missing historical incidents from the pre-Microsoft era.
Yeah, it’s not even consistent with their own incident history. I spot checked it and consistently found incidents with downtime/elevated error rates in months listed as 100.00000% uptime on that chart.
The unofficial and official charts are both lying. The GitHub one ignores actual outages and the unofficial ones count minor display bugs in minor features as a “github outage”.
It's (a) they're under massively increased load because everyone's vibing up new projects these days, (b) they've been in a weird frankenstein "on azure but also we have our own control plane" state for years and they're pushing to no longer have that be the case.
I don't think vibecoding at Github has much to do with it.
Ah, yes. A lot more repos, commits, and most importantly huge PRs.
That makes sense. Thank you!
No, it doesn’t. Their competition is not similarly unstable, despite existing in the same world of LLMs. Think critically.
Devil’s advocate, Pareto heuristic would let us speculate that 80% of LLM traffic would be aimed directly at the largest provider, i.e. GitHub.
I think it’s much more than 80%, it’s probably the default recommendation and folks who aren’t technical would just accept it. Probably closer to 95% or more
Your speculation is that their competitors would naturally not see a commensurate increase in instability while “only” handling 20% of the same crisis?
I don’t buy the excuse. I want to hitch my wagon to those “mysteriously lucky” competitors. (And have. And haven’t had similar issues to Github, since.)
Microsoft has boasted 30% of their code written by AI.[1] However we could only guess if AI generated code is the issue or something else, or a combination of things.
That being said, there was a noticeable trend starting around 2022.[2] They’ve also been doing a big migration to Azure. It’s likely a combination of things.
1: https://www.cnbc.com/2025/04/29/satya-nadella-says-as-much-a...
2: https://www.reddit.com/r/sysadmin/s/LOMPaSv3wY
GitHub had a blog post about this recently. They reported a significant uptick in volume (repos created, PRs, etc.), which they attribute to AI usage and tooling.
Do you really believe their competition hasn’t seen the same increase? Because their competition certainly hasn’t seen the same instability issues.
Yes, I truly believe that GitHub is recommended by an LLM orders of magnitude more frequently than any other forge
This plus in a well-designed system an increase in load might cause new jobs to stop running but shouldn't take down the whole system.
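A minimal sketch of that idea in Python, assuming a hypothetical in-process job queue (the `JobQueue` class and its method names are made up for illustration and say nothing about how GitHub Actions is actually built): once the queue is full, new jobs are rejected right away, while jobs that were already accepted keep running.

```python
import queue


class JobQueue:
    """Bounded queue that sheds new work instead of growing without limit."""

    def __init__(self, max_pending):
        # Hypothetical limit; a real system would tune this to its capacity.
        self._pending = queue.Queue(maxsize=max_pending)

    def submit(self, job):
        """Return True if the job was accepted, False if it was shed."""
        try:
            self._pending.put_nowait(job)
            return True
        except queue.Full:
            # Overloaded: refuse the new job, but keep serving accepted ones.
            return False

    def run_next(self):
        """Run the oldest accepted job, or return None if nothing is pending."""
        try:
            job = self._pending.get_nowait()
        except queue.Empty:
            return None
        return job()
```

The point of the bound is that overload degrades into "new jobs are refused for a while" rather than into the whole service falling over.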
I personally trigger github actions approximately 50x more than I did prior to AI-driven developer coding and I'm not alone.
Totally agree. There's days (or even afternoons) where I trigger more actions than I would have done in a month.
Okay so the recent outages are also likely due to increased load due to AI assisted development speeding up workflows.
It could be many things. Microsoft mismanaging stuff. Azure. Vibe-coded Github. So much AI slop being committed it adds an extra burden on the servers, etc.
I moved a while back to Forgejo -> https://forgejo.org couldn't be happier. Highly recommended.
Looks good, but I'm not sure about security: https://bearyangry.com/2026/04/29/carrot-disclosure-forgejo-...
Looks like a terrible source. Like someone ran Claude on the codebase, didn't analyse the results, then vibe coded a blog post. And the dustri.org link doesn't work for me.
Anyway. Forgejo's response to it: https://floss.social/@forgejo/116494295922963052
I've been against self hosting internal tools for a long time, mainly because of the devops and other overhead. But AI-based devops makes it so easy now to spin up whatever you want that I'm reconsidering that. I use a lot of ansible for several of our deployments. At this point, most of that is managed via codex.
For Git, all you technically need is ssh access and some backup strategy for your server. It would be bare bones but workable. And there are of course plenty of OSS things that are a lot nicer than that.
I'm still using gh and gh actions and we are mostly below the freemium layer with that. But it is kind of slow and honestly a dedicated vm plus some high CPU/memory workers we can spin up on a need to have basis might be a lot faster. With GH outages becoming more common, my hand might be forced a bit.
In recent weeks, I've spun up listmonk (mailing list solution), matrix (as a slack alternative), and a few other things specific to our software stack. A github alternative would be more of the same. We don't need a lot.
The main objection is that with more moving parts to worry about, the workload for me also increases. Things need updating, monitoring, backups, alerting (and responding to alerts), etc. That sucks up my time and that is scarce.
Another reason for self hosting these days is that with agentic AI tools, self hosted things are a lot easier to integrate into agentic systems. If it is self hosted, you don't have to worry about API limitations, rate limitations, walled gardens, etc. All the traditional SaaS silos are becoming a problem from that point of view. The more locked down it is, the bigger the motive for moving away from it. That's why we ditched Slack for Matrix. Slack is hopelessly locked down and tedious to deal with. Matrix is super easy for this.
Are there any GitHub Actions-compatible CI services out there that don't rely on their infrastructure? I know of depot's but no others; are these resilient to these outages or do they still lose functionality? I imagine the latter but I don't know.
Founder of Depot here. To my knowledge, we are the first engine to support different syntaxes in this compatible way via Depot CI [0]. Great time to try it out and let us know your thoughts! We’ve built a lot of cool stuff into it like parallel steps, custom images, and a full CLI/API interface so you can literally do everything without going into the web app.
[0] https://depot.dev
Is there a tier for open source organizations? Do I have to admin any of AWS that runs behind the scenes or can I pay a fixed price to depot and get it to solve everything out of my way?
I used to use Cirrus CI as an alternative to GitHub Actions and am looking for a new alternative. I wonder if Depot could fit in the same way for my needs. I need to run builds and tests in Windows, Linux and macOS.
As someone who partially uses depot but was still affected by this github issue, we obviously haven't moved over enough. We use your runners but github is still blocking us.
Hope you don't mind the public ask, it seems useful for others.
If we're using depot runners, and want to use them directly, or move off of github actions being the controller for when things run: what do you suggest?
Trigger the workflows directly on depot via CLI?
Yes, triggering Depot CI via the CLI is the sure fire way to avoid all dependencies on GitHub.
We’d need more details around what you’re seeing. It is true that if auth across GitHub is broken then we can’t copy your actions out to be used by Depot CI. However, we have a solution in the works for that as well.
In short, Depot CI, our own engine and control plane, is not dependent on the upstream Actions control plane. But it still has to listen for commit events to know if/when to run jobs on things like PRs. This too is being removed in the future.
Are you able to bring your own runners? Our org is heavily invested in self-hosted runners at this point and have gotten a pretty tremendous value from it. I think we'd be wise to get away from GitHub's control plane but keep running jobs in our own infra.
Yes, we support this via Depot Managed for all of our products including the latest one: Depot CI [0].
[0] https://depot.dev/products/ci
We currently use external runners (Blacksmith.sh), but that didn't shield us from this as GitHub actions is still the control plane for triggering and monitoring them.
We're now considering Buildkite (apparently they have a GH actions migration tool) or self hosting something (GitLab CI, maybe even Jenkins), as it looks like that would've kept ticking over since we're still seeing webhooks being triggered today during the downtime.
Try Depot CI as well. Supports a GHA syntax but the entire control plane is ours with our own engine.
github actions themselves can be self hosted, it's quite nice actually to be able to keep your same patterns as cloud hosted actions and with a one-line change to the yaml have it running on your own hardware. I do this for actions that take 6-7 hours so I am not burning through the 3000 minutes that come free with my account.
Self-hosted action runners are not working right now either.
This isn't resilient to this downtime though. Our self-hosted runners are currently not functioning because of some github dependency.
what kind of actions take that long? some kind of compilation task / gigantic test suite ala SQLite?
there are a couple, and they have very good reputations - though I've never used them
https://www.blacksmith.sh/ and https://runs-on.com/
They also say that they're much cheaper than github
I think both of these provide nodes that are scheduled using GitHub's control plane. They would also not be working right now.
I switched to GitLab a while ago and then spun it up locally.
Something’s wrong when my own infrastructure is more reliable than Microsoft’s.
Someone said GitHub is racing to the mythical "zero nines of availability" and I love it
Hmm... 88.8888888%?
Jesus, that's both horrible and seems within reach.
Yep, they just need to improve their reliability by 2%!
https://mrshu.github.io/github-statuses/
This page tells a very different story from GitHub's own status page. What is different here?
Github measures/reports the SLA of the individual services.
The external page linked above goes the other extreme and considers it a bad status whenever any individual service is degraded.
In reality the majority of people only use 3 or 4 of the core services the majority of the time but since there's no "core services" SLA/uptime the usability of github for the majority of people is slightly obfuscated.
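To make the difference between the two pages concrete, here is a minimal Python sketch. The service names and the hourly samples are invented for illustration and are not real GitHub data; it only shows how per-service uptime, "any service degraded" uptime, and a "services I actually use" uptime can diverge on the same samples.

```python
# Hypothetical hourly samples: True = service healthy during that hour.
# Service names and values are invented for illustration only.
samples = {
    "git": [True, True, True, True, True, True, True, True, True, True],
    "actions": [True, True, False, True, True, True, True, True, True, True],
    "pages": [True, True, True, True, True, True, False, True, True, True],
    "copilot": [True, True, True, True, False, True, True, True, True, True],
}


def per_service_uptime(samples):
    """Uptime of each service on its own (the per-service view)."""
    return {name: sum(hours) / len(hours) for name, hours in samples.items()}


def all_services_uptime(samples):
    """Fraction of hours in which every service was healthy at once."""
    hours = list(zip(*samples.values()))
    return sum(all(hour) for hour in hours) / len(hours)


def subset_uptime(samples, names):
    """Same as all_services_uptime, restricted to the services you use."""
    return all_services_uptime({name: samples[name] for name in names})
```

With these made-up samples no single service is below 90%, yet only 70% of hours have everything healthy, while someone who only uses `git` and `actions` sees 90%.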
They've already been well below that over the last 90 days
GitHub Actions outage sparked direct-action, class-action, mass non-action, and widespread dis-satis-faction.
Just post here when it's up. It's easier...
"Microsoft’s GitHub was positioned to win the AI coding race. Outages got in the way" - https://www.cnbc.com/2026/05/22/microsoft-was-positioned-to-...
This is great because I finally set up Actions yesterday for a new project of mine and of course it’s failing today and I’m thinking I screwed up the yaml.
Yeah I'm getting an error where it says account has been suspended. They really are becoming an embarrassment
this has happened to me too. i am guessing then it is not a real reason?
`github-actions[bot]` was disabled for some time, if that's the actor which does the checkout in your setup it could be related. FWIW it's back to working now.
I don't understand anyone still using github for anything unless they have to or have paid for it. Move literally anywhere else
If you don't want to self-host Gitea/Forgejo, I recommend SourceHut for private repos and Codeberg for public ones. Happy to answer any questions you might have for either based on my experience!
What could be the cause of GitHub issues from an engineering perspective?
free service is down again, let's have everyone who uses the service for free complain again!!! (sorry for the sarcastic comment but i find it crazy how entitled people feel when it's free)
I was actually shocked when I saw what our org pays for Github, not cheap and defo not free
We pay github quite a bit of money and it's down for us too
It’s down for companies too, if your company org is using GitHub enterprise too.
There are plenty of paying enterprise users that are also affected.
It's so weird because github used to be known for rock solid stability and now the entire reputation has changed.
You must be new. github was never that stable.
This is outrageous. Someone go create a Polymarket.
Please don't. These "prediction markets" are a scourge upon mankind.
I've started spending each github outage planning our move to an alternative. I guess I'm not alone. Where are you all moving?
We use TeamCity for CI builds, before that Jenkins. Only accessible from the inside of the network.
Even though it's selfhosted and we don't have a dedicated infrastructure team, I don't remember it ever being down in the last 12 years I have been working here.
I wonder if these github failures are just systematic incompetence or MS cutting budget on purpose to promote its own cicd tools
Or possibly an elevated number of AI Slop Cannons aiming their LLM generated hallucinations at github hosted repos?
Shout out to all my SF 5am crew checking if their overnight prs passed CI. Real 597 “member of technical staff” energy. I guess we should expect this, it is a Tuesday!
And it is bypassing the mandatory GHA pipeline check and giving green. So be careful when merging/reviewing your PRs.
Hey at least Copilot AI Model Providers have 100% uptime, so there's that
I have fun imagining somebody internally explaining that this is a heavy-traffic page and we should use it to increase reach.
No way - everyone tells me the AI adoption is going great?
LoL they added "Copilot AI Model Providers" in githubstatus and it has 100% uptime.
Thanks for pointing out that nobody is using that thing
This has become so typical that we've started working on a modern Github alternative called Plain.
Perfect timing that we post https://www.jxd.dev/writing/building-plain just as this latest incident started.
Contingency action plan: Codeberg. Engage.
Does anyone use any good alternatives to GitHub Actions?
How's the AI generated code running for ya?
I think we should start betting if GitHub will be down on Polymarkets or something at this point.
The future of SRE will be the company putting some amount of money on a prediction market against the site going down and you get to take home the winnings as long as the site stays up.
Tell Claude to fix it, simple.
List of things "DoS"d by AI:
- GitHub
- Hiring budgets
- RAM (/personal computing in general)
- Electricity
- Media/Content
- Truth
Has anyone actually moved off? If so where?
I like being able to vote with my (team's) wallet and I'm tired of staying out of convenience
I moved to Codeberg and self hosted Forgejo. I'm happy.
'Degraded' should be banned in status pages. It sounds just irresponsible, like "Yeah, it can be slow or something sometime. Whatever. Who cares"
Straight-up, "degraded" should strictly mean "may be slower, or so slow it randomly fails" on these kinds of status pages.
The whales are all dying, and we don't know why. Well, some are still alive for now though so maybe it's not so bad...
What would you call "available, but only sometimes"?
oh man spent so much time trying to debug what's going on. I have a complex setup with GitHub Actions and self hosted runners so I thought it's something broken in my CI setup
Ugh, same. 30 mins with 2 devs trying to figure it out before they posted an update.
Now PRs are piling up! https://github.com/mohsen1/tsz/pulls
It should be up again
i still can't see many pull requests in a bunch of repositories... it's been over a month
Super odd make productivity useless
Zero Nines. Bogus.
When is it up?
Github is more likely to be up before noon in UTC timezone. i.e. before the majority of US users are online and causing load.
Or maybe it's before the GitHub internal devs are online and deploying changes.
my work is totally stopped. cry
microsoft github should work at restoring interop with noscript/basic HTML browsers...
I agree, but that's not at all related to this outage.
Too many times we've been bitten by this - it has been an issue too many times to count.
This is why we don't use Github Actions, kids.
Seriously, it's a proprietary build service that puts the keys to the kingdom in someone else's control. Just: No!
Print this status page to PDF so you've got it handy next time someone castigates you for not using Github Actions, folks.
So, what do you use?
Another outage at GitHub with actions and pages not working thanks to the AI agents Copilot and Tay.ai creating more issues. Last time this happened was 6 days ago. [0]
This time it was caused by friendly fire: the automatic suspension of the GitHub Actions bot, which is now a "Ghost" user. Since there is no CEO of GitHub to contact, we are just going to see more [1] of this.
You might need to push a critical change soon, but now you cannot. You won't get any of these issues if you self-host, as I said 6 years ago...[2]
[0] https://www.githubstatus.com/incidents/g6ffrm0rfvz9
[1] https://news.ycombinator.com/item?id=48085501
[2] https://news.ycombinator.com/item?id=22867803
lol
https://github.blog/changelog/2026-05-15-github-app-installa...
I'm guessing related to this? The blog post is dated 11 days ago but I just noticed a blue banner on my actions page today.