I mean, their viz is free and straightforward, not hidden behind a paywall or a demo page. I also appreciate that they didn't add any comment-based signal indicators, as those are often noise.
There is still a tendency within some parts of aviation (safety auditing) to look for root causes and use tools like "fishbone diagrams", despite the more holistic approach used after an actual crash or incident.
A bunch of different services on a single status page doesn’t make it a complex system. Most of these have no relation to each other other than the high level services on the cloud providers.
> A bunch of different services on a single status page doesn’t make it a complex system.
you're right, it does not.
> Most of these have no relation to each other other than the high level services on the cloud providers.
so, some of them are related to each other? some of them even share underlying infrastructure? perhaps multiple of these are considered infrastructure for some teams?
This app looks to be incorrectly parsing the official Slack and Auth0 status pages and showing incidents as ongoing when they are not.
And those are just the 2 that I checked.
To be fair, accurately scraping and normalizing data from status pages is really hard to do consistently (my company has a team of 5 engineers to do it and it's a lot of work).
Services like Cloudflare and Twilio have so many POPs globally that one or more always have an outage going on. Then there's the question of whether it's a major outage or a minor outage. Even though major status page providers like Atlassian and Incident.io have public status APIs (Cloudflare uses Atlassian), it takes more than just parsing them to determine what is "down" and at what granularity.
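To make the granularity point concrete, here is a minimal sketch of one possible classification rule. It assumes a payload shaped like a Statuspage-style summary (a `components` list whose entries carry a `status` string and an optional `group` flag). The `classify` function, the `major_fraction` threshold, and the sample data are all hypothetical, not how any particular service does it.

```python
# Component statuses as used in Statuspage-style payloads, ranked by severity.
SEVERITY = {
    "operational": 0,
    "under_maintenance": 1,
    "degraded_performance": 2,
    "partial_outage": 3,
    "major_outage": 4,
}


def classify(summary, major_fraction=0.5):
    """Return 'up', 'minor', or 'major' for one provider's parsed summary.

    major_fraction is an assumed, tunable threshold: the share of components
    that must report 'major_outage' before the provider counts as down.
    """
    # Skip group containers so a group and its children are not double counted.
    components = [c for c in summary.get("components", []) if not c.get("group")]
    if not components:
        return "up"
    # Unknown or missing statuses are treated as operational (severity 0).
    impaired = [c for c in components if SEVERITY.get(c.get("status"), 0) >= 2]
    hard_down = [c for c in impaired if c["status"] == "major_outage"]
    if len(hard_down) / len(components) >= major_fraction:
        return "major"
    return "minor" if impaired else "up"


# Hypothetical provider with 300 POPs, one of which has a partial outage.
pop_heavy = {
    "components": [{"name": f"POP {i}", "status": "operational"} for i in range(299)]
    + [{"name": "POP 299", "status": "partial_outage"}]
}
print(classify(pop_heavy))
```

With a rule like this, a single degraded POP shows up as a minor event instead of marking the whole provider as down. Picking the threshold per provider is where most of the real work would be.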
I run an outage detection service - and some of these issues, like parsing hundreds of - sometimes undocumented - status APIs, make for an interesting engineering problem.
With these guys you get into a weird world of "is it them, us, or upstream of both of us" all the time. I had been using Twilio's telco partner maintenance notifications as a way of figuring out whether someone like Orange was responsible when a bunch of French endpoints had network degradation independent of Twilio.
Correlated downtime: this is a place where I wouldn't actually mind a guess from AI on whether there is a common underlying cause between some of the things. I say AI because I don't really think anyone is going to keep all of the possible common dependencies of different privately hosted systems up to date, but AI could at least take an initial guess and try to find out whether anyone else is posting root cause theories elsewhere at the time, then link to those (and a guess is fine enough).
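A much cruder first pass than an AI guess is to flag incidents whose time windows overlap as candidates for a shared cause. The sketch below does only that. The `overlapping_groups` function, the tuple layout, and the sample incidents are made up for illustration, and overlap in time obviously does not prove a common dependency.

```python
from datetime import datetime


def overlapping_groups(incidents):
    """Group incidents whose time windows overlap, directly or transitively.

    incidents: iterable of (service_name, start, end) tuples.
    Returns a list of groups (lists of service names) with 2+ members.
    """
    groups = []
    current, current_end = [], None
    # Sweep through incidents in order of start time.
    for name, start, end in sorted(incidents, key=lambda i: i[1]):
        if current and start <= current_end:
            # Starts before the running window closes: same candidate group.
            current.append(name)
            current_end = max(current_end, end)
        else:
            if len(current) > 1:
                groups.append(current)
            current, current_end = [name], end
    if len(current) > 1:
        groups.append(current)
    return groups


# Hypothetical incidents on a single day.
day = [
    ("service-a", datetime(2024, 1, 1, 10, 0), datetime(2024, 1, 1, 11, 0)),
    ("service-b", datetime(2024, 1, 1, 10, 30), datetime(2024, 1, 1, 10, 45)),
    ("service-c", datetime(2024, 1, 1, 13, 0), datetime(2024, 1, 1, 14, 0)),
]
print(overlapping_groups(day))
```

Each group would then be the thing to hand to a human (or a model) along with links to whatever root cause theories are circulating at the time.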
Where does this draw data from? It's a similar visual concept to what we're doing at ThousandEyes within Internet Insights (see https://www.thousandeyes.com/outages/); however, we make it fairly clear how we are making these determinations. Our data comes from billions of daily pseudonymous metrics from within synthetic tests running across thousands of agents around the world.
If you're drawing the data from a public resource like Downdetector or from the sites' status pages, then you may not be reflecting reality, but it should be clear what the provenance of the data is.
Well, if you count every minor service outage that maybe 0.1% of users are non-critically affected by, you quickly get to 0.6%. So this doesn't really tell you anything.
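The gap between the two ways of counting can be sketched with some arithmetic. The numbers below are invented purely to illustrate: a 30-day month, three non-overlapping incidents, each touching 0.1% of users. `downtime_fractions` is a hypothetical helper, not anything the site is known to compute.

```python
def downtime_fractions(incidents, period_hours):
    """Compare raw downtime with user-weighted downtime.

    incidents: list of (duration_hours, fraction_of_users_affected).
    Assumes the incidents do not overlap in time.
    Returns (raw, weighted) as fractions of the period.
    """
    # Raw: every incident counts as full downtime, however few users saw it.
    raw = sum(duration for duration, _ in incidents) / period_hours
    # Weighted: each incident is scaled by the share of users it affected.
    weighted = sum(duration * share for duration, share in incidents) / period_hours
    return raw, weighted


# Made-up month: 720 hours, three minor incidents of 1.44 hours each.
month_hours = 30 * 24
minor_incidents = [(1.44, 0.001)] * 3
raw, weighted = downtime_fractions(minor_incidents, month_hours)
print(f"raw: {raw:.2%}, weighted: {weighted:.4%}")
```

Under these assumptions the raw figure lands at about 0.6% while the user-weighted figure is a thousand times smaller, which is why a headline percentage means little without knowing how incidents were counted.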