It took about six hours, a new record for Facebook downtime, but Facebook is finally back up. What happened? Here’s what we know so far.
The old network troubleshooting saying is that when anything goes wrong, “It’s DNS.” This time, the Domain Name System (DNS) failure appears to be a symptom of the root cause of Facebook’s global outage. The true cause is that there were no working Border Gateway Protocol (BGP) routes into Facebook’s sites.
BGP is the standardized exterior gateway protocol used to exchange routing and reachability information between the internet’s top-level autonomous systems (AS). Most people, indeed most network administrators, never need to deal with BGP.
Many people spotted that Facebook was no longer listed in DNS. Indeed, there were joke posts offering to sell you the Facebook.com domain.
Cloudflare VP Dane Knecht was the first to report the underlying BGP problem. This meant, as Kevin Beaumont, Microsoft’s former Head of Security Operations Centre, tweeted, “By not having BGP announcements for your DNS name servers, DNS falls apart = nobody can find you on the internet. Same with WhatsApp btw. Facebook have basically deplatformed themselves from their own platform.”
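To make Beaumont’s point concrete, here is a minimal toy sketch in Python of why withdrawn BGP announcements take DNS down with them. It is not how real routers or resolvers are implemented. The `RoutingTable` class, the `resolve` function, the `example.com` name, and the addresses (taken from the reserved documentation ranges) are all hypothetical and chosen only for illustration. They are not Facebook’s actual prefixes or configuration.

```python
import ipaddress


class RoutingTable:
    """Toy stand-in for the set of prefixes the rest of the internet has learned via BGP."""

    def __init__(self):
        self.prefixes = set()

    def announce(self, prefix):
        # A BGP announcement makes a prefix reachable.
        self.prefixes.add(ipaddress.ip_network(prefix))

    def withdraw(self, prefix):
        # A BGP withdrawal removes the prefix; nobody knows how to reach it anymore.
        self.prefixes.discard(ipaddress.ip_network(prefix))

    def has_route(self, address):
        ip = ipaddress.ip_address(address)
        return any(ip in net for net in self.prefixes)


# Hypothetical data: one name server and one web server, in separate prefixes.
NAME_SERVERS = {"example.com": ["192.0.2.53"]}
RECORDS = {"example.com": "198.51.100.10"}


def resolve(name, table):
    """Return the address for `name`, or None if no name server is reachable."""
    for ns in NAME_SERVERS.get(name, []):
        if table.has_route(ns):
            return RECORDS[name]
    return None


table = RoutingTable()
table.announce("192.0.2.0/24")     # prefix holding the name server
table.announce("198.51.100.0/24")  # prefix holding the web server

before = resolve("example.com", table)

# Withdraw only the name server's prefix.
table.withdraw("192.0.2.0/24")
after = resolve("example.com", table)

print("before withdrawal:", before)
print("after withdrawal:", after)
```

In this sketch the web server’s prefix is still announced after the withdrawal, yet the lookup fails anyway because no route leads to the name server. That is the sense in which DNS was the symptom and BGP the cause.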
As annoying as this was to you, it may have been even more annoying to Facebook employees. There were reports that Facebook employees couldn’t enter their buildings because their “smart” badges and doors were also disabled by this network failure. If true, Facebook’s people literally couldn’t enter the building to fix things.
In the meantime, Reddit user u/ramenporn, who claimed to be a Facebook employee working on bringing the social network back from the dead, reported, before he deleted his account and his messages, that “DNS for FB services has been affected and this is likely a symptom of the actual issue, and that’s that BGP peering with Facebook peering routers has gone down, very likely due to a configuration change that went into effect shortly before the outages happened (started roughly 1540 UTC).”
He continued, “There are people now trying to gain access to the peering routers to implement fixes, but the people with physical access is separate from the people with knowledge of how to actually authenticate to the systems and people who know what to actually do, so there is now a logistical challenge with getting all that knowledge unified. Part of this is also due to lower staffing in data centers due to pandemic measures.”
Ramenporn also stated that it wasn’t an attack, but a mistaken configuration change made via a web interface. What really stinks, and why Facebook stayed down for hours, is that since both BGP and DNS were down, the “connection to the outside world is down, remote access to those tools don’t exist anymore, so the emergency procedure is to gain physical access to the peering routers and do all the configuration locally.” By this account, the technicians on site didn’t know how to do that, and the senior network administrators who did weren’t on site. This was, in short, one big mess.
Facebook was not immediately forthcoming about what had gone wrong and how it was fixed. Hours after Facebook and all its related services went down, Facebook CTO Mike Schroepfer tweeted: “We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible.” Afterward, as Facebook started to come up, he added, “Facebook services coming back online now – may take some time to get to 100%. To every small and large business, family, and individual who depends on us, I’m sorry.”
As a former network admin who worked on the internet at this level, I anticipated Facebook would be down for hours. I was also right that it would prove to be Facebook’s longest and most severe failure to date. I do wonder about exactly what went wrong and how it was fixed. Stay tuned. We’ll report on that as soon as we know more details.