Facebook’s servers were down globally for nearly six hours on Monday. Its internal systems were down, too. Experts point to an update to its Border Gateway Protocol (BGP) routes as a possible cause of the outage.
Story so far: Facebook Inc.’s services suffered a massive outage on Monday that lasted as long as six hours. It kept users from accessing the company’s core platforms, including WhatsApp, Instagram and Messenger. It also disrupted businesses around the world that rely on the social network’s tools and services. As Facebook manages its own internal tools and email service, the company’s employees were also unable to access work-related applications.
While outages of apps and websites are common, an hours-long global disruption is rare. Facebook said late Monday (U.S. time) that “configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted” network traffic. The company’s head of engineering and infrastructure, Santosh Janardhan, noted in a blog post that the outage also affected Facebook’s internal systems, making it harder to restore access.
Facebook’s services are now back online, but after an outage like this, it could take several more hours for the system and network to be completely restored. In the meantime, networking experts are pointing to an update to Facebook’s Border Gateway Protocol (BGP) routes as a possible cause of the outage.
BGP at the heart of the outage
On Monday, it was as if someone had given Facebook a magic potion that made it virtually invisible. That’s why, when users tried logging into the company’s applications and websites, they couldn’t find the pages. Their browsers returned the error ‘This site can’t be reached’.
To understand why this happened, one needs to know that the Internet is simply a network of networks. All of these networks are bound together by the Border Gateway Protocol (BGP), which lets one network tell the others that it is available and which addresses can be reached through it. Facebook is one such network, and it advertises its presence to other networks. These BGP advertisements enable Internet service providers across the world to route web traffic to the right network.
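The idea of advertising and withdrawing routes can be sketched in a few lines of Python. This is only a toy model, not how real BGP routers are implemented: the class name RoutingTable, the network name "example-network" and the address block 192.0.2.0/24 (a range reserved for documentation) are all assumptions made for illustration.

```python
from ipaddress import ip_address, ip_network


class RoutingTable:
    """Toy view of the routes one network has learned from its neighbours."""

    def __init__(self):
        # Maps an address block (prefix) to the network that announced it.
        self.routes = {}

    def announce(self, prefix, origin):
        """Record that `origin` says it can deliver traffic for `prefix`."""
        self.routes[ip_network(prefix)] = origin

    def withdraw(self, prefix):
        """Forget a previously announced prefix."""
        self.routes.pop(ip_network(prefix), None)

    def lookup(self, address):
        """Return the network to send traffic to, or None if no route exists."""
        addr = ip_address(address)
        matches = [prefix for prefix in self.routes if addr in prefix]
        if not matches:
            return None
        # Prefer the most specific (longest) matching prefix.
        best = max(matches, key=lambda prefix: prefix.prefixlen)
        return self.routes[best]


table = RoutingTable()
table.announce("192.0.2.0/24", "example-network")
print(table.lookup("192.0.2.53"))  # example-network

table.withdraw("192.0.2.0/24")
print(table.lookup("192.0.2.53"))  # None: the address is now unreachable
```

Once the only route covering an address is withdrawn, other networks have nowhere to send traffic for it, even though the servers behind that address may still be running.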
In the case of Facebook, a BGP update withdrew the routes to its online properties, making them unavailable to the world’s computers. This meant the social network’s Domain Name System (DNS) servers could not be reached from other networks, and so from the rest of the Internet.
Web infrastructure firm Cloudflare keeps track of BGP updates and announcements at a global scale. It has an overall view of how the Internet is connected and where web traffic flows from. Any time a change is made to a network’s BGP routes, be it an announcement or a withdrawal, a message is sent to other routers informing them of the update.
Normally, this update activity is “fairly quiet” for Facebook as the company doesn’t make a lot of changes minute by minute, Cloudflare said in its blog. “But at around 15:40 GMT we saw a peak of routing changes from Facebook. That’s when the trouble began.”
Knock-on effect on DNS
The web infrastructure company then split Facebook’s route announcements from its route withdrawals to get a clearer picture of what happened. It noticed that routes had been withdrawn, taking Facebook’s DNS servers offline. The withdrawals meant Facebook and its websites were effectively out of sight of the world’s computers.
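Separating announcements from withdrawals amounts to counting each kind of update per time slot. The sketch below shows the idea only; the function name split_updates, the tuple layout and the sample updates are invented for illustration and are not Cloudflare’s data or tooling.

```python
from collections import Counter


def split_updates(updates):
    """Count announcements and withdrawals separately for each minute.

    Each update is a (minute, kind, prefix) tuple, where kind is either
    "announce" or "withdraw". This layout is an assumption for the sketch.
    """
    announcements = Counter()
    withdrawals = Counter()
    for minute, kind, _prefix in updates:
        if kind == "announce":
            announcements[minute] += 1
        elif kind == "withdraw":
            withdrawals[minute] += 1
        else:
            raise ValueError(f"unknown update kind: {kind!r}")
    return announcements, withdrawals


# Made-up sample updates, using address blocks reserved for documentation.
sample = [
    ("15:39", "announce", "192.0.2.0/24"),
    ("15:40", "withdraw", "192.0.2.0/24"),
    ("15:40", "withdraw", "198.51.100.0/24"),
    ("15:40", "withdraw", "203.0.113.0/24"),
]

announced, withdrawn = split_updates(sample)
print(announced["15:40"], withdrawn["15:40"])  # 0 3
```

A minute in which withdrawals far outnumber announcements is the kind of pattern that would stand out against an otherwise quiet baseline.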
This matters because DNS works like a translation service that turns domain names into IP addresses. When a DNS resolver fails to translate a domain name into an IP address, people cannot reach that website, and the webpage won’t load.
To answer a query, a DNS resolver usually first checks whether it has an answer in its cache and uses that to establish contact. If it does not, it tries to connect to the domain’s nameservers, which are usually hosted by the network itself (Facebook in this case).
Both these mechanisms failed in Facebook’s case as the social network stopped announcing its routes through BGP, making it impossible for everyone’s DNS resolvers to connect to Facebook’s nameservers.
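The two lookup steps, and the way both can fail, can be sketched as a toy resolver in Python. The class name ToyResolver, the injected query_nameserver function, the 60-second lifetime and the address 192.0.2.10 are assumptions made for illustration; real resolvers are considerably more involved.

```python
import time


class ToyResolver:
    """Minimal resolver: answer from cache if possible, else ask the nameserver."""

    def __init__(self, query_nameserver, clock=time.monotonic):
        # query_nameserver(name) returns (address, ttl_seconds) and raises
        # ConnectionError when the nameserver cannot be reached.
        self.query_nameserver = query_nameserver
        self.clock = clock
        self.cache = {}  # name -> (address, expiry time)

    def resolve(self, name):
        cached = self.cache.get(name)
        if cached is not None and cached[1] > self.clock():
            return cached[0]  # step 1: a fresh cached answer
        try:
            address, ttl = self.query_nameserver(name)  # step 2: ask the nameserver
        except ConnectionError:
            # Both steps failed. A real resolver would typically report SERVFAIL.
            return None
        self.cache[name] = (address, self.clock() + ttl)
        return address


# Simulated clock and nameserver so the example runs without network access.
now = [0.0]
routes_announced = [True]


def fake_nameserver(name):
    if not routes_announced[0]:
        raise ConnectionError("no route to nameserver")
    return "192.0.2.10", 60


resolver = ToyResolver(fake_nameserver, clock=lambda: now[0])
print(resolver.resolve("example.com"))  # 192.0.2.10 (from the nameserver)

routes_announced[0] = False             # routes withdrawn
print(resolver.resolve("example.com"))  # 192.0.2.10 (still in the cache)

now[0] = 61.0                           # cached answer has expired
print(resolver.resolve("example.com"))  # None
```

A cached answer only helps until it expires. After that, every lookup depends on reaching the nameservers, which is impossible once their routes are gone.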
Now, if Facebook and its line of apps don’t work, people have a natural tendency to keep checking the service. That’s what happened on Monday. User traffic to the site jumped. The apps, too, kept retrying until a connection was established. Together, these two behaviours pushed up the number of queries to DNS resolvers.
According to Cloudflare, queries to DNS resolvers worldwide jumped as much as 30 times as a result of the Facebook outage, contributing to latency and timeout issues. The disappearing act also pushed up Internet traffic to rival platforms, particularly Twitter.
The outage comes at a critical time for the social media giant. On Sunday, a Facebook whistleblower, Frances Haugen, went public about the company’s potential to harm teens’ mental health.