Here's what Facebook says caused the outage
In a phrase, they accidentally shot themselves in the foot.
Facebook went offline for six hours on Monday, frustrating advertisers and businesses and confusing users across the globe. It wasn’t just Facebook’s main app, of course: the rest of the company’s portfolio, including Instagram and WhatsApp, also went down.
I covered the play-by-play of the afternoon of October 4th, which you can read here. At the time of writing, there wasn’t any official word from Facebook as to exactly what happened. The company did release a blog post late on Monday indicating that the cause was a malfunction in a routine maintenance session, but it was clear it would need to explain more to dispel speculation about hackers or other foul play.
Yesterday, Facebook’s engineering department published a blog post that went over the events of the outage and what led to it. In a similar fashion to yesterday’s newsletter, here’s an outline.
Facebook says the outage was, in fact, caused by a malfunction in a routine maintenance session. According to the company, it involved a certain command that was run which “unintentionally took down all the connections in [its] backbone network, effectively disconnecting Facebook data centers globally.” (For context, Facebook’s backbone network is essentially the foundation of Facebook’s online presence. Just like how you can’t have a house without a solid foundation, there’s no Facebook on the internet without a strong backbone network.)
The company says the command that was run had a bug in it, but the auditing system designed to catch bugs like this had a bug of its own. This ultimately led to the unintentional disconnection of Facebook’s backbone network, and it doesn’t seem the company was aware of the problem until people started complaining.
Those complaints didn’t just start on Twitter. People within Facebook were getting locked out of their offices, unable to access their computers and company software to assist with diagnostics. According to Facebook Engineering, the issues reached the point where none of their debugging tools were going to work since they’re all accessed remotely over Facebook’s servers.
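To make the idea of an audit step a little more concrete, here is a minimal sketch in Python. Facebook hasn't published how its auditing tool works, so everything here is hypothetical: the `Command` type, the `audit` and `run` functions, and the link names are all made up for illustration. The point is simply that a pre-flight check is supposed to refuse any command that would leave no backbone links standing.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Command:
    """Hypothetical maintenance command: the set of links it takes offline."""
    links_to_disable: frozenset


def audit(command: Command, active_links: frozenset, min_remaining: int = 1) -> bool:
    """Return True only if at least `min_remaining` links stay up afterwards."""
    remaining = active_links - command.links_to_disable
    return len(remaining) >= min_remaining


def run(command: Command, active_links: frozenset) -> frozenset:
    """Apply the command, but only if the audit (an assumed safeguard) approves it."""
    if not audit(command, active_links):
        raise RuntimeError("audit rejected command: it would disconnect the backbone")
    return active_links - command.links_to_disable


# Made-up link names, purely for illustration.
backbone = frozenset({"link-a", "link-b", "link-c"})
safe = Command(frozenset({"link-a"}))
unsafe = Command(backbone)

print(audit(safe, backbone))    # True
print(audit(unsafe, backbone))  # False
```

If a check like `audit` has a bug of its own (say, it always returns `True`), the unsafe command goes straight through, which is roughly the failure mode Facebook describes.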
“All of this happened very fast,” Santosh Janardhan wrote in the company’s blog. “And as our engineers worked to figure out what was happening and why, they faced two large obstacles: first, it was not possible to access our data centers through our normal means because their networks were down, and second, the total loss of DNS broke many of the internal tools we’d normally use to investigate and resolve outages like this.
“Our primary and out-of-band network access was down, so we sent engineers onsite to the data centers to have them debug the issue and restart the systems. But this took time, because these facilities are designed with high levels of physical and system security in mind. They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them. So it took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online.”
Luckily, Facebook was able to get its backbone network back online once engineers could work on the hardware in person. However, the company goes on to explain that it then faced another challenge: the sudden surge of traffic that was sure to come once services were switched back on.
“Once our backbone network connectivity was restored across our data center regions, everything came back up with it. But the problem was not over — we knew that flipping our services back on all at once could potentially cause a new round of crashes due to a surge in traffic. Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk.”
Facebook says it regularly rehearses for events like this in what it calls “storm” drills, which simulate a major system failure by taking a service, data center, or entire region offline. “Experience from these drills gave us the confidence and experience to bring things back online and carefully manage the increasing loads. In the end, our services came back up relatively quickly without any further systemwide failures.”
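Facebook doesn't say how it managed the increasing loads, but the general idea of bringing traffic back in stages rather than all at once can be sketched in a few lines of Python. This is only an illustration under my own assumptions: the `ramp_schedule` function, the step size, and the numbers are hypothetical and not taken from Facebook's post.

```python
from typing import List


def ramp_schedule(target: float, step: float) -> List[float]:
    """Return the load level at each stage, rising by at most `step` until `target`."""
    if target <= 0 or step <= 0:
        raise ValueError("target and step must be positive")
    target = float(target)
    levels = []
    current = 0.0
    while current < target:
        # Never overshoot the target on the final stage.
        current = min(current + step, target)
        levels.append(current)
    return levels


# Hypothetical example: restore 100 units of load, at most 30 units per stage.
print(ramp_schedule(100, 30))  # [30.0, 60.0, 90.0, 100.0]
```

Each stage gives operators a chance to check that power draw and caches are behaving before admitting more traffic, which is the kind of caution the quote above describes.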
Janardhan then proceeded to elaborate on how Facebook plans to learn from this outage:
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one. After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.
“We’ve done extensive work hardening our systems to prevent unauthorized access, and it was interesting to see how that hardening slowed us down as we tried to recover from an outage caused not by malicious activity, but an error of our own making. I believe a tradeoff like this is worth it — greatly increased day-to-day security vs. a slower recovery from a hopefully rare event like this. From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”
The Takeaway
Facebook should be way more careful with this stuff. There were two instances where what seemed to be minor bugs were capable of collapsing Facebook’s foundation, making it impossible for users to access its various apps and products and impossible for Facebook to remotely fix the problems. A mere coincidence between two coding errors was enough to shut down one of the world’s largest online presences for hours.
Let’s just hope Facebook learns its lesson. It has to: it’s responsible for billions of users and their interactions with one another.
Google’s Pixel 6 and 6 Pro are coming October 19th
Back in August, Google published images of the Pixel 6 and 6 Pro in an effort to get ahead of any major leaks or spoilers. Certain journalists were even given the opportunity to briefly interact with the devices. Since then, however, Google has been radio silent about the devices other than sharing additional photos and referring to them on social media. Yesterday, the company told the world when it would finally start talking about the devices in more detail.
Google has confirmed that a virtual press event will be held on October 19th where we’ll learn all there is to know about the new Pixel 6 series. The company is dubbing the event the “Pixel Fall Launch,” so it seems that this will be a naming scheme for Pixel launches in the future.
The event will start at 10 a.m. PT/1 p.m. ET. It isn’t clear if Google will announce anything else alongside the Pixel 6 line, so stay tuned for coverage.
Android 12 is kind of out, but not yet for Pixel owners
We’ve been waiting for Google to start rolling out Android 12 ever since it was announced, and that time has finally come… sort of.
Google has released the Android 12 source code to the Android Open Source Project (AOSP), which means the software is technically available for download. However, it has yet to formally roll out to any devices, including Pixel phones.
It isn’t clear exactly when Google intends to release Android 12 for Pixel owners. The company says it’ll roll out in the coming weeks, so take that for what you will. My guess would be a roll-out coinciding with the release of the Pixel 6 series, but I could be wrong.
Scroll is shutting down, will be baked into Twitter Blue
Scroll, the app that lets you view certain websites (like Matridox) without ads for $5 a month, is shutting down after being acquired by Twitter. In an email to users, the company confirmed its core functionality will be included in every Twitter Blue subscription, so at least it isn’t really going anywhere. That being said, it’s unclear how well website owners will be compensated under Twitter, so I guess you should stay tuned for details as they’re bound to surface. (I’ll also likely get the scoop a little early since I’m part of the program.)
More News
Google launches its new indoor, wired Nest Cam
There’s also a new Nest Doorbell and Nest Cam Floodlight up for sale. The Verge has a good review of the new Cam and Doorbell if you want the scoop on those devices.
Instagram is trying to clean up its video situation
It’s getting rid of the IGTV branding and is going with (drum roll please…) Instagram TV. This basically means any video (besides Reels) will count as “TV” on the app.
There’s a new TicWatch with a Snapdragon 4100, Wear OS 2, an AMOLED display, and a ~$370 price tag
I included the “~” next to the tag because that’s a rough translation of its 2,399 yuan price tag. This watch will only be sold in China, it seems, which is kind of a shame. Then again, it has Wear OS 2, so I guess it’s not that big a deal.
Find My support is now rolling out to Apple’s AirPods Pro and AirPods Max
The new update will let you find your headphones in Apple’s Find My app. 9to5Mac has a lot of extra details, if you’re interested.
T-Mobile cuts the price of its home 5G Wi-Fi service by $10 to $50/month
The Uncarrier is going back to its old ways here, as the service did cost $50/month (taxes and fees included) during its pilot days. Regardless, it’s always nice to see a home internet provider cut prices.
Twitter improves live video quality by not letting you invite friends before broadcasting
I’m not sure how this affects video quality, but Twitter insists it does, and for the better.
Amazon is coming for your fridge… by making its own
Do you want your refrigerator to run on Amazon Web Services? I can’t decide if I do.