Facebook admits its engineers made mistake that caused massive outage

Facebook admits its engineers made mistake that caused $100m seven-hour outage and not hackers: WFH policy in data centers and AI blamed for long delay in fixing ‘faulty configuration’

  • Facebook said on Tuesday that there was ‘no malicious activity’ behind a historic outage to its services
  • Company suffered record seven-hour blackout on Monday after an update to its core servers went awry 
  • Outage was made worse by ‘lower staffing in data centers due to pandemic measures’, one insider said 
  • Bug also brought down internal messaging service, meaning remote-working engineers who knew how to fix the problem were unable to communicate with those inside the data center
  • Outage is thought to have cost Facebook Inc. up to $100million in direct revenue, with another $47billion wiped off its stock market value in the company’s second-worst day on record 
  • Mark Zuckerberg, who has championed work from home in the past, saw his fortune shrink by some $7billion  

Facebook has said there was ‘no malicious activity’ behind a seven-hour blackout that cost the company an estimated $100million in lost revenue, which experts and insiders say was exacerbated by remote working policies. 

The crisis came in a week of cascading disasters for Facebook, as a whistleblower testified before Congress, slamming the company’s artificial intelligence content algorithms as harmful and divisive. 

The company quietly updated a prior blog post on Tuesday to say that there was no malicious intent behind the historic outage, meaning that an employee error is most likely to blame. 

It is believed that a faulty update to Facebook’s Border Gateway Protocol (BGP) configuration, which governs how traffic is routed between large private networks and the public Internet, left apps and browsers unable to locate the company’s services. 

The global outage – which hit Facebook, Instagram, WhatsApp and Messenger on Monday – was caused when the faulty configuration disconnected its servers from the internet, meaning engineers had to travel to its Santa Clara data center to fix the glitch in-person.

But the repair was delayed, according to a purported insider, because of ‘lower staffing in data centers due to pandemic measures’, along with outages in physical access card systems and internal messaging services. 

Kieron Harding, an IT Infrastructure Engineer at GRC International Group, told DailyMail.com: ‘The nature of the problem meant Facebook would have needed network engineers to physically access their BGP routers – and due to the pandemic, some of the data centers quite possibly don’t have an engineer based on site, or someone who could have immediately started to work on the problem.’ 

‘One of the reasons why the outage lasted for as long as it did was because the misconfiguration of the BGP also affected Facebook’s physical door access systems – which shut down; meaning engineers couldn’t get into the buildings, or secure rooms, to start fixing the issues straightaway,’ said Harding. 

Facebook operates dozens of offices and data centers around the US. Monday’s outage reportedly knocked out physical access to the company’s facilities when key card systems went offline

The glitch, which has prompted calls for a break-up of big tech firms, also brought down messaging services that remote-working staff use to communicate, so those who knew how to fix the servers couldn’t get that information to the teams inside the data-center, the insider said. 

‘There are people now trying to gain access to… implement fixes, but the people with physical access is separate from the people with knowledge of how to authenticate the systems and people who know what to actually do, so there is now a logistical challenge,’ the purported insider said on Reddit.

Industry sources who have worked closely with the tech giant say Facebook is suffering from two major problems: staff working from home and over-reliance on artificial intelligence.

The social media site has been beset by bugs, glitches and AI issues for months – exacerbated by staff not being on premises to deal with or correct issues.

One source said that Facebook is simply unprepared to deal with emergencies and ‘is very weak on the technical side’. Another added Facebook is currently ‘a shambles’ and has been beset with tech problems ‘for months’.

They added: ‘They think they can do everything with AI – but their tech isn’t up to scratch. I’m inclined to think it’s because they’re WFH.’

Monday’s outage was partly to blame for a nose-dive in Facebook’s share price that saw $47billion wiped from its market value in its second-worst day ever on the stock market. The fall was also driven by a whistleblower testifying in Congress this week about the harms the site does to teenagers. 

Facebook shares rebounded on Tuesday, rising 2.3 percent in midday trading. 

In addition to the stock market slide during the outage, Facebook likely missed out on at least $67million in direct revenue and possibly as much as $102million during the outage – based on average hourly earnings across 2020 and projections of its 2021 hourly earnings from Q1 and Q2 results. 
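The arithmetic behind an estimate like this is simple: an hourly revenue figure multiplied by the length of the outage. The sketch below works backwards from the two figures quoted above to the hourly revenue each one implies. It assumes the outage lasted exactly seven hours and that revenue stopped completely for that time; the function names are illustrative and the underlying revenue inputs are not given in the reporting.

```python
def estimated_loss(hourly_revenue_usd: float, outage_hours: float) -> float:
    """Revenue missed if sales stop entirely for the whole outage."""
    return hourly_revenue_usd * outage_hours


def implied_hourly_revenue(loss_usd: float, outage_hours: float) -> float:
    """Hourly revenue that a given loss estimate implies."""
    return loss_usd / outage_hours


OUTAGE_HOURS = 7  # assumption: the "seven-hour" figure treated as exact

# The low and high estimates quoted in the article.
low = implied_hourly_revenue(67_000_000, OUTAGE_HOURS)
high = implied_hourly_revenue(102_000_000, OUTAGE_HOURS)

print(f"Implied hourly revenue: ${low / 1e6:.1f}m to ${high / 1e6:.1f}m")
```

On those assumptions, the quoted range corresponds to roughly $9.6million to $14.6million in revenue per hour.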

A person claiming to be a Facebook employee said on Reddit that high numbers of staff working from home made the problem worse. The account was later deleted 

Users around the world reported problems with Facebook, Instagram and WhatsApp on Downdetector

Mark Zuckerberg – who lost around $7billion in stock value amidst the carnage – has previously vowed to make work from home a permanent part of Facebook, telling staff back in June that ‘anyone whose role can be done remotely can request remote work.’ 

The multi-billionaire said he plans to spend around half his time working remotely in 2022, and predicted that half of his staff could be permanently off-site by 2030.

Facebook’s offices are currently open but only to 25 per cent capacity, after plans to open fully by October were pushed back to at least January 2022 amid the spread of the Delta Covid variant. 

Of the staff who are not currently in the office, it is not clear how many will become permanent remote workers.

But a Facebook executive previously told the Wall Street Journal that the company has approved 90 per cent of WFH requests. The only caveat is that salaries may be cut to reflect the locations where people are actually working, as opposed to where the office is based.

Data centre staff are among those who cannot request a permanent WFH.  

Facebook’s problems began around midday Eastern Time (5pm BST) on Monday, shortly after its servers were updated, and lasted until around 5.45pm (10.45pm BST) when the servers came back online. It took several more hours for all users to be able to access Facebook’s sites and apps. 

Following Monday’s outage, Zuckerberg issued a personal apology to Facebook users – telling them ‘sorry for the disruption’ while adding: ‘I know how much you rely on our services.’ 

But his message was immediately attacked from all sides, with those who use Facebook for business saying he failed to take the issue seriously while casual users accused him of ‘making yourself more important than you are’.

Twitter founder Jack Dorsey appeared to make light of Facebook’s plight on Monday. Responding to a post which appeared to show how the facebook.com domain is for sale as a result of the outage, he jokingly asked: ‘How much?’

A Facebook staff member reportedly accidentally deleted large sections of the code which keeps the website online.

One tweet read: ‘So, someone deleted large sections of the routing….that doesn’t mean Facebook is just down, from the looks of it….that means Facebook is GONE’

Facebook shares are down by more than 6 percent from last week as a result of the outage on Monday

Still others said they had enjoyed the outage, and were planning to spend more time off social media in the future. ‘Life was way simpler without these services,’ wrote one. 

John Graham-Cumming, the chief technology officer of web security firm Cloudflare, said Facebook made a series of updates to its border gateway protocol (BGP) which caused it to ‘disappear’ from the internet. 

The BGP allows for the exchange of routing information on the internet and takes people to the websites they want to access.  

Dane Knecht, senior vice president of the firm, said earlier the Facebook Border Gateway Protocol (BGP) routes had been ‘withdrawn from the internet.’ 

Cybersecurity expert Kevin Beaumont wrote on Twitter: ‘This one looks like a pretty epic configuration error, Facebook basically don’t exist on the internet right now. Even their authoritative name server ranges have been BGP withdrawn.’   

Facebook, Instagram and WhatsApp were all brought down for almost seven hours yesterday in a massive global outage. The US tech giant said the problem was caused by a faulty update that was sent to its core servers, which effectively disconnected them from the internet

WhatsApp, Instagram and Facebook Messenger run on a shared back-end infrastructure, creating a ‘single point of failure’, according to experts.

It wasn’t just the main Facebook apps that went down. Other services, including Facebook Workplace and the Oculus website, were also affected. 

The EU’s competition commissioner said it shows why large tech firms should be broken up to avoid a similar failure of multiple platforms at once.

EU competition commissioner Margrethe Vestager said the incident highlighted the negative impact of big tech firms controlling large swathes of the online world. 

‘We need alternatives and choices in the tech market, and must not rely on a few big players, whoever they are,’ she wrote on Twitter. 

The dominance of a handful of large social media and internet companies has come under scrutiny from competition watchdogs on a number of issues, with many campaigners in the UK, Europe and US urging governments and regulators to take steps to break up larger firms to prevent monopolies being created.

IT experts have also called on the tech industry to come up with better systems to prevent a single error from having such a wide impact.

Ms Vestager, who is also the European Commission’s executive vice-president for a Europe fit for the digital age, added that the incident showed it was also sometimes good to step away from social media and talk to people ‘offline’. 

Facebook’s Chief Technology Officer, Mike Schroepfer, offered his ‘sincere apologies’ for the outage on Monday afternoon.  The scandal-hit company’s shares had dipped by 5 percent on Monday amid the outage and after a whistleblower went public on Sunday night with claims that the firm prioritises ‘growth over safety’. 

There have been a number of social media outages in recent months, with Instagram going down for 16 hours just last month, and all Facebook platforms going offline in June. 


The cause of the outage remains unconfirmed, and it is unclear whether all of the failures are linked. But not long before Facebook’s services went down, entries for Facebook and Instagram were removed from the Domain Name System (DNS) the company uses. 

The DNS is essentially the internet’s directory. Whenever someone opens a link or an app, their device has to look up the address of the service they are trying to reach in the DNS before it can connect to it. 
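For readers who want to see such a lookup in practice, here is a minimal sketch using Python’s standard library. It asks whatever resolver the local network has configured for the addresses behind a name; the function name `lookup` is just an illustration, and the answers will differ by network and over time.

```python
import socket
from typing import List


def lookup(hostname: str) -> List[str]:
    """Return the IP addresses the local resolver reports for a hostname.

    An empty list means the name could not be resolved, which is what
    devices experienced for Facebook's domains during the outage.
    """
    try:
        results = socket.getaddrinfo(hostname, None)
    except socket.gaierror:
        return []
    # Each result ends with a socket address whose first item is the IP.
    return sorted({info[4][0] for info in results})


# Example call (needs a network connection; results vary):
# lookup("facebook.com")
```

Returning an empty list rather than raising an error is a choice made for this sketch, so that a failed lookup is easy to check for.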

Major DNS providers are Google, Amazon and Cloudflare. It’s unclear if all of the sites and services that went down on Monday use the same DNS or not. 

A similar outage at cloud company Akamai Technologies Inc took down multiple websites in July.

Cloudflare’s Mr Graham-Cumming tweeted on Monday that Facebook accidentally ‘disappeared’ from the internet after making a ‘flurry’ of updates to its BGP – Border Gateway Protocol.   

‘Between 15:50 UTC and 15:52 UTC [4.50-4.52pm UK time] Facebook and related properties disappeared from the Internet in a flurry of BGP updates,’ he said.   

When sites go down because of failures in DNS systems, Cloudflare tries to repair them. 

Usman Muzaffar, SVP of engineering at Cloudflare, explained to DailyMail.com: ‘Humans access information online through domain names, like facebook.com and DNS converts it into numbers, called an IP address, computers use. 


The Domain Name System, or DNS, is the directory of the internet.

Whenever you click on a link, send an email, open a mobile app, often one of the first things that has to happen is your device needs to look up the address of a domain. 

There are two sides of the DNS network: the authoritative side, ie webpages and other content, and the resolver side, devices that are trying to access this content.

Every domain needs to have an authoritative DNS provider, servers which store DNS records. Amazon, Cloudflare and Google are among the bigger names in authoritative DNS server provision. 

On the other side of the DNS system are resolvers. Every device that connects to the Internet needs a DNS resolver. 

By default, these resolvers are automatically set by whatever network you’re connecting to. 

So, for most Internet users, when they connect to an ISP, or a WiFi hot spot, or a mobile network, the network operator will dictate what DNS resolver to use.

The problem is that these DNS services are often slow and don’t respect your privacy. 

What many Internet users don’t realise is that even if you’re visiting a website that is encrypted, indicated by the green padlock in your browser’s address bar, that doesn’t keep your DNS resolver from knowing the identity of all the sites you visit. 

That means, by default, your ISP, every WiFi network you’ve connected to, and your mobile network provider have a list of every site you’ve visited while using them. 

‘From what we understand of the actual issue —it is a globalized BGP configuration issue. In our experience, these usually are mistakes, not attacks.

‘Border Gateway Protocol (BGP) is the routing protocol for the Internet. Much like the post office processing mail, BGP picks the most efficient routes for delivering Internet traffic. 

‘Today, the directions for how to get to Facebook’s DNS server’s addresses weren’t available (and seem to still be unavailable). Without being able to contact the DNS servers, visitors trying to reach a Facebook property, like facebook.com, will not get an answer and so the page won’t load.’
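The chain of failure described here can be sketched as a toy model. This is not how Facebook’s network is built: the domain, the addresses (taken from ranges reserved for documentation) and the function names are all made up for illustration. It shows only the dependency itself: once the route to a nameserver is withdrawn, the name stops resolving even though the web servers’ own route is still announced.

```python
import ipaddress
from typing import Optional, Set

# Hypothetical placeholders: one domain, its nameserver and its web server.
NAMESERVER_FOR = {"social.example": "192.0.2.53"}
RECORDS_AT = {"192.0.2.53": {"social.example": "198.51.100.10"}}


def reachable(address: str, announced: Set[str]) -> bool:
    """True if some announced prefix covers the address."""
    ip = ipaddress.ip_address(address)
    return any(ip in ipaddress.ip_network(prefix) for prefix in announced)


def resolve(domain: str, announced: Set[str]) -> Optional[str]:
    """Return the domain's IP, or None if its nameserver cannot be reached."""
    nameserver = NAMESERVER_FOR.get(domain)
    if nameserver is None or not reachable(nameserver, announced):
        return None
    return RECORDS_AT[nameserver].get(domain)


# Prefixes currently announced to the rest of the internet.
announced = {"192.0.2.0/24", "198.51.100.0/24"}
before = resolve("social.example", announced)

announced.discard("192.0.2.0/24")  # the nameserver's route is withdrawn
after = resolve("social.example", announced)

print(before, after)  # 198.51.100.10 None
```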

According to Reuters news agency, security experts tracking the situation said the outage could have been triggered by a configuration error, which could be the result of an internal mistake, though sabotage by an insider would be theoretically possible.

An outside hack was viewed as less likely. A massive denial-of-service attack that could overwhelm one of the world’s most popular sites, on the other hand, would require either coordination among powerful criminal groups or a very innovative technique.

Shares of Facebook, which has nearly 2 billion daily active users, fell 5.5 per cent in afternoon trading on Monday, inching towards its worst day in nearly a year. 

It means that the company’s founder Mark Zuckerberg – who owns around 14 per cent of the firm – has seen his wealth plummet by nearly $7billion in a matter of hours, Bloomberg reported. 

Some users of UK phone network EE were also reporting that they were having difficulty accessing mobile internet services. However, the firm told MailOnline that there were no problems with the network.

Cyber security specialist Jake Moore said there is a ‘chance’ the issue affecting the firms could be related to a cyber attack.

He said: ‘There have been many reports and I’m struggling to find out exactly what has happened- I’m reading it could be DNS related, which means there is an issue with the connection not knowing where to go to your device.

‘It could well be a human error or a software bug lurking in the shadows but whatever it is Facebook needs to do its best to mitigate the problem of causing more panic about this.

‘The biggest problem is fears over a cyber attack but as we saw from Fastly in the summer I would hedge my bets on that not being the case as we’re talking about one of the biggest companies in the world, but there’s always a chance.’

Apologising on Twitter for the outage, Mr Schroepfer said: ‘Sincere apologies to everyone impacted by outages of Facebook powered services right now. We are experiencing networking issues and teams are working as fast as possible to debug and restore as fast as possible.’

Facebook was already in the throes of a separate major crisis after whistleblower Frances Haugen, a former Facebook product manager, provided The Wall Street Journal with internal documents that exposed the company’s awareness of harms caused by its products and decisions. 

Haugen went public on CBS’s ’60 Minutes’ program Sunday and is scheduled to testify before a Senate subcommittee Tuesday.

Haugen had also anonymously filed complaints with federal law enforcement alleging Facebook’s own research shows how it magnifies hate and misinformation and leads to increased polarization. It also showed that the company was aware that Instagram can harm teenage girls’ mental health.

The Journal’s stories, called ‘The Facebook Files,’ painted a picture of a company focused on growth and its own interests over the public good. Facebook has tried to play down the research. 

Former Deputy Prime Minister Nick Clegg, the company’s vice president of policy and public affairs, wrote to Facebook employees in a memo Friday that ‘social media has had a big impact on society in recent years, and Facebook is often a place where much of this debate plays out.’

Earlier on Twitter, Facebook communications executive Andy Stone said the company was aware some people were having trouble accessing Facebook apps and products.

‘We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience,’ the executive said in a tweet.

Soon after the first report came through, the hashtag #facebookdown was trending on Twitter, with users worldwide reporting issues connecting. 

The hashtag #instagramisdown and ‘WhatsApp’ were both also trending on Twitter, with a number of users saying they checked their internet connection when they couldn’t get on Facebook. 

Instagram comms tweeted: ‘Instagram and friends are having a little bit of a hard time right now, and you may be having issues using them. Bear with us, we’re on it!’


They’re some of the most popular social media apps around the world, but Facebook, Instagram, WhatsApp and Facebook Messenger all crashed on Monday afternoon, with users logging reports of the Facebook outage on DownDetector

DownDetector also showed that problems with WhatsApp began being reported just before 5pm on Monday

Facebook Messenger’s outage was also reported on DownDetector at a similar time on Monday

At 11.33pm on Monday, Facebook tweeted to apologise about the global outage of its services. They added that they were ‘happy to report’ that they were coming back online


The various Facebook-owned platforms, including WhatsApp, Instagram and Facebook itself, took to Twitter to explain the issues and say they were working on a solution


Last month, a technical issue with Facebook-owned Instagram caused an outage that plagued users around the world for 16 hours.

Problems started just after 8am on Thursday. About 18 hours later, at 2am on Friday, Instagram announced the problem had been fixed.

However, the last time Facebook, Instagram and WhatsApp went down at the same time was in June. 

In June more than a thousand people in countries including the United States, Morocco, Mexico, Bolivia and Brazil reported outages.

There were also two Facebook platform outages in March, with Instagram down on March 30, and all three down on March 19. 

WhatsApp tweeted: ‘We’re aware that some people are experiencing issues with WhatsApp at the moment. We’re working to get things back to normal and will send an update here as soon as possible. Thanks for your patience!’

Even Oculus, the virtual reality gaming platform owned by Facebook, was having problems, with one user describing their headset as being ‘like a paperweight’.

Oculus tweeted: ‘We’re aware that some people are having trouble accessing our apps and products. We’re working to get things back to normal as quickly as possible, and we apologize for any inconvenience.’ 

Every time Facebook and Instagram are down, it draws people to Twitter. 

On Monday, one user shared a meme of Homer Simpson jumping from his house to Moe’s bar, with the Twitter logo over the door.

Even Google got in on the action, tweeting: ‘Everyone going to Google to check if Instagram is down.’ 

There were multiple jokes along the same lines, with one showing a fast track race and the caption: ‘Me and my friends running to twitter to see if fb, whatsapp and insta are down.’ 

It is unclear what has caused the issue, although it has disrupted all Facebook-owned platforms, including the Oculus virtual reality gaming website and Facebook Workplace.

NetBlocks, which tracks internet outages, tweeted: ‘Facebook, WhatsApp, Instagram and Messenger are currently experiencing outages in multiple countries.’

It added that the incident was ‘not related to country-level internet disruptions or filtering.’

When users attempt to visit Instagram in a desktop web browser, the site returns a ‘5xx Server Error’, while Facebook simply says ‘this site can’t be reached.’ 

The last major outage of Facebook platforms was in June 2021, when people in the US, Morocco, Mexico and Brazil all reported not being able to connect. 

However, there were also problems last month, when Instagram went down for a whopping 16 hours. 

Jake Moore, cybersecurity specialist at ESET, said outages are increasing in volume and are becoming harder to predict.

He said initially, a major problem with a website or app can point towards a cyber attack – but that can add to confusion and be misleading.

‘With recent issues such as what happened with Fastly [the web service platform that saw a major global outage on June 8] it highlights the power of an undiscovered software bug or even human error,’ Moore explained.

‘Although these are increasing in frequency and require more failsafes in place, predicting these issues is increasingly more difficult as it was never thought possible before. Luckily, most outages only last under an hour.’

This latest outage highlights the major issues with using centralised systems, according to Matthew Hodgson, Co-founder and CEO of Element and Technical Co-founder of Matrix.

‘The ongoing outage of WhatsApp, Instagram and Facebook (including Facebook Messenger and Facebook Workplace) highlights that global outages are one of the major downsides of a centralised system,’ he said.

Centralised apps, like having a single back end for Facebook products, means putting ‘all the eggs in one basket,’ Hodgson explained.

‘Decentralised systems are far more reliable. There’s no single point of failure so they can withstand significant disruption and still keep people and businesses communicating.’
