The summer of 2024 was a scorcher, and it felt like everyone was sweating it out.
But for us, the heat wasn’t just an inconvenience; it was a serious threat to our Nuremberg data center.
On September 2nd, a massive thunderstorm rolled in, unleashing a torrent of lightning strikes across Franconia.
It was like a scene out of a disaster movie, except this wasn’t Hollywood – it was our reality.
The data center is equipped with lightning rods to protect it from direct strikes.
However, lightning can wreak havoc on power lines and infrastructure miles away, and that’s precisely what happened.
The sudden jolt of electricity triggered our systems to switch to the uninterruptible power supply (UPS) as a backup.
It’s a standard safety measure, and the UPS kicked in flawlessly, powering our systems for a few critical seconds.
But here’s where things took a turn for the worse.
The cooling system, which relies on a sophisticated network of chillers and air conditioning units, automatically shuts down when we switch to UPS power.
This is designed to prevent damage to the chillers during the power transition.
However, when the power returned, the cooling system refused to come back online.
It turned out that a control bus within the cooling system had malfunctioned, preventing it from automatically restarting.
We tried to restart it manually, but our efforts were futile.
It was a frustrating situation – a perfectly designed safety measure was now the culprit behind the outage.
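To illustrate the kind of safeguard that would have caught this failure earlier, here is a minimal sketch in Python. The sensor and paging functions are hypothetical placeholders, not our actual building-management code; the idea is simply to alert when grid power has returned but the chillers have not restarted within a grace period.

```python
import time

# Hypothetical hooks: in a real facility these would read from the building
# management system (UPS / transfer switch status, chiller status) and the
# on-call alerting tool. They are placeholders for this sketch.
def on_grid_power() -> bool:
    return True      # placeholder value

def chillers_running() -> bool:
    return False     # placeholder value

def page_on_call(message: str) -> None:
    print("ALERT:", message)  # placeholder: notify on-call staff

CHILLER_RESTART_GRACE_SECONDS = 120  # assumed grace period after power returns

def watch_cooling_restart(poll_interval: float = 5.0) -> None:
    """Alert if grid power is back but the cooling plant has not restarted."""
    power_restored_at = None
    while True:
        if on_grid_power():
            if power_restored_at is None:
                power_restored_at = time.monotonic()
            elapsed = time.monotonic() - power_restored_at
            if elapsed > CHILLER_RESTART_GRACE_SECONDS and not chillers_running():
                page_on_call(f"Chillers still offline {elapsed:.0f}s after grid power returned")
        else:
            power_restored_at = None  # still on UPS; the cooling shutdown is expected
        time.sleep(poll_interval)
```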
The Temperature Rises
We were now in a race against time.
As the servers continued to hum away, the data center temperature began to climb steadily.
Every minute felt like an eternity.
Our monitoring systems screamed with alarm, alerting us that the first server room was approaching a critical temperature threshold.
We had to act quickly, but our hands were tied.
We contacted the company that supplied the chiller units, but they were also facing a surge in service calls due to the widespread storm damage.
The technician arrived several hours later, and finally, after a hard reboot, the cooling system sprang back to life.
It was a relief to finally see the temperature start to decline, but the damage was done.
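As a rough illustration of the staged response described above (alarms first, then shutting servers down room by room before they overheat), here is a small Python sketch. The thresholds, room names, and shutdown behavior are assumptions for the example, not our production values.

```python
from dataclasses import dataclass

# Assumed thresholds (°C) for illustration; real values depend on the hardware and room design.
WARN_TEMP_C = 30.0      # raise an alarm
SHUTDOWN_TEMP_C = 38.0  # power servers down to prevent damage

@dataclass
class ServerRoom:
    name: str
    temperature_c: float
    servers_online: bool = True

def evaluate_room(room: ServerRoom) -> str:
    """Return the action to take for a room based on its current temperature."""
    if room.temperature_c >= SHUTDOWN_TEMP_C and room.servers_online:
        room.servers_online = False
        return f"{room.name}: shutting servers down at {room.temperature_c:.1f} °C"
    if room.temperature_c >= WARN_TEMP_C:
        return f"{room.name}: over warning threshold ({room.temperature_c:.1f} °C), alerting staff"
    return f"{room.name}: within normal range ({room.temperature_c:.1f} °C)"

if __name__ == "__main__":
    rooms = [ServerRoom("Room 1", 39.2), ServerRoom("Room 2", 31.5), ServerRoom("Room 3", 24.8)]
    for room in rooms:
        print(evaluate_room(room))
```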
A Timeline of Trouble
Here’s a breakdown of the events, with all timestamps in Central European Summer Time (CEST):
September 2, 2024:
- 07:14: The first power fluctuation is detected, and our systems seamlessly switch to UPS. The cooling systems shut down as designed.
- 07:14: The power supply from the grid returns after a brief three-second interruption. However, the cooling system remains offline.
- 07:14: Our monitoring systems trigger an alert, notifying the data center staff that the UPS is cycling and the chillers are down. The temperature inside the data center is starting to creep upward.
- 07:33: The first server room reaches a critical temperature level, and the data center staff begin to assess the situation.
- 08:13: The first Contabo systems are shut down to prevent damage from overheating.
- 08:41: The on-site team realizes it cannot manually restart the cooling systems and calls for a technician from the chiller unit company. The technician is dispatched but is already busy dealing with other businesses affected by the storm.
- 11:30 – 12:08: As the cooling system remains offline, the temperature continues to rise. Server rooms exceed the safe threshold one by one, and servers are shut down to prevent damage and potential data loss.
- 12:55: The rain finally subsides, allowing us to open the smoke protection flaps for ventilation. We also activate industrial ventilators to help move the hot air out more quickly. The temperature starts to drop, but it’s a slow process.
- 13:55: The core network links and components are restored, allowing communication to resume.
- 14:25: The cooling system finally restarts after the technician arrives and performs a hard reboot.
- 15:05: As the temperature continues to decrease, we start bringing the servers back online gradually.
- 15:30: The object storage cluster is back online.
- 15:42: Contabo systems, including the Customer Control Panel, are fully restored.
- 18:00: Around 95% of the servers are back online.
September 3, 2024:
- 19:55: The incident is officially declared resolved. Individual server issues are handled by the technical support team as usual.
A Lesson Learned
It was a harrowing experience, and looking back, it’s clear that even with our N+1 redundancy systems, we were not entirely prepared for an event of this magnitude.
Our redundancy design aims to ensure continuous operation even if one critical system fails, but the failure in the cooling system’s control bus was an unforeseen weakness.
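For readers unfamiliar with the term, N+1 means installing one more unit than the load strictly requires, so any single chiller can fail without losing cooling. The short Python sketch below, using made-up capacity figures, shows the arithmetic and why it offers no protection when a shared component such as the control bus takes every unit offline at once.

```python
import math

# Assumed figures for illustration only.
heat_load_kw = 400.0          # total heat the server rooms produce
chiller_capacity_kw = 150.0   # cooling each chiller provides

# N = units needed to carry the load; N+1 adds one spare for any single-unit failure.
n_required = math.ceil(heat_load_kw / chiller_capacity_kw)   # -> 3
n_plus_one = n_required + 1                                  # -> 4 chillers installed

def cooling_available(chillers_online: int) -> float:
    return chillers_online * chiller_capacity_kw

# Any one chiller failing still leaves enough capacity:
assert cooling_available(n_plus_one - 1) >= heat_load_kw

# But a failure in a shared dependency (the control bus) stops all of them,
# and no number of spare units helps:
assert cooling_available(0) < heat_load_kw
```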
Moving Forward
This incident has been a wake-up call, and we’re taking steps to ensure it doesn’t happen again:
1. Hub Europe Migration:
We’ve already begun migrating customers from Nuremberg to our newly built Hub Europe data center.
This facility is designed to meet the stringent Tier 3 standard for data centers, with enhanced fail-safe measures.
The goal is to provide our customers with the highest levels of reliability and uptime.
2. Enhanced Disaster Recovery:
We are thoroughly reviewing and strengthening our disaster recovery plans for Contabo systems like the Customer Control Panel and support channels.
These systems are vital to our customers, and we need to ensure their uninterrupted operation even during major incidents.
This includes exploring alternative locations and implementing more robust backup strategies.
3. Improved Incident Response:
We’re revising our incident response processes to make them faster, more effective, and more transparent to our customers.
We know that communication is critical during outages, and we’re committed to providing clear and timely updates.
This incident highlighted the importance of being prepared for the unexpected.
While we strive to provide seamless and reliable service, the reality is that disruptions can happen, and we need to be ready.
We are dedicated to learning from this experience and are continuously improving our infrastructure and processes to ensure the highest level of service for our customers.
Our commitment to quality is unwavering and we’ll keep you updated on our progress as we work towards a more robust and resilient future.