Let me tell you, the Nuremberg data center outage back in September 2024 was a real doozy.
You see, the entire facility went down, impacting our VPS, dedicated servers, and object storage.
Talk about a nightmare!
What Went Wrong?
The root cause? A severe thunderstorm with lightning strikes around Nuremberg, which caused a voltage fluctuation on the local electricity grid.
This triggered our systems to switch to the UPS, our uninterruptible power supply, as a safety measure.
It’s a standard process, but here’s the catch: the UPS was active for a mere three seconds before the primary power supply kicked back in.
However, this brief power disruption was enough to cause our cooling system to shut down.
This was due to a malfunction in the control bus, which prevented the system from restarting automatically.
We tried to restart the chillers manually, but to no avail.
It wasn’t until we brought in a technician from the company that supplied the chilling units, who performed a hard reboot, that the cooling system finally came back online.
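For the technically curious, here’s a rough idea of the kind of watchdog that can catch this failure pattern: correlate a UPS switchover with the chiller controller’s heartbeat, and escalate if cooling doesn’t recover on its own within a grace period. This is only a hypothetical Python sketch, not our actual facility tooling; `read_ups_state`, `read_chiller_heartbeat`, and `page_oncall` are placeholder names for whatever a building management system really exposes.

```python
import time

# Hypothetical sketch: correlate a UPS switchover with the chiller
# control bus heartbeat, and escalate if cooling does not recover.

CHILLER_RECOVERY_GRACE_S = 120  # assumed grace period for a chiller restart


def read_ups_state() -> str:
    """Return 'grid' or 'ups' (placeholder for the real BMS feed)."""
    raise NotImplementedError


def read_chiller_heartbeat() -> bool:
    """True if the chiller control bus is responding (placeholder)."""
    raise NotImplementedError


def page_oncall(message: str) -> None:
    """Notify the on-call engineer (placeholder for the real alerting path)."""
    raise NotImplementedError


def watch_cooling_after_ups_event() -> None:
    last_state = read_ups_state()
    while True:
        state = read_ups_state()
        if last_state == "grid" and state == "ups":
            # A switchover happened, even a brief one. Give the chillers
            # a grace period, then verify the control bus is alive again.
            deadline = time.monotonic() + CHILLER_RECOVERY_GRACE_S
            while time.monotonic() < deadline:
                if read_chiller_heartbeat():
                    break
                time.sleep(5)
            else:
                page_oncall("Chillers did not recover after UPS switchover")
        last_state = state
        time.sleep(1)
```

The point isn’t the specific code, it’s the correlation: a three-second UPS event on its own is routine, but a UPS event followed by a silent chiller controller is not.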
It Was Like a Chain Reaction
Let’s break it down:
The Temperature Climbed
The air conditioning system was essential to keeping our servers and network equipment running smoothly.
When the chillers went down, the temperature in the data center began to rise rapidly.
Our monitoring system sent alerts, but the root of the problem wasn’t immediately apparent.
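One takeaway from those early minutes: alerting on how fast a room is warming up, not just on an absolute threshold, can flag a dead cooling loop sooner. Here’s a small, hypothetical Python sketch of that idea; `read_room_temperature` and `raise_alert` are placeholders, and the window and limits are illustrative assumptions, not our real settings.

```python
from collections import deque
import time

# Hypothetical sketch: alert on a fast temperature rise, not only on an
# absolute threshold, so a failed cooling loop is flagged early.

WINDOW_S = 300       # look at the last 5 minutes of readings (assumption)
RISE_ALERT_C = 3.0   # alert if a room warms by 3 °C within the window (assumption)


def read_room_temperature(room: str) -> float:
    raise NotImplementedError  # placeholder for the real sensor feed


def raise_alert(message: str) -> None:
    raise NotImplementedError  # placeholder for the real alerting path


def monitor_room(room: str) -> None:
    history: deque[tuple[float, float]] = deque()  # (timestamp, temperature)
    while True:
        now = time.monotonic()
        temp = read_room_temperature(room)
        history.append((now, temp))
        # Drop readings that fall outside the sliding window.
        while history and now - history[0][0] > WINDOW_S:
            history.popleft()
        oldest_temp = history[0][1]
        if temp - oldest_temp >= RISE_ALERT_C:
            raise_alert(f"{room}: +{temp - oldest_temp:.1f} °C in "
                        f"{WINDOW_S // 60} min, check cooling")
        time.sleep(30)
```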
Systems Began to Shut Down
As the temperature continued to climb, we had no choice but to shut down our systems to prevent damage.
This included our Customer Control Panel, our support channels, and, most importantly, our customers’ servers.
We simply couldn’t risk losing valuable data.
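If you’re wondering what a protective shutdown looks like in practice: the idea is to power things down in a deliberate order, internal systems first and customer servers last, and only once a room actually crosses its critical threshold, so every machine gets a clean shutdown instead of a heat-induced crash. Here’s a minimal, hypothetical sketch of that ordering; the hostnames, the temperature limit, and the helper functions are made up purely for illustration.

```python
# Hypothetical sketch of a staged, priority-ordered emergency shutdown.
# Hostnames, the temperature limit, and the helpers are illustrative.

CRITICAL_TEMP_C = 35.0  # assumed critical room threshold

SHUTDOWN_STAGES = [
    # (priority, description, hosts) - lower priority shuts down first
    (1, "internal tooling and control panel", ["panel-01", "panel-02"]),
    (2, "support channels", ["support-01"]),
    (3, "customer servers", ["vps-0001", "vps-0002", "dedi-0001"]),
]


def room_temperature(room: str) -> float:
    raise NotImplementedError  # placeholder for the real sensor feed


def shutdown_host(host: str) -> None:
    raise NotImplementedError  # placeholder for a clean OS-level shutdown


def emergency_shutdown(room: str) -> None:
    if room_temperature(room) < CRITICAL_TEMP_C:
        return  # still inside the safe envelope, nothing to do
    for priority, description, hosts in sorted(SHUTDOWN_STAGES):
        print(f"Stage {priority}: shutting down {description}")
        for host in hosts:
            shutdown_host(host)  # a clean shutdown protects on-disk data
```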
The Timeline: A Race Against Time
The situation was tense and the clock was ticking.
Here’s a detailed timeline of the event:
- Sep 2, 2024, 07:14 AM: Power fluctuations detected; the power supply switched to UPS. Servers continued to run, but the cooling system shut down.
- Sep 2, 2024, 07:14 AM: Power supply from the grid was reestablished after 3 seconds, but the cooling system remained offline. Our incident response team was notified and began working on a solution.
- Sep 2, 2024, 07:14 AM: Our monitoring system sent an alert about the UPS switching and the chillers being down. We knew something was wrong, and the temperature started to increase.
- Sep 2, 2024, 07:33 AM: Another alert triggered, indicating that the critical temperature threshold in the first server room had been reached. The situation was becoming serious.
- Sep 2, 2024, 08:13 AM: We had to take action. The first Contabo systems were shut down.
- Sep 2, 2024, 08:41 AM: Our on-site team assessed the situation and realized they couldn’t manually turn on the cooling system. We immediately contacted a technician from the cooling systems company, but they weren’t readily available as they were already dealing with similar incidents in the area.
- Sep 2, 2024, 11:30 AM – 12:08 PM: The temperature continued to rise. One server room after another exceeded the safe threshold, forcing us to shut down servers to protect them from damage.
- Sep 2, 2024, 12:55 PM: Fortunately, the rain finally stopped. This allowed us to open the smoke protection flaps for ventilation. We also activated industrial ventilators to move the hot air out faster. We could see the temperature starting to drop, a glimmer of hope.
- Sep 2, 2024, 01:55 PM: We were able to restore our core network links and components.
- Sep 2, 2024, 02:25 PM: The technician finally arrived and was able to restart the cooling system. A huge sigh of relief!
- Sep 2, 2024, 03:05 PM: As the temperature continued to decrease, we were able to gradually bring our servers back online.
- Sep 2, 2024, 03:30 PM: The object storage cluster was back up and running.
- Sep 2, 2024, 03:42 PM: Contabo systems, including the Customer Control Panel, were fully restored.
- Sep 2, 2024, 06:00 PM: 95% of our servers were back online.
- Sep 3, 2024, 07:55 PM: The incident was officially resolved. However, individual reports of virtual and dedicated server issues were handled by our technical support team as usual.
What Went Wrong with Redundancy?
You might be thinking, “Contabo is known for its redundancy! How could this happen?” That’s a great question.
You see, while our Nuremberg data center has N+1 redundancy for critical systems like chilling units, power, and internet connectivity, the malfunction in the control bus effectively nullified that redundancy.
We had backups in place, but they couldn’t overcome this unexpected issue.
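To make that concrete: N+1 only protects against a failure of one of the redundant units themselves. A shared dependency like a single control bus multiplies into the availability of the whole group and can drag it below that of a single chiller. Here’s a toy Python model of the effect; the availability numbers are purely illustrative assumptions, not measurements from the Nuremberg site.

```python
# Toy model: availability of an N+1 chiller group with and without a
# shared single point of failure (a common control bus). All numbers
# below are illustrative assumptions, not measurements.

def n_plus_one_availability(unit_availability: float, n: int) -> float:
    """Probability that at least n of n+1 independent units are up."""
    a = unit_availability
    # Either all n+1 units are up, or exactly one of them is down.
    return a ** (n + 1) + (n + 1) * (a ** n) * (1 - a)


def with_shared_bus(group_availability: float, bus_availability: float) -> float:
    """A shared control bus multiplies in as a single point of failure."""
    return group_availability * bus_availability


if __name__ == "__main__":
    group = n_plus_one_availability(unit_availability=0.99, n=3)
    print(f"N+1 chiller group alone:    {group:.6f}")
    print(f"With a 0.999-available bus: {with_shared_bus(group, 0.999):.6f}")
    print(f"With a 0.99-available bus:  {with_shared_bus(group, 0.99):.6f}")
```

With these made-up numbers, a 0.99-available shared bus pulls the whole group below the availability of a single chiller, which is exactly the “nullified redundancy” effect described above.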
Taking Action: Learning from Our Mistakes
This incident taught us a lot about the importance of being prepared for the unexpected.
It’s not enough to have redundancy in place; we need to ensure our systems are resilient enough to withstand unforeseen events.
Here’s what we’re doing to prevent something like this from happening again:
Moving to a New Facility
The first and most significant action we took was to migrate all of our Nuremberg customers to our newly built Hub Europe data center.
This facility is designed to meet Tier 3 standards, ensuring higher availability with even more robust safeguards.
It’s a state-of-the-art facility with advanced redundancy and monitoring systems.
Revamping Disaster Recovery Plans
We’re also revisiting our disaster recovery plans and fallback procedures for Contabo systems, including our Customer Control Panel and support channels.
We’re making sure these systems are always available even in the face of a major incident.
Improving Incident Response
We recognize that our customers rely on us, so we’re working hard to improve our incident response process.
We want to resolve issues faster and keep customers informed throughout the entire process.
We’re committed to embodying German quality and that includes providing the highest level of service and reliability.
Moving Forward
The Nuremberg data center outage was a serious incident but it also gave us an opportunity to grow and learn.
We’re committed to doing everything we can to ensure that this kind of incident never happens again.
We’ll continue to invest in cutting-edge technology, implement robust safeguards, and build a culture of preparedness.
We appreciate your patience and understanding during this event.
We value your trust and are dedicated to providing the best possible service.