Scheduled Maintenance: Lessons Learned

Sorry and Thank You

Last week we had a maintenance window that was scheduled for 12 hours. Instead, our core services were offline for a day and a half, and some backups were throttled for an additional day.

I am very sorry for any inconvenience that this caused. Rest assured, we take any disruptions in service very seriously. I also wanted to thank our customers; I was amazed at how calm and supportive they were during this time.

As of Saturday at 3 p.m., everything has been working and the service is live for all users. Please note that at no time was your backed-up data at risk.

What follows are more technical details on what happened and what we intend to do about it.

The Original Maintenance Plan

We worked on our “central authority” cluster, which maintains customer metadata, handles billing, prepares restores, etc. This cluster is separate from the Storage Pods, where all the backed-up data is stored.

The maintenance plan was to migrate the metadata to another server running an upgraded OS and then to update permissions on that data. Operations like this with large volumes of data take time. We estimated the time based on a previous maintenance and wrote a multi-threaded script to update the permissions in an attempt to accelerate the process.
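For a concrete picture of the permissions step, here is a minimal, single-threaded sketch of what a pass over a metadata tree can look like. The root path and target mode are placeholders for illustration, not our actual script (which, as described below, was written to run multi-threaded to speed things up):

```python
# Illustrative sketch only -- not our actual maintenance script.
# Assumes the migrated metadata lives under a hypothetical /metadata root
# and that the goal is simply to apply a new mode to every file in the tree.
import os

METADATA_ROOT = "/metadata"   # hypothetical path
NEW_MODE = 0o640              # hypothetical target permissions

def update_permissions(root: str) -> int:
    """Walk the tree and chmod every file; return the number of files touched."""
    count = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), NEW_MODE)
            count += 1
    return count

if __name__ == "__main__":
    print(f"Updated {update_permissions(METADATA_ROOT)} files")
```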

What Happened

Due to the large data growth since our previous maintenance, we did not properly account for the time required. Then, the permissions-update script failed to update all the files because it ran with too many threads. Rather than trying to fix the multi-threaded script in a rush, we ran the script single-threaded. That took quite a bit longer, but it was safer than trying to rewrite code in a hurry.

We brought the site back up on Friday afternoon, but when all customers started backing up concurrently, the load overwhelmed the system. We brought the service down briefly and then began slowly allowing customers to back up again. (During this time, restores and other services were fully functional.) By 3 p.m. Saturday, all customers were fully operational.

Takeaways

Taking a step back, we came away with a few basic lessons:

1. Estimate better.

This is not just an “eat your vegetables” approach. We have the data to produce better estimates. Specifically, we will factor in data growth rather than simply reusing the duration of a previous maintenance.
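As a rough illustration (the numbers below are hypothetical, not our actual figures), scaling the previous maintenance duration by data growth looks like this:

```python
# Hypothetical numbers for illustration only.
previous_duration_hours = 12      # how long the last maintenance took
previous_data_tb = 400            # data volume at the last maintenance
current_data_tb = 1200            # data volume today

# Scale the old duration by growth instead of reusing it as-is.
estimate_hours = previous_duration_hours * (current_data_tb / previous_data_tb)
print(f"Estimated window: {estimate_hours:.0f} hours")  # 36 hours, not 12
```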

2. Limit the permissions updating process to 20 threads.

Threads are good. Too many threads are not.
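A minimal sketch of capping the work at 20 threads with a bounded worker pool follows; the paths and mode are placeholders rather than our production code:

```python
# Sketch of a bounded worker pool for the permissions update.
# The pool size is the point here; paths and mode are placeholders.
import os
from concurrent.futures import ThreadPoolExecutor

METADATA_ROOT = "/metadata"   # hypothetical path
NEW_MODE = 0o640              # hypothetical target permissions
MAX_THREADS = 20              # hard cap on concurrency

def chmod_file(path: str) -> None:
    os.chmod(path, NEW_MODE)

def all_files(root: str):
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            yield os.path.join(dirpath, name)

with ThreadPoolExecutor(max_workers=MAX_THREADS) as pool:
    # The pool never runs more than MAX_THREADS chmod calls at once,
    # no matter how many files the walk produces.
    for _ in pool.map(chmod_file, all_files(METADATA_ROOT)):
        pass   # drain the iterator so any exceptions surface
```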

3. Bring the site back online in stages.

We have a lot of users. They have a lot of data. When the site comes online, there is a massive flood of data and requests that strains the service. So we will bring users back incrementally.
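One simple way to do that is to ramp backups up for a growing fraction of accounts over time, watching load between stages. The stages, the interval, and the enable_backups_for() hook below are illustrative, not our actual rollout tooling:

```python
# Illustrative ramp-up loop; the stages, interval, and the
# enable_backups_for(fraction) hook are hypothetical.
import time

STAGES = [0.05, 0.15, 0.35, 0.65, 1.00]   # fraction of customers allowed to back up
WAIT_BETWEEN_STAGES = 30 * 60             # seconds to watch load before widening

def enable_backups_for(fraction: float) -> None:
    """Placeholder: flip a flag so `fraction` of customers may back up."""
    print(f"Backups enabled for {fraction:.0%} of customers")

for fraction in STAGES:
    enable_backups_for(fraction)
    if fraction < 1.0:
        time.sleep(WAIT_BETWEEN_STAGES)   # monitor load before the next stage
```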

Again, thank you for your patience and we hope to keep helping you protect your data for a long time to come.


About Gleb Budman

Gleb Budman is a co-founder and has served as our chief executive officer since 2007, guiding the business from its inception in a Palo Alto apartment to a company serving customers in more than 175 countries with over an exabyte of data under management. Gleb has served as a member of our board of directors since 2009 and as chairperson since January 2021. Prior to Backblaze, Gleb was the senior director of product management at SonicWall and the vice president of products at MailFrontier, which was acquired by SonicWall. Before that, he served in a senior position at Kendara, which was acquired by Excite@Home, and previously founded and successfully exited two other startup companies.