It’s been over a year since Backblaze revealed the designs of our first generation (67 terabyte) storage pod. During that time, we’ve remained focused on our mission to provide an unlimited online backup service for $5 per month. To maintain profitability, we continue to avoid overpriced commercial solutions, and we now build the Backblaze Storage Pod 2.0: a 135-terabyte, 4U server for $7,384. It’s double the storage and twice the performance—at lower cost than the original.
In this post, we’ll share how to make a 2.0 storage pod, and you’re welcome to use the design. We’ll also share some of our secrets from the last three years of deploying more than 16 petabytes’ worth of Backblaze storage pods. As before, our hope is that others can benefit from this information and help us refine the pods. (Some of the enhancements are contributions from helpful kindred pod builders, so if you do improve your Backblaze pod farm, please balance the karma and send us your suggestions!)
Quick Review – What makes a Backblaze Storage Pod
A Backblaze Storage Pod is a self-contained unit that puts storage online. It’s made up of a custom metal case with commodity hardware inside. You can find a parts list in Appendix A. You can also link to a power wiring diagram, see an exploded diagram of parts, and check out a half-assembled pod. The two most noteworthy factors are that the cost of the hard drives dominates the price of the overall pod and that the system is made entirely of commodity parts. For more background, read the original blog post. Now let’s talk about the changes.
Density Matters – Double the Storage in the Same Enclosure
We upgraded the hard drives inside the 4U sheet metal pod enclosure to store twice as much data in the same space. Beyond the up-front cost of filling a rack with pods, one datacenter rack containing 10 pods costs Backblaze about $2,100 per month to operate, split roughly into thirds among physical space rental, bandwidth, and electricity. Doubling the density halves what we spend per terabyte on both physical space and electricity. The picture below is from our datacenter, showing 15 petabytes racked in a single row of cabinets. The newest cabinets squeeze one petabyte into three-quarters of a single cabinet for $56,696.
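The density argument can be checked with a little arithmetic. The sketch below uses only the figures quoted above and assumes that a rack's bill for space and electricity stays the same whatever the drive capacity. Bandwidth is left out because it tracks the data moved rather than the rack count. The function and constant names are ours, for illustration only.

```python
# Figures quoted in the post.
RACK_MONTHLY_COST = 2100.0   # dollars per month for a rack of 10 pods
PODS_PER_RACK = 10
OLD_POD_TB = 67              # first-generation pod
NEW_POD_TB = 135             # pod 2.0

# Assumption: the bill splits into equal thirds (space, bandwidth,
# electricity) and the space and electricity thirds are fixed per rack.
SPACE_AND_POWER = RACK_MONTHLY_COST * 2 / 3


def space_and_power_per_tb(pod_tb, pods=PODS_PER_RACK):
    """Monthly space-plus-electricity cost per terabyte stored in one rack."""
    return SPACE_AND_POWER / (pods * pod_tb)


old_cost = space_and_power_per_tb(OLD_POD_TB)
new_cost = space_and_power_per_tb(NEW_POD_TB)

print(f"67 TB pods:  ${old_cost:.2f} per TB per month")
print(f"135 TB pods: ${new_cost:.2f} per TB per month")
print(f"ratio new/old: {new_cost / old_cost:.3f}")
```

Under these assumptions the ratio is simply 67/135, which is just under one half.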
Our online backup cloud storage is our largest cost, and we are obsessed with providing a service that remains secure, reliable and, above all, inexpensive. We’ve seen competitors that could not meet these demands either exit the market, like Iron Mountain, or raise prices, like Mozy and Carbonite. Controlling the hardware design has allowed us to keep prices low.
We are constantly looking at new hard drives, evaluating them for reliability and power consumption. The Hitachi 3TB drive (Hitachi Deskstar 5K3000 HDS5C3030ALA630) is our current favorite for both its low power demand and astounding reliability. The Western Digital and Seagate equivalents we tested dropped out of RAID arrays and failed outright at much higher rates. Even the Western Digital Enterprise Hard Drives had the same high failure rates. The Hitachi drives, on the other hand, perform wonderfully.
Twice as Fast
We’ve made several improvements to the design that have doubled the performance of the storage pod. Most of the improvements were straightforward and helped by Moore’s Law. We bumped the CPU up from the Intel dual core CPU to the Intel i3 540 and replaced the motherboard, which had one Gigabit Ethernet port, with a Supermicro motherboard that has two. RAM dropped in price, so we doubled it to 8 GB in the new pod. More RAM enables our custom Backblaze software layer to create larger disk caches that can really speed up certain types of disk I/O.
In the first generation storage pod, we ran out of the faster PCIe slots and had to use one slower PCI slot, creating a bottleneck. Justin Stottlemyer from Shutterfly found a better PCIe SATA card, which enabled us to reduce the SATA cards from four to three. Our upgraded motherboard has three PCIe slots, completely eliminating the slower PCI bottleneck from the system. The updated SATA wiring diagram is seen below. Hint: The pod will work if you connect every port multiplier backplane to a random SATA connection, but if you wire it up as shown below, the 45 drives will appear named in sequential order.
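To see why the wiring order matters, here is a minimal sketch of how 45 sequentially numbered drives fan out across the hardware: three PCIe cards, three used ports per card, and one five-drive port multiplier backplane per port. The card-major ordering and the zero-based numbering are our assumptions for illustration. The actual wiring diagram is the authority.

```python
CARDS = 3                  # PCIe SATA cards
PORTS_USED_PER_CARD = 3    # each card has four ports; only three are used
DRIVES_PER_BACKPLANE = 5   # one port multiplier backplane per SATA port

TOTAL_DRIVES = CARDS * PORTS_USED_PER_CARD * DRIVES_PER_BACKPLANE


def drive_location(index):
    """Map a zero-based drive index to (card, port, slot).

    Assumes drives are enumerated card by card, then port by port,
    then slot by slot on each backplane.
    """
    if not 0 <= index < TOTAL_DRIVES:
        raise ValueError(f"drive index must be in 0..{TOTAL_DRIVES - 1}")
    backplane, slot = divmod(index, DRIVES_PER_BACKPLANE)
    card, port = divmod(backplane, PORTS_USED_PER_CARD)
    return card, port, slot


for i in (0, 17, 44):
    card, port, slot = drive_location(i)
    print(f"drive {i:2d} -> card {card}, port {port}, backplane slot {slot}")
```

If the backplanes are cabled in any other order, the same 45 drives are still reachable, but a mapping like this one no longer matches the physical layout.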
We upgraded the Linux 64-bit OS from Debian 4 to Debian 5, and we no longer use JFS as the file system. We selected JFS years ago for its ability to accommodate large volumes and low CPU usage, and it worked well. However, ext4 has since matured in both reliability and performance, and we realized that with a little additional effort we could get all of its benefits while living within the unfortunate 16 terabyte volume limitation of ext4. One of the required changes to work around ext4’s constraints was to add LVM (Logical Volume Manager) above the RAID 6 but below the file system. In our particular application (which features more writes than reads), ext4’s performance was a clear winner over ext3, JFS, and XFS.
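A rough calculation shows why a layer like LVM is needed between the RAID 6 arrays and ext4. The sketch below assumes 15-drive RAID 6 arrays (three per pod), which is our assumption and is not stated in this post. It also treats the 16 terabyte limit and the drive size in the same units, as the text does.

```python
import math

DRIVE_TB = 3            # Hitachi 3TB drives
EXT4_LIMIT_TB = 16      # volume limit cited in the post
DRIVES_PER_ARRAY = 15   # assumption: 45 drives split into three arrays


def raid6_usable_tb(drives, drive_tb):
    """Usable capacity of a RAID 6 array: two drives' worth goes to parity."""
    if drives < 4:
        raise ValueError("RAID 6 needs at least four drives")
    return (drives - 2) * drive_tb


def min_volumes(array_tb, limit_tb=EXT4_LIMIT_TB):
    """Fewest logical volumes needed so each one fits under the limit."""
    return math.ceil(array_tb / limit_tb)


usable = raid6_usable_tb(DRIVES_PER_ARRAY, DRIVE_TB)
print(f"usable per array: {usable} TB")
print(f"logical volumes needed per array: {min_volumes(usable)}")
```

Under these assumptions a single array is more than twice the size of the largest ext4 volume, so it has to be carved into several logical volumes before formatting.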
With these performance improvements, we see the new storage pods in our datacenter accepting customer data more than twice as fast as the older generation pods. It takes approximately 25 days to fill a new pod with 135 terabytes of data. The chart below shows the measured fill rates of an old Pod versus a new Pod, both under real-world maximum load in our datacenter.
Please note: The above graph is not the benchmarked write performance of a pod; we have easily saturated the Gigabit pipes copying data from one pod to another internally. This graph shows pods running in production, accepting data from thousands of simultaneous and independent desktop machines running Windows and Mac OS, where each desktop is forming HTTPS connections to the Tomcat web server and pushing data to the pod. At the same time, customers are preparing restores that read data off those drives, and there are system cleanup processes running, occasional RAID repairs, and so on. In this end-to-end measurement, the new pods are twice as fast in our environment.
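For a sense of scale, the fill time quoted above can be turned into an average ingest rate. This sketch assumes decimal terabytes (10^12 bytes) and a steady rate over the 25 days. Real traffic is bursty, so treat the result as an average only.

```python
POD_TB = 135           # capacity of a 2.0 pod
FILL_DAYS = 25         # approximate time to fill it in production
SECONDS_PER_DAY = 86_400

tb_per_day = POD_TB / FILL_DAYS
bytes_per_second = POD_TB * 1e12 / (FILL_DAYS * SECONDS_PER_DAY)
mb_per_second = bytes_per_second / 1e6
mbit_per_second = mb_per_second * 8

print(f"{tb_per_day:.1f} TB per day")
print(f"{mb_per_second:.1f} MB/s average")
print(f"{mbit_per_second:.0f} Mbit/s average")
```

The average works out to about half of one Gigabit Ethernet port, which is consistent with the note that production load, not the network pipe, sets the fill rate.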
Lessons Learned: Three Years, 16 Petabytes and Counting
Backblaze is employee owned (with no VC funding or other deep pockets), so we have two choices: 1) stay profitable by keeping costs low or 2) go out of business. Staying profitable is not just about upfront hardware costs; there are ongoing expenses to consider.
One of the hidden costs of a datacenter is the headcount (salary) for the employees who deploy pods, maintain them, replace bad drives with good, and generally manage the facility. Backblaze has 16 petabytes and growing, and we employ one guy (Sean) whose full-time job is to maintain our fleet of 201 pods, which hold 9,045 drives. Typically, once every two weeks, Sean deploys six pods during an eight-hour work day. (He gets a little help from one of us to lift each pod into place because they each weigh 143 pounds.)
Our philosophy is to plan for equipment failure and build a system that operates in spite of it. We have a lot of redundancy, ensuring that if a drive fails, immediate replacement isn’t critical. So at his leisure, Sean also spends one day each week replacing drives that have gone bad. As of this week, Backblaze has more than 9,000 hard drives spinning in the datacenter, the oldest of which we purchased four years ago. We see fairly high infant mortality on the hard drives deployed in brand new pods, so we like to burn the pods in for a few days before storing any customer data. We have yet to see any drives die of old age, which will be fascinating to monitor in the next few years. All told, Sean replaces approximately 10 drives per week, indicating a drive failure rate of roughly 5 percent per year across the entire fleet, which includes infant mortality and also the higher failure rates of earlier drive models. (We are currently seeing failures in less than 1 percent of the Hitachi Deskstar 5K3000 HDS5C3030ALA630 drives that we’re installing in pod 2.0.)
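The fleet-wide failure rate follows directly from the replacement count. The sketch below annualizes the quoted figures. Because "approximately 10 drives per week" is itself a rounded number, the result should be read as a ballpark rather than a precise measurement.

```python
DRIVES_IN_FLEET = 9045        # 201 pods x 45 drives
REPLACED_PER_WEEK = 10        # approximate, per the post
WEEKS_PER_YEAR = 52


def annual_failure_rate(replaced_per_week, fleet_size):
    """Fraction of the fleet replaced per year at a steady weekly rate."""
    return replaced_per_week * WEEKS_PER_YEAR / fleet_size


rate = annual_failure_rate(REPLACED_PER_WEEK, DRIVES_IN_FLEET)
print(f"annualized failure rate: {rate:.1%}")
```

With these inputs the rate lands between 5 and 6 percent, in line with the figure quoted above.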
We monitor the temperature of every drive in our datacenter through the standard SMART interface, and we’ve observed in the past three years that: 1) hard drives in pods at the top of racks run three degrees warmer on average than those in pods on the lower shelves; 2) drives in the center of the pod run five degrees warmer than those on the perimeter; 3) pods do not need all six fans, because the drives maintain the recommended operating temperature with as few as two; and 4) heat doesn’t correlate with drive failure (at least in the ranges seen in storage pods).
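As an illustration of reading a drive temperature over SMART, here is a small sketch that parses the attribute table printed by smartmontools' `smartctl -A`. The post does not say which tool Backblaze uses, so this is an assumption. The sample text is made up for the example, and attribute 194 is a common but not universal place for vendors to report temperature.

```python
def parse_temperature_celsius(smartctl_output):
    """Return the raw value of SMART attribute 194, or None if not found.

    Expects the attribute table from `smartctl -A`, where the tenth
    whitespace-separated column holds the raw value.
    """
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[0] == "194":
            try:
                return int(fields[9])
            except ValueError:
                return None
    return None


# Hypothetical sample output, not a real measurement.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  9 Power_On_Hours          0x0012   100   100   000    Old_age   Always       -       1234
194 Temperature_Celsius     0x0002   171   171   000    Old_age   Always       -       35 (Min/Max 20/45)
"""

print(parse_temperature_celsius(SAMPLE))
```

In practice the text would come from running the tool against each drive, for example via `subprocess.run`, and the readings would be grouped by rack position and pod slot to reproduce comparisons like the ones above.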
One important note: Because all of the parts (including drives) in the Backblaze storage pod come with a three-year warranty, we rarely pay for a replacement part. The drive manufacturers take back failed drives with “no questions asked” and send free replacements. If you figure that storage resellers, such as NetApp and EMC, tack on a three-year support fee, a petabyte of Backblaze storage costs less than their support contract alone. The chart below takes all of our experience into account and shows what it costs to own and maintain a petabyte of storage for three years:
In the chart above, the economies of scale only kick in if you really do need to store a full petabyte or more. For a small amount of data (a few terabytes), Amazon S3 could easily save money, but the Amazon option is clearly a dubious financial choice for a company with large, multi-petabyte storage needs.
The Backblaze storage pod is just one building block in making a cloud storage service. If all you need is cheap storage, this may suffice. If you need to build a reliable, redundant, monitored storage system, you’ve got more work ahead of you. At Backblaze, the software that manages and monitors the cloud service is proprietary technology that we’ve developed over the years.
We offer our storage pod design free of any licensing or any future claims of ownership. Anybody is allowed to use and improve upon it. You may build your own cloud system and use the Backblaze storage pod as part of your solution. The steps to assemble a storage pod, including diagrams, can be found on our original blog post, and an updated list of parts is provided below in Appendix A. We don’t sell the design, so we don’t provide support or a warranty for people who build their own. To all of those builders who take up the challenge, we’d love to hear from you and welcome any insights you provide about the experience. And please send us a photo of your new 135-terabyte pod.
Appendix A – Price List:
Hitachi 3TB 5400 RPM HDS5C3030ALA630
Zippy PSM-5760 Power Supply
CFI-B53PM 5 Port Backplane (SiI3726), available in qty of 9 for $47 from CFI Group
Syba PCI Express SATA II 4-Port RAID Controller Card SY-PEX40008
Mechatronics G1238M(OR E)12B1-FSR 12V 3-Wire Fan
Crucial CT25672BA1339 2GB, DDR3 PC3-10600 (4x 2GB = 8GB total)
Western Digital Caviar Blue WD1600AAJS 160GB 7200 RPM
FrozenCPU ele-302 Bulgin Vandal Momentary LED Power Switch 12″ 2-pin
Newegg GC36AKM12 3 Foot SATA Cable
Fastener SuperStore 1/4″ Round Nylon Standoffs Female/Female 4-40 x 3/4″
Aero Rubber Co. 3.0 x .500 inch EPDM (0.03″ Wall)
Vantec VDK-PSU Power Supply Vibration Dampener
Acoustic Ultra Soft Anti-Vibration Fan Mount AFM02
Acoustic Ultra Soft Anti-Vibration Fan Mount AFM03
Small Parts MPN-0440-06P-C Nylon Pan Head Phillips 4-40 x 3/8″
House of Foam 16″ x 17″ x 1/8″ Foam Rubber Pad
Custom wiring harnesses for PSU1 and PSU2 (the Zippy power supplies):
See detailed wiring harness diagrams.
SiI3726 on each port multiplier backplane to attach five drives to one SATA port.
SiI3124 on three PCIe SATA cards. Each PCIe card has four SATA ports on it, although we only use three of the four ports.