This report was prepared from the FujiFilm 2022 Summit. It reviews the astounding data growth in the archive storage space, some frank advice from Silicon Valley’s storage gurus on avoiding imminent vertical market failure from runaway growth in archival data, and the innovations driving the race to zero $/GB, zero waste, zero carbon footprint storage.
The Zettabyte Era
By 2025, roughly 175 ZB of data are projected to be created and 11.7 ZB to be stored. For scale, one zettabyte is roughly equivalent to 66 years of the Large Hadron Collider’s experimental data, 125 million years of 1-hour TV shows, 55.3 million LTO-9 cartridges, or 50 million 20 TB hard disk drives. If we project the annual 30% CAGR forward, we’ll enter the Yottabyte Era by 2043.
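The scale comparisons and the growth projection can be checked with a few lines of arithmetic. The sketch below assumes decimal units (1 ZB = 10^9 TB, 1 YB = 1,000 ZB) and assumes the 30% CAGR is applied to the 11.7 ZB stored in 2025; the text does not state the base volume, so that is a guess. The helper names are illustrative only.

```python
import math

ZB_IN_TB = 1_000_000_000  # 1 ZB = 10^9 TB, assuming decimal units
YB_IN_ZB = 1_000          # 1 YB = 1,000 ZB


def units_needed(total_tb: float, unit_capacity_tb: float) -> int:
    """Number of whole media units needed to hold total_tb."""
    return math.ceil(total_tb / unit_capacity_tb)


def years_to_reach(start_zb: float, target_zb: float, cagr: float) -> int:
    """Whole years of compound growth until start_zb reaches target_zb."""
    years = 0
    volume = start_zb
    while volume < target_zb:
        volume *= 1 + cagr
        years += 1
    return years


# 20 TB drives needed to hold a single zettabyte.
hdds_per_zb = units_needed(ZB_IN_TB, 20)

# Assumption: growth starts from the 11.7 ZB stored in 2025.
years = years_to_reach(11.7, YB_IN_ZB, 0.30)

print(f"20 TB HDDs per ZB: {hdds_per_zb:,}")
print(f"Stored data passes 1 YB around {2025 + years}")
```

Under these assumptions, 50 million 20 TB drives hold one zettabyte, and the stored volume crosses one yottabyte within about a year of the 2043 estimate. The exact year shifts with the base year and base volume chosen.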
Archive data accounts for 9.0 ZB of the total data stored, roughly 80%, according to IDC figures. Traditionally, archival data has been an umbrella term for the boring stuff: medical documents, corporate compliance documents, emails, old movies.
Looking to the future, archive will refer to a spectrum of data, from active archives to dark archives, used to store everything from film-industry media to AI/ML/IoT training data that is accessed for a few weeks before moving back down to the lower archive tiers. The divisions between the tiers on the archive spectrum will vary based on access frequency, with lower tiers growing larger as data ages.
How data moves from one tier to another depends entirely on whether the tiered storage architecture is tightly or loosely coupled. The top of the storage hierarchy holds the golden copy of the data, whereas the lower tiers (3) hold the master copy. About 10% of the world’s data is stored in this golden copy tier, the highest performance tier, whereas 80% is low-activity archival data in the lowest tiers. The greatest challenge lies in the primary and secondary stages, where data moves from hot to cold. This region is dynamic: suitable neither for performance-critical data nor for long-term data retention. By 2025, this region is expected to take shape as the active archive tier: data that sees high access for three to four weeks and is then moved back down to the deep archive.
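To make the idea of tiers divided by access frequency concrete, here is a minimal sketch of a tiering rule. The 28-day active window follows the "three to four weeks" figure above; the ten-year cut-off for the dark archive and the function name are assumptions made for illustration, not thresholds given at the summit.

```python
from datetime import date

# Hypothetical cut-offs, in days since the data was last accessed.
ACTIVE_WINDOW_DAYS = 28      # "three to four weeks" of high access, per the text
DARK_AFTER_DAYS = 10 * 365   # assumption: untouched for about a decade


def archive_tier(last_access: date, today: date) -> str:
    """Pick an archive tier from the time since the last access."""
    idle_days = (today - last_access).days
    if idle_days <= ACTIVE_WINDOW_DAYS:
        return "active archive"
    if idle_days < DARK_AFTER_DAYS:
        return "deep archive"
    return "dark archive"


today = date(2025, 1, 1)
for last in (date(2024, 12, 20), date(2023, 6, 1), date(2010, 1, 1)):
    print(last, "->", archive_tier(last, today))
```

A real policy would likely also weigh the probability of future access and the value of the data, not only the idle time.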
Storage Media: SSD, HDD, and Tape: What Stores Old Facebook Posts?
There are two media that serve the archive tiers: tape and HDD. Let’s discuss how they differ. Note that SSDs are rarely used for archival storage. While SSDs are the highest performance storage media, with data access times between 25 and 100 microseconds, each time a write operation is performed (that is, a one or a zero is encoded in a flash cell’s transistor) the cell degrades slightly. As a result, SSDs are limited by the number of write cycles they can endure, which makes them unsuitable as a long-term archival storage solution. They are also almost 10x more expensive than HDDs on a $/GB basis. Tape and HDD serve the archive layer, and the majority of hyperscale archival data (80%) is stored on HDDs.
HDDs are a storage medium in which ones and zeros are written to magnetic disks in 3.5″ or 2.5″ form factors, spinning at 7,200 RPM to 15,000 RPM. As the disk spins, a read/write head flips the magnetic polarity of grains on the disk, where the direction of the magnetic field vector determines whether a one or a zero was written. At a datacenter, a 3.5″ HDD is mounted in a system for online access, such as a NAS (network-attached storage), a JBOD (just a bunch of disks), or a server. There are only three original HDD manufacturers in the world: Toshiba, Western Digital, and Seagate. They are in a never-ending battle to drive to the lowest $/GB, highest areal density, and best access time, all under pressure from supplier partnerships and a complex supply chain impacted by global politics (well, yeah, that describes every global supply chain, does it not?).
Tape, on the other hand, is a much older storage medium, first used for data storage in the early 1950s and standardized in the late 1990s under the LTO (Linear Tape-Open) format, which IBM developed together with HP and Seagate. The tape market is what business folk call a “consolidated market” because there are only two form factors of tape media, both under the same jurisdiction: IBM. IBM’s parenting style in the storage industry resembles that of a tiger mom. IBM manages its own tape form factor, is the sole manufacturer of tape drives for both the LTO and IBM form factors, and insists on releasing its IBM cartridges at least two years ahead of LTO’s.
Tape drives are the systems used to read and write tape cartridges, and drives and cartridges are housed together in a tape library, which can store up to half an exabyte. The benefit of using tape, especially for archive, is its ability to be disconnected from the network. Simply take out the cartridge, throw it in a box, and there is a physical “air gap” between that data and the network. Historically, tape has been used for backup, hierarchical storage managers (HSMs), and media asset management. At some hyperscale operations (Amazon S3, Microsoft Azure, Alibaba), tape is used as the primary archive storage medium. You can tell because access times can range from 1 to 12 hours, since the physical tape cartridge may be stored in a box off-site.
Total Cost of Ownership (TCO)
The choice between tape and HDDs for a cloud operation depends not only on access time, average storage capacity per unit, and offline storage capability, but also on the Total Cost of Ownership (TCO). TCO refers to all the costs associated with using the media, from raw production to end of life.
Consider a full-height LTO tape drive, the system used to read and write a tape cartridge. Its average energy usage is 0.031 kWh, with an average life cycle of 6.85 years. Accounting for production, distribution, and operational energy, a single LTO tape drive will produce 1.4 metric tons of CO2 per year. Storing 28 PB of archival data for 10 years in a tape-based storage solution will produce 78.1 metric tons of CO2, using 14 LTO-9 tape drives and 1,500 LTO-9 cartridges in a single frame. The equivalent amount stored on HDDs in a JBOD would produce 1,954.3 metric tons of CO2 over 10 years, using 18 TB HDDs during the first five-year cycle and 36 TB HDDs during the second. Those figures indicate that an HDD-based system consumes 10 times more energy per year than a tape-based system (Brume, IBM).
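The ten-year totals are easier to compare once normalized. The sketch below uses only the figures quoted above and assumes decimal units and 18 TB of native capacity per LTO-9 cartridge; the variable and function names are illustrative.

```python
# Figures quoted in the text (Brume, IBM).
TAPE_CO2_TONS = 78.1     # tape solution, 10 years
HDD_CO2_TONS = 1954.3    # HDD JBOD solution, 10 years
CAPACITY_PB = 28
YEARS = 10
CARTRIDGES = 1500
LTO9_NATIVE_TB = 18      # assumption: native (uncompressed) capacity


def co2_per_pb_year(total_tons: float, capacity_pb: float, years: float) -> float:
    """Metric tons of CO2 per petabyte stored per year."""
    return total_tons / (capacity_pb * years)


tape_rate = co2_per_pb_year(TAPE_CO2_TONS, CAPACITY_PB, YEARS)
hdd_rate = co2_per_pb_year(HDD_CO2_TONS, CAPACITY_PB, YEARS)
co2_ratio = HDD_CO2_TONS / TAPE_CO2_TONS

# Sanity check on the cartridge count: native capacity of 1,500 cartridges.
native_capacity_pb = CARTRIDGES * LTO9_NATIVE_TB / 1000

print(f"tape: {tape_rate:.3f} t CO2 per PB-year")
print(f"HDD:  {hdd_rate:.3f} t CO2 per PB-year")
print(f"HDD/tape CO2 ratio: {co2_ratio:.1f}x")
print(f"native capacity of {CARTRIDGES} cartridges: {native_capacity_pb} PB")
```

On the quoted totals, the CO2 gap works out to roughly 25x, which is a separate measure from the 10x energy figure. The 1,500 cartridges hold 27 PB at native capacity, slightly under the 28 PB quoted; the source does not say how the difference is accounted for.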
Right now, you can purchase an LTO-9 tape cartridge with 18 TB of native capacity for $148.95, with 1,035 m (more than a kilometer) of magnetic tape inside. HDDs, on the other hand, cost more per unit ($529.99 for a WD 20TB SATA Gold), but their areal density (the number of bits per square inch of media) is three orders of magnitude higher than tape’s. Next-gen HDDs suited for archive will approach a whopping 36 TB, and rumors have spread of a lower performance 50 TB HDD from Western Digital entering the market. These new HDDs will likely serve part of the future archive storage market, specifically the first, active tier of the archive, but they cannot be used for all archive data, especially as it gets older and colder.
This raises the question: between HDD and tape, what is each best suited for? Where does energy efficiency become the highest concern? And what about data accessed once every 30 or 50 years in the dark archive?
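The two list prices can be put on the same $/TB footing. This is a media-only sketch: it ignores tape drives, libraries, enclosures, and power, all of which a full TCO comparison would include, and the function name is illustrative.

```python
def cost_per_tb(price_usd: float, capacity_tb: float) -> float:
    """Media cost in US dollars per terabyte of native capacity."""
    return price_usd / capacity_tb


# Prices and capacities as quoted in the text.
tape_usd_per_tb = cost_per_tb(148.95, 18)   # LTO-9 cartridge, 18 TB native
hdd_usd_per_tb = cost_per_tb(529.99, 20)    # WD 20TB SATA Gold
price_ratio = hdd_usd_per_tb / tape_usd_per_tb

print(f"tape: ${tape_usd_per_tb:.2f}/TB")
print(f"HDD:  ${hdd_usd_per_tb:.2f}/TB")
print(f"HDD media costs about {price_ratio:.1f}x more per TB")
```

At these prices the HDD costs a little over three times as much per terabyte of media, before any system-level costs are added.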
The Deep Dark Archives
Recall that 80% of hyperscale’s archival data is stored on HDDs. Is this truly the best solution? I’m just an intern, but I say no. Here’s why:
If we accept the tiered archive model, in which the probability of access, the access frequency, and the average value of the data determine the tiers, then the Deep Dark Archives should be stored on neither tape nor HDDs. The data stored in the dark archives has almost no value. Our conditions, therefore, are: (1) near zero $/GB, (2) near zero carbon footprint, and (3) near zero product waste. Storage at a hyperscale datacenter accounts for 19% of total datacenter power, and moving cold data from HDDs to tape can dramatically reduce the ten-year CO2e while simultaneously reducing e-waste. There are also substantial TCO savings from migrating cold data to tape, and companies of all sizes are looking to improve sustainability for customer-facing storage.
The Media is the Data, the Data is the Media
It’s important to take a step back now and recognize the issue at hand, which I would go so far as to describe as “hoarding”. We are collecting swaths of data and will continue to collect swaths of data, which will continue to demand storage media devices to store them. This cycle is wasteful. Is there a solution?
— to be continued…