Originally published on VMblog August 03, 2020
A couple of weeks ago, VMblog spoke with industry expert David Morris, VP of Product and Global Marketing at FalconStor, where we found out more about the company and its technology, persistent containers, and the storage industry in 2020. As a follow-on to that conversation, we again spoke with Morris, this time to learn more about the retention and reinstatement archival market.
VMblog: Last time you alluded to the retention and reinstatement archival market. What does that mean?
Morris: First, thank you for having us back. Yes, we did chat briefly about it. I ask for your patience and a bit of latitude to frame the answer. Let’s split the storage market in two. On the operational side, IT professionals manage the active applications that run the business and the short-term data backup and recovery processes that keep the business running. On the archival side, data ownership transfers from IT operations to custodians who manage data in passive storage and long-term retention archives. Yes, we polarize the market into two extremes, with “active” data on one pole and “passive” data on the other, for simplicity and emphasis, with the understanding that actual uses vary across the spectrum.
On the operational side, there has been significant investment in new compute and high-performance storage technologies. Rapidly gaining value from data is a differentiator for businesses; however, the operational market is also a highly competitive, red-ocean marketplace. On the archival side, there has been significantly less investment because it has traditionally been viewed as a cost center. However, three major growth areas will change the archival business dynamics and require new features and functionality for archived data, and they apply whether the data sits in a data center or in the cloud.
The first growth wave was deriving value from archival data, driven by the rise of data scientists, which is straightforward. The second growth wave was the expansion of compliance, regulatory, legal, and privacy mandates and laws, which continuously widen their scope and purview to include emerging data types, increasing data volumes, and extending retention periods. The third growth wave for archival is the deployment of data-generating endpoints (IoT, IIoT, MIoT). Data archives will be filled by around-the-clock machine-generated information rather than comparatively limited human-generated data. Each of these growth areas is entwined in a virtuous cycle that drives growth in the others. Existing archival paradigms and products are inadequate to meet these new demands.
The new archive will require retention policies and capabilities as standard features, which is half the battle. Throughout the archive lifecycle, the data’s validity must be assessed and recorded, which is the other half. The long-term archive must maintain, ensure, and verify at periodic intervals the confidentiality, integrity, authenticity, accessibility, and nonrepudiation of the data, and must enforce the chain of custody throughout its lifecycle. The archive must present a secured and journaled record of all periodic integrity checks, data movement, access attempts, and managing custodians for audit or evidentiary review.
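To make the journaling requirement concrete, here is a minimal sketch in Python of a tamper-evident audit journal. All names here are illustrative, not FalconStor’s actual implementation: each entry records an event (an integrity check, an access attempt, a custodian change) and includes the hash of the previous entry, so altering any past record invalidates every later one.

```python
import hashlib
import json
import time


def sha256_of(path: str) -> str:
    """Stream a file through SHA-256 so large archive objects need not fit in memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def append_journal_entry(journal: list, event: str, details: dict) -> dict:
    """Append a tamper-evident entry: each entry hashes the previous entry,
    so rewriting history breaks the chain for every subsequent record."""
    prev_hash = journal[-1]["entry_hash"] if journal else "0" * 64
    entry = {
        "timestamp": time.time(),
        "event": event,  # e.g. "integrity_check", "access_attempt", "custodian_change"
        "details": details,
        "prev_hash": prev_hash,
    }
    payload = json.dumps(entry, sort_keys=True).encode()
    entry["entry_hash"] = hashlib.sha256(payload).hexdigest()
    journal.append(entry)
    return entry
```

An auditor can replay the journal front to back, recomputing each hash, to confirm that no check, movement, or access record was silently edited or removed.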
With double-digit growth, this accelerating market carries a combination of new features and compliance and legal requirements that is significantly different from the products available today, and compelling enough to warrant a new market segment, which we call “Retention and Reinstatement” archival.
VMblog: Is a retention and reinstatement archive needed across industries?
Morris: Yes, it is. Let me prattle a bit about the history and how we get to the next end game. We are currently entering the third wave of data retention and reinstatement, which will produce electronic storage archives 10,000 to 100,000 times larger than the first and second waves combined, and that is a conservative estimate. Why did we not notice the first and second waves? They were not large enough for the average end-user and most companies to see, unless you worked in storage or at a sizeable data-driven company. During the second wave, AWS was adding enough servers each day to support a $7B company in 2013 (James Hamilton, AWS re:Invent 2013). The third wave will eclipse AWS’s previous scale. If we believe our other friend, George Gilder, he calculates that the mega data centers of Google, Facebook, and AWS are near their theoretical limits. Perhaps this is why AWS is offering micro data center pods to reside at a customer’s headquarters.
The first traditional data archives looked like this: Oil and gas companies keep their seismic data forever; as cities grow or territorial boundaries shift, they may never get another chance to conduct a study. In pharmaceuticals and biotechnology, the records’ retention period is up to a hundred years (for good reasons…Zombies), a market where EMC Documentum and OpenText did very well. The Sarbanes-Oxley Act mandates records retention from five years to nearly forever for financial trading communications, again a fit for EMC Centera storage. Insurance, construction, movies and entertainment, and aircraft all have extended retention periods due to regulations or the inherent value of the data. Until 2008, the data that needed to be under retention was focused, well defined, and typically documents or electronic copies of documents. Collections were very targeted and mostly done manually.
In 2008, electronic discovery (eDiscovery) emerged as a side effect of the patent battle between Broadcom and Qualcomm. Subsequently, any electronic communication or data became discoverable in any legal proceeding, which opened the data collection floodgates. Email, email archives, messenger texts, phone texts, computer hard drives, thumb drives, and more became collection targets. With electronic discovery software, data collection became nearly automatic, and savvy attorneys quickly learned every repository in the enterprise that potentially held evidentiary data so that they could subpoena it.
In scoping some of the eDiscovery collection systems, initial discovery data volumes were often over one hundred petabytes in 2008, and a majority of that data is still on legal hold today. With legal hold, storage companies get the bonus plan. The data corpus is collected without changing the data or the metadata, a SHA hashing algorithm verifies its authenticity, and a complete copy is made and stored.
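The collect-hash-copy flow Morris describes can be sketched as follows. This is a simplified Python illustration, not an actual eDiscovery product API: the source file is digested with SHA-256, copied with its metadata preserved, and the copy is re-digested to prove the two are byte-identical.

```python
import hashlib
import shutil


def fixity_digest(path: str) -> str:
    """SHA-256 digest of a file, streamed in chunks."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def collect_for_legal_hold(source: str, dest: str) -> str:
    """Copy a file byte-for-byte, preserving timestamps and permissions,
    then verify the copy against the original's digest before accepting it."""
    original = fixity_digest(source)
    shutil.copy2(source, dest)  # copy2 preserves file metadata, unlike copyfile
    if fixity_digest(dest) != original:
        raise RuntimeError(f"fixity check failed for {dest}")
    return original  # recorded as the item's authenticity reference
```

The returned digest becomes the reference value: any later re-check that produces a different digest signals tampering or degradation of the held copy.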
The third wave is machine-driven. Data creation will be nonstop across all industries, and data analysis will deliver focused competitive differentiation, from agriculture to logistics to medicine, with significant effects. This data will also be co-opted for compliance, regulatory, legal, and privacy usage, as well as other unintended uses. Data retention and reinstatement periods will continue to grow throughout the third wave, extending to new data types as their historical, monetary, and legal merit come into focus.
VMblog: For retention and reinstatement, is the bar much higher than for operational backup and recovery?
Morris: The scrutiny and burden of proof that long-term data archival will demand are significantly higher than the existing standard afforded a backup and recovery copy or a traditional archive. Furthermore, its lifecycle spans decades to centuries versus weeks to months for a backup copy. Comparing backup and recovery to retention and reinstatement is an apples-to-oranges juxtaposition.
VMblog: Are traditional disk archives and physical tape archives not feasible for the Retention and Reinstatement market?
Morris: No, they are not feasible. Retention and reinstatement features are an all-or-nothing proposition, with long-tailed expense implications when archiving data for a half-century or longer.
Accessibility of physical tape under the tight timelines of production or deletion requests is challenging, and the physical chain of custody of tape is problematic. The need to refresh tapes every five to seven years due to tape degradation is expensive, time-consuming, and a security challenge. Historically, fifteen to twenty-five percent of physical tapes degrade and are unrecoverable by the time of the refresh.
Disk archive solves the accessibility and digital chain-of-custody challenges up to a point. However, a storage system’s end of life is around seven to ten years, at which point the data must be copied to a new system. Most storage systems entangle the data within the system, and it is very problematic to move data from one system to another without changing metadata, data, or retention policies, or to verify that the copied data is identical. Historically, seamless portability across rival storage systems has consistently been a low-priority feature for storage vendors and a top priority for storage customers. As we discussed last time, the nightmare scenario is the EMC Centera, a great product, but one where the data retention period outlived the product lifecycle and led to an expensive and time-consuming data migration challenge for customers (see LMC Associates for more details on this challenge). This nightmare will be repeated as retention periods extend to meet new demands.
As we look to the cloud, the hardware is typically abstracted or virtualized, so servers and storage can change with little impact within a single vendor. Most clouds do not implement a zero-trust model, so there are security challenges. S3 compatibility is the de facto standard for transferring data between clouds of different vendors; however, not all S3 clouds are created equal, and differences between cloud schemes create problems. The top three cloud vendors are the largest purchasers of tape systems due to their low acquisition cost; customers bear the total cost of ownership over the extended retention periods. The first big lawsuit in which a cloud vendor loses roughly 15% of a client’s evidentiary data to tape degradation or mishandling will be interesting. Another factor for cloud customers is the inability to manage their archival expenses over time. Most companies initially viewed the cloud as an active and competitive market, but for today’s customers there is no easy way to move 10 or 100 petabytes of data from one cloud vendor to another while maintaining data coherency, never mind paying the data egress fees. Like the traditional storage vendors, cloud vendors prioritize data portability in line with their own interests rather than the customers’.
VMblog: The archive has been around forever, so why is it such a challenging problem?
Morris: As we mentioned earlier, the archive is traditionally viewed as a cost center, and there has been an overall lack of investment in archival technologies. With the three waves reinforcing each other, these growth drivers will leave many companies stranded with protracted retention mandates and hundreds of petabytes of archival data on the one hand and cost-prohibitive, aging archival solutions on the other. With retention periods increasing from 10 or 25 years to 50 or 100 years, companies will need to actively manage their long-term total cost of archival, or the expense could quickly compound and sink many companies if new solutions are not available.
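To see how the expense compounds, here is a rough back-of-the-envelope model in Python. Every parameter is a hypothetical assumption for illustration, not a figure from FalconStor: archive capacity grows each year, per-petabyte costs decline, and periodic media refreshes add a migration charge.

```python
def archival_tco(initial_pb=100, years=50, growth=0.25, cost_decline=0.10,
                 cost_per_pb_year=100_000.0, refresh_interval=7,
                 refresh_cost_per_pb=50_000.0):
    """Total cost of retaining a growing archive over `years` years.
    All defaults are illustrative assumptions, not vendor figures."""
    total = 0.0
    capacity = float(initial_pb)
    unit_cost = cost_per_pb_year
    for year in range(1, years + 1):
        total += capacity * unit_cost                # annual storage cost
        if year % refresh_interval == 0:
            total += capacity * refresh_cost_per_pb  # media/system refresh migration
        capacity *= 1 + growth                       # machine data keeps arriving
        unit_cost *= 1 - cost_decline                # media gets cheaper over time
    return total
```

Under these assumptions, 25% annual data growth outruns a 10% annual cost decline, so the yearly bill rises without bound; active cost management means changing one of those two rates, which is exactly where new archival solutions have to compete.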
It is not just the technical challenges we must overcome with Retention and Reinstatement. It is much more complicated and will become more complex over time. Whether in a compliance audit, a legal case, or under the General Data Protection Regulation (GDPR), fines, sanctions, and even jail time can be levied for mishandling information over its lifecycle. GDPR channels awarded penalties into funding further GDPR investigations, which fuels more audits and litigation, resulting in more archived data. The mandates and laws continue to expand, with California, Nevada, and Brazil creating their own data protection regulations that individuals and corporations are directed to follow wherever they are physically located.
Today, the third wave of archival is upon us, and with the machines, there are no limits. One big thinker with intimate knowledge of the third wave from an edge perspective is Mark Thiele, CEO and Founder at Edgevana Inc. There is increasing attention and scrutiny on data custodianship, and growing demands for data integrity and authenticity. With the new extended lifecycle, retention requirements and policy enforcement are becoming standard practice, as data volumes increase faster than they can be archived appropriately.
David Morris serves as VP of product and global marketing for FalconStor. He has more than 25 years of leadership experience in storage systems and in storage, information, and compliance management. Before FalconStor, he worked with Cisco and Huawei to develop and define new strategic imperatives as the third era of IT disrupted the technology sector. Recognized for his ability to identify new markets and develop targeted solutions, Morris has worked with private-equity-backed companies to position them in emerging high-growth markets, including Kazeon, acquired by EMC, and Cetas, acquired by VMware. At NET, he led the storage and network division turnaround efforts and repositioned the company to raise $85 million in a private placement of public equity (PIPE) and its subsequent acquisition by Sonus. He holds graduate degrees in marketing from the University of California, Berkeley-Haas, in finance from Columbia University in the City of New York, and in engineering from George Washington University, as well as a Bachelor’s in Physics from Auburn University. He currently advises Aerwave, a next-gen security company, and Brite Discovery, a GDPR compliance and eDiscovery company. He is active in and supports Compass Family Services, which serves homeless and at-risk families in San Francisco, The Tech Museum of Innovation in San Jose, CA, and The American Indian Science and Engineering Society.

Published Monday, August 03, 2020 7:34 AM by David Marshall