Originally Published on September 20, 2020 – https://www.linkedin.com/pulse/100-year-archive-storage-dilemma-exploratory-essay-david-morris/
by David Morris, VP of Product and Global Marketing – FalconStor
The future of archival storage is taking shape in the ongoing digitalization of information and the ever-expanding compliance, regulatory, eDiscovery, privacy, and historical data retention mandates. On the media side, the Storage Networking Industry Association (SNIA) works to resolve the “100-year archive dilemma” with existing standard technologies. The Active Archive Alliance investigates online access, searchability, and retrieval of long-term data, and explores virtually unlimited scalability to accommodate future growth as retention mandates expand. With retention periods extending beyond 100 years and historical data retained in perpetuity, data accumulation is accelerating, and archiving and data custodians will play an increasingly important role. Given that pace of accumulation, the archive is becoming one of the most valuable assets in the storage and networking industry for retaining data throughout its lifecycle.
Data ownership is transferred from the IT operational side, which leverages active or operational data in the short term, to archival data custodians, who manage data retention, integrity, and accessibility over the long term. The custodian’s mission is to preserve data and actively maintain the integrity of the data assets under management throughout a lifecycle that must be designed to last a hundred years, and perhaps much longer. With that data growth, the archive custodian role will be a high-growth job opportunity for decades to come. FalconStor bifurcates the storage market into Operational Data and Archival Data to separate data ownership over the data lifecycle: see the below chart.
Historically, storage vendors designed storage platforms with little consideration for data longevity, data migration, or cross-platform interoperability. This “system-centric approach” was rational in the early days, as storage companies competed to design differentiated platforms to win customers. Interoperability and data migration have been nonexistent priorities, as vendors worried that easy data migration would let customers leave their platforms for competitors’ systems. Today, new storage drivers and requirements have significantly changed that viewpoint and approach. It is no longer relevant or practical to focus on a hardware system that sequesters data and data accessibility within the confines of the system’s short lifetime.
The end of a storage system’s life, depending on the system, comes after five to ten years, at which point the data must be copied to a new storage system. Because data has typically been entangled with the storage system, copying it while maintaining its integrity can be a time-intensive and costly exercise. The time and cost increase significantly if compliance and legal retention features must be preserved during the migration to a new system.
As the storage industry advances, what matters is the data: its longevity, portability, integrity, and accessibility. This viewpoint leads us to break from the short-lived “system-centric” approach and reassess, reconsider, and rearchitect how we store data, because it is the data that has always been the invaluable asset, not the storage system.
Numerous vendors espouse that their tape systems or their hard drives will be the medium to store data on for the next 100 years. Since long-term storage is the challenge of the day, the temptation is to find a storage medium that lasts as long as possible. But tying data to a specific short-lived hardware system is irrational. Components like spinning hard drives are already approaching obsolescence; it is challenging to find a laptop without a solid-state drive today. Another contender is tape. Tape’s claim to fame is that it is the lowest-cost medium. In most cases the claim is valid, but it comes with the lowest accessibility, and one must also accept that tape has the highest failure rate: more attestation letters for storage-medium failure and data loss have been submitted to compliance and judiciary bodies for tape systems than for any other medium. Although tape meets today’s minimal legal retention mandates, existing regulations and laws for data retention and reinstatement will change as compliance and judiciary bodies come to understand that other storage methodologies offer far better long-term survivability than tape, individual hard drives, or traditional storage systems. With today’s technologies and methods, data-loss attestation will become unacceptable, just as the lack of electronic discovery became unacceptable in 2008 with the Qualcomm v. Broadcom lawsuit. The medium alone is not a solution; it is a sub-argument of the flawed system-centric approach, and a traditional system-centric approach will not take us into the future growth of archival storage. Absent a new long-term storage technology, we need an alternative approach to bridge the gap.
If we start a thought experiment with the need to retain data for 100+ years and work backward, it is apparent that the existing system-centric approach would demand we copy information to new systems ten to twenty-five times over the data retention lifecycle. We would run into human errors, data and operating system incompatibilities, and hardware incompatibilities, along with challenges in data accessibility and application availability. Will the application or database needed to access the data still exist in 50 years? Likely not. Even if the application vendor still exists in 100 years, the application and its architecture will have changed enough to be incompatible with 50-year-old data, much less 100-year-old data. This thought experiment leads us to understand that we need a tandem data and application strategy to access data in the future. Today’s continuous development and continuous deployment software methodologies lead to the need to virtualize both the data and the application away from the underlying operating system and hardware. Both change so rapidly that stored data and applications carrying OS and hardware dependencies would quickly slide into obsolescence, rendering the application and data inaccessible in a relatively short time.
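The compounding risk of the thought experiment can be made concrete with back-of-the-envelope arithmetic. The per-migration success rate below is a hypothetical assumption chosen purely for illustration, not a measured industry figure:

```python
# How many system-to-system copies does a 100-year retention mandate
# imply, and how does even a small per-migration error rate compound?
# The 99.5% per-migration success rate is a hypothetical assumption.

RETENTION_YEARS = 100

for system_lifetime in (10, 5):      # typical 5- to 10-year system life
    migrations = RETENTION_YEARS // system_lifetime
    p_per_migration = 0.995          # assumed chance one copy is flawless
    p_all_ok = p_per_migration ** migrations
    print(f"{system_lifetime}-year systems: {migrations} migrations, "
          f"P(all flawless) = {p_all_ok:.1%}")
```

Even under this optimistic assumption, the probability that every copy over a century is flawless erodes with each migration, which is the essay's core objection to the system-centric approach.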
Furthermore, the storage method for both the data and the application must be futureproofed for a sufficient time so that the data can be reinstated easily. The technology would need to provide a robust, durable API that is backward compatible, so the data and application can leverage new operating system and hardware versions on which to execute. Lastly, we need a tiered capability that can migrate data as its value and accessibility requirements change over time, keeping its retention cost in line with its value. Data migration must be seamless across storage systems and clouds, and migration must not affect the data or its metadata.
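The tiering requirement above can be sketched as a simple policy that maps a data set's age and access pattern to a storage tier. The tier names and thresholds here are hypothetical illustrations, not part of any FalconStor product:

```python
from dataclasses import dataclass

@dataclass
class ArchiveObject:
    name: str
    age_years: float          # time since last write
    accesses_per_year: float  # observed retrieval frequency

def choose_tier(obj: ArchiveObject) -> str:
    """Map an object's age and access pattern to a storage tier.

    Thresholds are illustrative; a real policy would be driven by
    cost, compliance class, and SLA metadata rather than constants.
    """
    if obj.accesses_per_year >= 12:
        return "hot"    # frequently read: fast, costly storage
    if obj.age_years < 7 or obj.accesses_per_year >= 1:
        return "warm"   # occasionally read: standard object storage
    return "cold"       # dormant: cheapest long-term tier

records = [
    ArchiveObject("q3-financials", age_years=0.5, accesses_per_year=40),
    ArchiveObject("2015-emails", age_years=9, accesses_per_year=2),
    ArchiveObject("1998-blueprints", age_years=26, accesses_per_year=0.1),
]
for r in records:
    print(r.name, "->", choose_tier(r))
```

Because the policy reads only metadata, re-evaluating it periodically lets data drift to cheaper tiers as its value declines, without touching the data or its metadata during the move.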
Upon consideration of these arguments, a new architecture is needed for storing and archiving data, one that obviates the system-centric and medium-centric drawbacks. If we invert the problem and approach it from the opposite direction, a new paradigm emerges: instead of a “System-Centric Approach,” its opposite, a “Data-Centric Approach.” The data is the valuable asset, with a lifecycle that far exceeds that of any storage system ever developed or likely to be developed in the foreseeable future.
To deliver this “Data-Centric Approach,” a solution was developed on Linux-based containers. It breaks traditional data storage limitations and enables enterprises to store archive data across data centers and public, private, hybrid, and multi-cloud storage environments.
Built on industry-standard software container technology, the container applies virtualization at the application layer rather than the systems layer. This disaggregates data storage and applications from the system-level components and the operating system, delivering a persistent, long-term data preservation container that is agnostic, heterogeneous, and highly portable. Riding the growing momentum of the open-source community and containers, StorSafe is backward compatible and positioned to remain compatible with future technology advancements for the foreseeable future.
The container runtime environment has also allowed the development of robust active and passive features and functionality, allowing applications to be stored in tandem with their data. Storing both the application and its data within the same container preserves future accessibility to both, exactly as they were on the day they were archived. As retention and reinstatement requirements become the standard (see VMblog Expert Interview: David Morris of FalconStor Answers Five Questions on the Retention and Reinstatement Archive – The New Storage Subsegment), this execution capability will be necessary to meet the integrity, validation, and journal-based auditing requirements of extended retention periods in both data centers and cloud environments.
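The integrity, validation, and journaling requirements above can be sketched minimally: an archive bundle that records a SHA-256 fixity digest for each payload (application image and data alike) at archive time, and appends every later verification to an audit journal. The structure is purely illustrative and is not the StorSafe container format:

```python
import hashlib
import json
import time

def sha256(data: bytes) -> str:
    """Fixity digest for a payload."""
    return hashlib.sha256(data).hexdigest()

# Hypothetical bundle: the application and its data archived together,
# each with a digest recorded on the day of archiving.
payloads = {
    "app/ledger-app.img": b"<container image bytes>",
    "data/ledger-2020.db": b"<database bytes>",
}
manifest = {name: sha256(blob) for name, blob in payloads.items()}

journal = []  # append-only audit trail of integrity checks

def verify(name: str, blob: bytes) -> bool:
    """Re-hash a payload, compare to the manifest, journal the result."""
    ok = sha256(blob) == manifest[name]
    journal.append({"ts": time.time(), "object": name, "intact": ok})
    return ok

# An intact payload passes; a tampered one is caught and journaled.
assert verify("data/ledger-2020.db", payloads["data/ledger-2020.db"])
assert not verify("data/ledger-2020.db", b"tampered bytes")
print(json.dumps(journal, indent=2))
```

Because the journal is append-only and every check is recorded, pass or fail, it can serve as the kind of evidence trail that compliance and judiciary bodies expect over an extended retention period.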
The “Data-Centric Approach” starts with a data-first ideology that takes 100-year data longevity as a given. It obviates the system challenges, medium challenges, and migration challenges currently hampering industry-wide efforts to solve the ever-growing and compounding archival data challenge. It is the first feasible long-term data storage and archival approach that is not predicated on the traditional monolithic or semi-monolithic ideology, and the data-and-application tandem capability that containers offer ensures continued access to both in a most elegant manner.