Unstructured Information Management – What You Don’t Know Can Hurt You!

Originally published November 06, 2008

Companies large and small create an impressive amount of data, including email messages, documents, and presentations. Most of that data is unstructured, existing primarily on corporate file servers and on employee desktop and notebook computers. Industry analysts estimate that this unstructured data accounts for 80% of all corporate information, and expect it to grow 50% or more each year.
Unstructured Information is Unmanaged Information

Unstructured data is typically unmanaged. The file system on which this information generally resides is not monitored, and the content is practically invisible to employees, auditors, and corporate compliance officers. To provide a greater degree of visibility, control, and management of this information to meet compliance reporting requirements, companies have implemented one or more technologies, each of which has advantages and disadvantages:
Enterprise search? An enterprise search engine is an effective way to index and find documents that contain certain terms. Most are easy to implement and require only a modicum of regular maintenance. Unfortunately, most enterprise search engines are tuned to find all the documents that may contain a particular term, rather than the specific document an auditor may require. It is left to the user to winnow through all the returned documents to find what they need, which can be a time-consuming and costly exercise. Additionally, most search engines lack the ability to manage the documents they index.
Enterprise Content Management? ECM systems can effectively manage many types of content and provide access and version control, both of which are effective aspects of information management. ECM systems also tend to be very expensive to set up and maintain. These systems typically require an organization to purchase server and user licenses, implement policies and processes for using the system, and train its users. Because of these costs, companies often limit their ECM implementation to specific areas of their business or types of data, such as documents pertaining to finance. According to many analyst organizations, ECM systems are being used to manage approximately five percent of today’s corporate information.
File Backup? Many companies attempt to solve document retention by creating regular backups of all the data on the network. These backups are saved to tapes, which are then stored offsite for disaster recovery purposes. Backing up all data regardless of its business value is an inefficient use of time and resources, increases the cost of tape storage, and decreases the likelihood of rapid single file recovery, which is the most-used aspect of file backup.
Doing nothing? This is the solution that many companies choose to handle unstructured information. Unfortunately, the prevailing thought among many has been that unstructured information is insignificant and therefore, does not require management. After all, most of this information ranges from personal files to draft documents or one of the dozens of copies of sales presentations, the majority of which aren’t worth the cost required to manage them.
While most files aren’t worth managing, the risk comes from the small number of files that do matter. For instance, your Sarbanes-Oxley policy and procedure manual, which took valuable internal resources, a consulting firm, and many months to create, has likely been copied from the content management system specially created for finance-related documents. The next time you update that manual with critical information, you have fulfilled one aspect of the act by tracking and recording those changes in your records management system. However, what about the dozens of copies that may have spread across the network on shared file servers? How can you be certain those copies are deleted or updated to keep people from following old procedures or controls? If you aren’t doing anything to manage that data, you are leaving your company exposed and vulnerable.
Recognizing Valuable Information
Addressing these issues is key to an effective solution for Sarbanes-Oxley or any information governance initiative. Obviously, doing nothing is not the answer. At the same time, it would be cost-prohibitive to manage all files as though they were critical business records. Therefore, the ability to specify which data is critical and worthy of this level of management is a crucial first step. If you are aware of the data's value, you can make educated decisions about the disposition of important data and create an appropriate retention policy.
Determining data's value is the result of effective information visibility and control.
Information Visibility
The first aspect of recognizing valuable data requires that it be visible. While your compliance office may have access to all corporate information across the network, the sheer amount of data necessitates the use of technology to find and manage the appropriate documents.

Information Control
To effectively manage and control unstructured information, you need a solution that allows you to copy, move, delete, or tag documents with custom metadata, i.e., information about the document. Even better, the solution should provide an integrated policy engine that can be customized to your company's information governance regulations. For instance, a policy might mandate that any document on the employee network containing a customer account number be 1) tagged with the custom metadata "Customer," and 2) moved to a secured server or file archive system.
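As an illustration only (not any specific product's API), such a policy rule can be expressed in a few lines of code. The account-number pattern, tag name, and action labels below are hypothetical:

```python
import re

# Hypothetical policy: documents containing a customer account number
# (assumed here to look like "ACCT-0012345") are tagged "Customer"
# and earmarked for relocation to a secured server or archive.
ACCOUNT_PATTERN = re.compile(r"\bACCT-\d{7}\b")

def apply_customer_policy(text, metadata):
    """Tag a document and decide whether it must be moved."""
    if ACCOUNT_PATTERN.search(text):
        metadata["classification"] = "Customer"
        return metadata, "move-to-secure-archive"
    return metadata, "leave-in-place"

meta, action = apply_customer_policy("Re: ACCT-0012345 billing", {})
# meta is tagged "Customer" and the action is "move-to-secure-archive"
```

A real policy engine would evaluate many such rules against each document and log the actions it takes; the point is that the rule is declarative and repeatable rather than applied by hand.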
Data classification is an important aspect of information visibility and control. Several products have emerged or expanded into this space to offer an all-embracing solution for complying with Sarbanes-Oxley and other regulations. By implementing one of these data classification systems, documents on your network can be located, opened, and tagged according to the content found within each document. A typical classification workflow might look something like this:

  1. Catalog: The system scans the file systems, finding and collecting file metadata from hundreds of file types.
  2. Classify: Opening each document, the system classifies data according to file attributes and keywords or word patterns, and tags it with custom metadata according to preset policies.
  3. Search: The system allows users to find the desired information based on a combination of metadata and full document text, utilizing standard Windows and UNIX access control lists.
  4. Report: The system should allow appropriate users to create and access summary or detailed reporting functionality.
  5. Act: Finally, the system should integrate actions, such as tagging files with custom metadata, setting retention and monitoring policies, and offering move, copy and delete functionality, again based upon an access control list.
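The first three steps above can be sketched in Python. Everything here, the file walk, the keyword-to-tag table, and the tag names, is illustrative rather than a description of any particular product:

```python
import os

# Step 1 - Catalog: walk the file system and collect basic file metadata.
def catalog(root):
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            yield {"path": path, "size": os.path.getsize(path)}

# Step 2 - Classify: tag each record from keywords found in its content.
KEYWORD_TAGS = {"account number": "Customer", "quarterly results": "SOX"}

def classify(record, text):
    lowered = text.lower()
    record["tags"] = [tag for kw, tag in KEYWORD_TAGS.items() if kw in lowered]
    return record

# Step 3 - Search: filter the catalog by tag (ACL checks omitted here).
def search(records, tag):
    return [r for r in records if tag in r.get("tags", [])]

# Steps 4 and 5 - Report and Act - would summarize the catalog and apply
# move/copy/delete or retention actions to the matching records.
```

The design point is that search operates on tags assigned by policy, not on raw text alone, which is what makes the later "Act" step automatable.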
In contrast, an enterprise search engine provides an efficient method to find content that contains the search term you need. But then what? If you wanted to copy, move, delete, or perhaps tag the document with customized metadata, you would have to do so manually.
Data Retention, Availability, and Recovery

Retention is another aspect of corporate information that cannot be overlooked. While many companies elect to back up all data on a weekly or monthly basis, the cost in time and resources increases as the amount of data grows. Knowing what is in your data, by making information visible, tagging it with metadata, and controlling access, allows you to intelligently create a retention policy that moves or backs up only the data needed to comply with your corporate information governance policy or government regulation.
Most organizations use a backup solution that periodically copies data to tape or disk drives. An organization may back up its mission-critical data every night and all of its data every week. It may store the backup tapes for up to six months to guard against accidental deletions, send tape copies offsite as a safeguard against disaster, and retain backup tapes long-term to meet regulatory requirements.

Lacking the means to gauge the value of the data, companies often take the safe route and back up all of it. Not only is this approach inefficient, it indicates poor data management and creates a potential risk. Data that is stored but not required to be kept can be used against a company in the event of a lawsuit or regulatory compliance issue. In this respect, backing up data in its entirety creates a liability.

Corporations can meet regulatory data retention requirements, cut backup and recovery costs, and manage risk by introducing file archiving into the mix.

A file archiving system uses data classification to determine the content's value, then moves or copies files according to that value. File archiving systems can find and retrieve files based on their content. Any number of parameters can be used, including author, date, and customized tags such as SEC 17a-4 or Sarbanes-Oxley.

This naturally leads us to the tiering of storage services. Backup and file archiving are natural places to start when providing tiered storage services based upon the value of the data in your network.
As an example, consider a company that has 10 terabytes (TB) of data on production file servers. In the past, the company may have backed up critical files onto disk storage and then backed up all files onto tape once a week. The company cataloged the tapes, kept them for three months, and then cycled them back through the process. New government regulations mandate that all data related to quarterly financial results must be kept for five years. Unfortunately, the company has no way to differentiate among the disparate types of data on its network. The company is forced to retain all of the data for five years, expanding the amount retained from 10 TB to 2.5 petabytes (PB). As data amounts double annually, so will the amount that must be stored. The company will find itself devoting more and more time and resources to data backup.

To solve this problem, let us assume that the company implemented a data classification system. By discovering the value of its unstructured information and tagging it accordingly, the company copied 500 GB of financial reporting data to WORM storage for long-term retention and moved 7 TB to tiered storage, which is backed up to tape every three months. The data in three-month storage would total 42 TB, compared with the 2.5 PB that would have been required if the data had not been archived. With tiered storage, the company significantly reduced backup time and resources, shrank the cost of production file storage, and increased its IT service levels by freeing personnel and data for other tasks.
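The arithmetic behind these figures can be checked directly. Assuming roughly 50 weekly full backups per year retained for five years in the unclassified case, and noting that the 42 TB figure corresponds to six retained quarterly copies of the archived 7 TB tier:

```python
TB = 1
PB = 1000 * TB  # decimal units, as storage capacities are usually quoted

# Without classification: weekly full backups of all 10 TB,
# retained for five years (~50 weeks/year * 5 years = 250 copies).
unclassified_total = 10 * TB * 50 * 5   # = 2500 TB = 2.5 PB

# With classification: 500 GB goes to WORM storage, and the 7 TB
# tiered-storage copy is backed up quarterly; six retained copies
# reproduce the article's 42 TB figure.
tiered_total = 7 * TB * 6               # = 42 TB
```

Under these assumptions the retained volume drops from 2.5 PB to 42 TB, a reduction of roughly 98%, which is the whole argument for classifying before backing up.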
Tiering your data storage services allows you to put SOX controls only around the data that pertains to your financial information and lock down the appropriate data on compliance-specific storage boxes.

Proving Compliance

The old adage is true: the best defense is a good offense. In the case of Sarbanes-Oxley compliance, the best offense is to create and implement provable policies. Having a data classification system allows you to produce standard reports that show duplicate copies of applicable documents, that show who has accessed a file within a specific time period, and that monitor implementation of your information governance policies. With reporting functionality available in a dashboard implementation, you can think of your system as a burglar alarm: a deterrent to potential wrongdoing and a way to prove that you're actively checking for compliance-related issues.
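One such report, finding duplicate copies of a governed document scattered across file shares, can be sketched with content hashing. The paths and file contents below are hypothetical:

```python
import hashlib
from collections import defaultdict

def content_hash(data: bytes) -> str:
    """Fingerprint file content so byte-identical copies hash alike."""
    return hashlib.sha256(data).hexdigest()

def duplicate_report(files):
    """Group (path, content) pairs by fingerprint; keep groups > 1."""
    groups = defaultdict(list)
    for path, data in files:
        groups[content_hash(data)].append(path)
    return {h: paths for h, paths in groups.items() if len(paths) > 1}

# Hypothetical scan results: the SOX manual copied onto a second share.
report = duplicate_report([
    ("/finance/sox_manual_v3.doc", b"controls..."),
    ("/shared/sales/sox_manual_v3.doc", b"controls..."),
    ("/shared/hr/handbook.doc", b"benefits..."),
])
# report contains one group listing both copies of the SOX manual
```

A report like this is what lets a compliance officer answer the question raised earlier: how many stale copies of the policy manual are still sitting on shared servers?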
Best Practices

Implementing one of today's data classification systems should be an integral part of your Sarbanes-Oxley best practices. Setting information governance policies fulfills an essential requirement. Active management of your unstructured data will find, tag, and move content according to your corporate policies, lowering the risk that information will fall through the cracks and potentially protecting you from breaking the law. Creating a tiered storage system will allow you to set retention policies according to the value of the content, saving money and reducing risk. And proving compliance, or at least showing that you're attempting to comply, is sometimes the best way to meet and exceed current and future government regulations around financial systems and employee and customer privacy.
Reducing Risk and Lowering Costs

In the end, visibility and control of your unstructured information reduce the risks of compliance violations, litigation exposure, untimely responses, and privacy and security breaches, and lower costs through streamlined storage operations, improved service levels, and automated policy-driven data management.

David Morris – Morris Bytes


David Morris is a technology and business executive with 20+ years of management & high-growth experience in both startup & public companies. His experience spans technology development & innovation, business strategy & management, corporate & business development, engineering, & marketing roles. Recognized for his ability to identify new emerging markets, develop targeted solutions, and create accretive strategic imperatives, David has worked with and advised private equity backed and public companies to position them into high-growth markets, including Kazeon, acquired by EMC, and Cetas, acquired by VMware. With a reputation as a technology thought leader and evangelist through blogs, articles, and speaking engagements, he has advised numerous companies on emerging technology market trends and the impact of disruptive technologies on existing business models. David has founded two companies, launched six companies, and had two successful public turnarounds. His technology experience is across compute, networking, storage, compliance, eDiscovery, SaaS, IoT, cybersecurity, Linux containers for DevOps & Storage, & AI solutions. David holds graduate degrees in Marketing from the University of California, Berkeley-Haas, in Finance from Columbia University in the City of New York, and in Engineering from George Washington University, as well as a Bachelor's in Physics from Auburn University. He currently advises Aerwave, a next-gen security company, Loop, and Brite Discovery, a GDPR compliance and eDiscovery company. He is active in and a longtime supporter of Compass Family Services, which serves homeless and at-risk families in San Francisco, The Tech Interactive in San Jose, CA, and The American Indian Science and Engineering Society. In his off time, David enjoys cycling, weightlifting, and scuba diving (especially in Belize). LinkedIn: https://www.linkedin.com/in/jdavidmorris
