This PIG – Predictive Information Governance – Is Starting to Fly
Historically, end users have been responsible for their own information governance. Even as organizations implemented and communicated information governance policies, it was still up to the end user to decide which documents needed to be retained, and then to follow through and move the document(s) someplace for safekeeping and long-term management.
This usually happened when IT advised that you were out of storage and wouldn't be able to save or send any more documents until something was deleted. Then a mass deletion took place, typically based on little more than a file's date or size, with little thought given to whether the file needed to be archived for business, regulatory, or legal reasons.
Then about 15-20 years ago, vendors started offering full-scale electronic "records management systems" designed to make information management easier. For the most part, however, these solutions didn't really have the power to: 1.) decide whether a file needed to be kept for the reasons stated above; 2.) decide how long a file should be kept; and/or 3.) take action by moving the file somewhere for long-term storage, protection, and management.
Today, even the largest and most sophisticated organizations are still trying to get their arms around their information governance policies. And however cliché it is to say, every organization, whether private, public, government, for-profit, or non-profit, is facing an exponential explosion of data. The idea that end users alone (with a bit of software support) should be responsible for seeing that information governance policies are adhered to is neither realistic nor practical.
Given the potentially severe consequences of noncompliance, most organizations have focused on perfecting regulatory and legal records management. But that data typically accounts for just 6-10 percent of an organization's overall data, which leaves quite a bit in the hands of individual users to manage. And if we are being realistic, that means the majority of an organization's data is not being managed at all (sticking a file in a folder on your desktop, or in an email folder, does not qualify as "management").
The Holy Grail for information governance has become error-free, intelligent automation that removes end users from the process. (By the way, end users are fine with this, too. After all, who wants to add this responsibility to their ever-growing list? For most, it certainly doesn't fall under their core competencies, and they are likely not measured, paid, or bonused on how well they assisted their organization with information governance come review time.)
During the first-day keynote at the Microsoft Inspire Conference, Microsoft CEO Satya Nadella spoke about intelligent automation targeting information governance issues. He talked about intelligent cloud platforms, intelligent archiving, and predictive intelligence that could address data and system issues before they occur. For instance, predictive intelligence could anticipate and make decisions about whether content is subject to compliance or legal mandates, how long it should be kept, in what location(s), any limitations on access, any added security stipulations, and so on, thereby relieving end users of responsibility and providing IT management with streamlined efficiency and ensured effectiveness.
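To make that concrete, here is a minimal sketch of the kind of per-document decision record such a predictive system might produce. Every field name is a hypothetical illustration drawn from the dimensions listed above, not an actual Microsoft API.

```python
# A minimal sketch of the per-document decision a predictive governance
# system might emit. All field names are hypothetical illustrations, not
# any actual Microsoft API.
from dataclasses import dataclass, field

@dataclass
class GovernanceDecision:
    document_id: str
    subject_to_compliance: bool            # e.g., regulatory mandates apply
    subject_to_legal_hold: bool            # e.g., pending litigation
    retention_years: int                   # how long it should be kept
    storage_locations: list[str] = field(default_factory=list)
    access_roles: list[str] = field(default_factory=list)   # who may read it
    encryption_required: bool = True       # added security stipulations

# The system would produce one of these per document, with no end-user input
decision = GovernanceDecision(
    document_id="doc-42",
    subject_to_compliance=True,
    subject_to_legal_hold=False,
    retention_years=7,
    storage_locations=["us-east-archive"],
    access_roles=["legal", "records-manager"],
)
print(decision)
```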
Step #1: Machine Learning and Predictive Coding
Years ago, I worked in the eDiscovery industry, where we successfully established predictive coding as a time-saving and cost-reducing technology to automate the process of first-pass culling and review of eDiscovery data sets.
Prior to predictive coding, companies gathered huge numbers of potentially relevant documents based on simple keyword searches, and then paid teams of lawyers and paralegals to read and make a decision on each one. This could total millions or tens of millions of documents. As you can imagine, this process drove up the cost of eDiscovery. In fact, several years ago, the average cost of a single eDiscovery matter was approximately $1.5 million, not including the actual trial or judgments.
The most common machine learning technology used was supervised machine learning, which enabled those assembling eDiscovery result sets to train computers to recognize relevant content and "meaning" based on supplied examples. This supervised approach relied on iterative training cycles that fed error rates back into the system (i.e., which documents it marked as responsive were correct and which were wrong). There might be 2, 10, 30, 50, or more training cycles, and more cycles usually equated to a lower error rate. For eDiscovery you wanted an error rate of less than 2 percent, as opposed to manual culling, which could average 20 to 50 percent. The ability to deliver consistently low error rates meant that courts finally began to accept predictive coding as a reliable and acceptable tool for eDiscovery.
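For readers who like to see the moving parts, here is a minimal sketch of supervised relevance classification in that spirit, using scikit-learn. The documents, labels, and threshold are toy illustrations, not anything from an actual eDiscovery product.

```python
# A toy sketch of supervised relevance classification in the spirit of
# predictive coding. Documents, labels, and the split are illustrative only.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Reviewer-labeled seed set: 1 = responsive, 0 = not responsive
documents = [
    "Q3 merger negotiations with Acme Corp",
    "Cafeteria lunch menu for next week",
    "Draft purchase agreement for the Acme acquisition",
    "Office holiday party planning notes",
]
labels = [1, 0, 1, 0]

X = TfidfVectorizer().fit_transform(documents)
X_train, X_test, y_train, y_test = train_test_split(
    X, labels, test_size=0.5, stratify=labels, random_state=0
)

# One "training cycle": fit on reviewed examples, then measure the error
# rate on held-out documents. Real workflows repeat this, feeding newly
# reviewed documents back in until the error rate is acceptably low.
model = LogisticRegression().fit(X_train, y_train)
error_rate = 1.0 - accuracy_score(y_test, model.predict(X_test))
print(f"Estimated error rate: {error_rate:.1%}")
```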
Step #2: Predictive Information Governance (PIG)
To start, there were a few companies that offered semi-automated content intelligence, relying on black-box algorithms and massive software installations. While they were somewhat successful at recognizing and categorizing documents correctly, they still relied on individuals to train the software each time, and they were still very expensive. The keys to broader adoption of PIG became obvious: unsupervised machine learning and the cloud.
As the name implies, unsupervised machine learning (computers teaching themselves) removes the human factor. And once the error rates are low enough, truly automated predictive information governance is realized: the system can automatically collect, categorize, store, and protect content, safeguard access to it, and apply retention/disposal rules.
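As a minimal sketch of the unsupervised half of that idea, the snippet below clusters documents with k-means on TF-IDF features and then applies retention rules per cluster. The cluster-to-rule mapping is hypothetical; a production system would need validated categories, confidence thresholds, and far richer signals.

```python
# A toy sketch of unsupervised document categorization: k-means clustering
# on TF-IDF features, with hypothetical retention rules applied per cluster.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

documents = [
    "Invoice #1234 for consulting services rendered",
    "Invoice #5678, payment due net 30",
    "Employee handbook: vacation and leave policy",
    "HR policy update: remote work guidelines",
]

X = TfidfVectorizer().fit_transform(documents)

# Group similar documents with no human-labeled examples at all
cluster_ids = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Hypothetical mapping from discovered clusters to retention rules; in
# practice each cluster would be inspected and validated before any rule
# is attached, since cluster numbering is arbitrary.
retention_rules = {0: "retain 7 years (financial)", 1: "retain until superseded (policy)"}
for doc, cid in zip(documents, cluster_ids):
    print(f"{retention_rules[cid]:<35} <- {doc}")
```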
Now, float this capability to the cloud. A public cloud platform that could provide this machine learning/predictive technology would further ease the management burden, as well as lower costs. In other words, the cost of the capability would be shared by numerous organizations, versus each organization having to deploy and manage its own predictive information governance solution (and everything that goes with it).
Guess what? We are almost there! Microsoft's cloud, with the Azure platform and its services, is bringing the information governance industry within arm's reach of the PIG Holy Grail. As part of Azure, machine learning is now available to help organizations build advanced analytics and self-adapting security, among other things. And I believe (and hope) that automated data categorization and governance is not far behind.
Bill Tolson is a vice president at Archive360.