The explosive growth of hard drive storage capacity over the past 20 years has enabled companies and users to store more and more data. Drives that were once measured in megabytes are now measured in terabytes, millions of megabytes. While this increased capacity has made it possible to store new types of information such as video and audio, it has also allowed companies to compile staggeringly large stockpiles of data that are either inherently useless or effectively useless because they aren't indexed in a way that makes them easy to find and use.
For many companies, data management regimes can be summed up in three words: buy more drives. As data builds up, they simply add storage and, in many cases, leave the old data intact. Over time, this leads to archives that are clogged with unnecessary data and that reduce productivity by making it harder for users to find the information they need. Just as a pack rat's home can reach a tipping point where it becomes uninhabitable, a data center can reach a point where it becomes unmanageable.
To address this problem, the data management industry has come up with two solutions. One is to eliminate unnecessary information. The other is to take archived data and make it usable. In other words, data management experts help companies dump or use their archived data.
Dump It: Data Quality Software
Data quality software helps companies create and implement data retention procedures. IT managers establish policies for what data should be retained and what should be removed. The software then scans the company's storage to determine which data fits the policy and which falls outside of it. Data that is within policy gets retained, indexed, or both; data that is outside the policy gets automatically deleted. This controls the buildup of unnecessary information and keeps data storage facilities running smoothly.
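As a rough sketch of the kind of policy sweep this software automates, the hypothetical Java program below (not modeled on any particular vendor's product) walks an archive directory, keeps anything modified within an assumed seven-year retention window or stored under a legal-hold folder, and flags everything else as out of policy:

```java
import java.io.IOException;
import java.nio.file.*;
import java.time.Duration;
import java.time.Instant;
import java.util.stream.Stream;

public class RetentionSweep {
    // Hypothetical policy: keep anything touched within the last seven years,
    // plus anything stored under a directory flagged as "legal-hold".
    static final Duration RETENTION = Duration.ofDays(7 * 365);

    public static void main(String[] args) throws IOException {
        Path root = Paths.get(args[0]);               // archive root to scan
        Instant cutoff = Instant.now().minus(RETENTION);

        try (Stream<Path> files = Files.walk(root)) {
            files.filter(Files::isRegularFile)
                 .filter(p -> !p.toString().contains("legal-hold")) // in-policy exception
                 .filter(p -> lastModified(p).isBefore(cutoff))     // older than the window
                 .forEach(RetentionSweep::expire);
        }
    }

    static Instant lastModified(Path p) {
        try {
            return Files.getLastModifiedTime(p).toInstant();
        } catch (IOException e) {
            return Instant.MAX;  // unreadable metadata: err on the side of keeping the file
        }
    }

    static void expire(Path p) {
        // A real tool would quarantine or delete here, e.g. Files.delete(p);
        System.out.println("Out of policy, would delete: " + p);
    }
}
```

A production tool would go further: indexing the data it keeps, logging every decision for auditors, and quarantining files for a grace period before anything is permanently deleted.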
Use It: Hadoop
Big data programs like the open-source Hadoop package have enabled companies to take a different course with large data repositories. These programs are designed to take extremely large quantities of data, usually measured in petabytes (a petabyte is 1,000 terabytes), and quickly organize it so that it can be accessed and used strategically. With these programs, companies can turn extremely large stores of data into useful resources for data mining and research. By their nature, these programs are better suited to extremely large stores of data, such as a major company's entire email history, than to smaller pools of data like documents from terminated employees.
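To make the approach concrete, here is a minimal sketch of the classic Hadoop MapReduce pattern in Java, adapted to a hypothetical plain-text email archive in which lines beginning with "From: " identify the sender; the job counts how many messages each address sent. The class names and input format are illustrative assumptions, not part of Hadoop itself or of any real deployment.

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class SenderCount {

    // Map phase: for every "From: <address>" line, emit (address, 1).
    public static class SenderMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text sender = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            if (line.startsWith("From: ")) {
                sender.set(line.substring("From: ".length()).trim());
                context.write(sender, ONE);
            }
        }
    }

    // Reduce phase: sum the counts emitted for each address.
    public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "sender count");
        job.setJarByClass(SenderCount.class);
        job.setMapperClass(SenderMapper.class);
        job.setCombinerClass(SumReducer.class);
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // e.g. the archived mail dump
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // output directory (must not exist yet)
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

Hadoop splits the input across the cluster, runs the mappers in parallel, groups the intermediate pairs by key, and feeds each group to a reducer, which is why the same two functions scale from gigabytes on a single node to petabytes spread across thousands of machines.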
Ultimately, even indexing and organization aren't enough by themselves. To manage their data successfully, companies still need to implement data retention regimes. Whether this is done on an ad hoc basis or automated through data quality software, preventing the retention of useless information remains the key to keeping data storage systems running efficiently and quickly.