
An Introduction to Data Cleanup

By: Linda Coniglio, Information Governance and Privacy Specialist, CIPP/US

Big Data is the term used to describe the large, hard-to-manage volumes of data, both structured and unstructured, that inundate organizations at an ever-increasing rate. As evidenced by the growth of data in the world and the variety of sources creating that data, Big Data has become a huge organizational challenge.

Growth of Data

According to StatInvestor, the amount of data created in the world has grown from 0.1 zettabytes in 2005 to 47 zettabytes in 2020 and is expected to rise to 163 zettabytes by 2025.

Some have estimated the amount of data created by 2025 to be as much as 175 zettabytes, and most of this data is unstructured data, growing at 55 to 65 percent per year. One zettabyte is equal to a billion terabytes (TBs), or a trillion gigabytes (GBs), so that means by 2025, we may be creating as much as 175 trillion GBs! And according to Raconteur, the volume of data created daily by 2025 will be 463 exabytes (each exabyte is 1 billion GBs)!

Volume of Data Created Worldwide from 2005 to 2025 (Source: StatInvestor)

What is a Zettabyte?

A zettabyte is a measure of digital storage capacity, equal to 10^21 bytes (its binary counterpart, the zebibyte, is 2 to the 70th power bytes). It is also equal to a thousand exabytes, a billion terabytes or a trillion gigabytes. Put simply, one billion one-terabyte hard drives would be needed to store one zettabyte of data.

Due to the zettabyte unit of measurement being so large, it is only used to measure large aggregate amounts of data. Even all the data in the world is estimated to be only a few zettabytes.

1 kilobyte (KB)   =   1,000 bytes

1 megabyte (MB)   =   1,000,000 bytes

1 gigabyte (GB)   =   1,000,000,000 bytes

1 terabyte (TB)   =   1,000,000,000,000 bytes

1 petabyte (PB)   =   1,000,000,000,000,000 bytes

1 exabyte (EB)   =   1,000,000,000,000,000,000 bytes

1 zettabyte (ZB)   =   1,000,000,000,000,000,000,000 bytes

In addition, a useful analogy for understanding data volumes is to equate one byte of data to a single grain of rice. Here is what that volume looks like if we apply it to larger units of measure:

  • Byte: one grain of rice
  • Kilobyte: handful of rice
  • Megabyte: Big pot of rice
  • Gigabyte: Truck full of rice
  • Terabyte: Container ship full of rice
  • Petabyte: Covers the Province of Nova Scotia (55,284 km²)
  • Exabyte: Covers Western Canada
  • Zettabyte: Fills the Pacific Ocean (660,000,000 km³)
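The unit table above translates directly into code. Here is a minimal Python sketch (illustrative only) for converting between the decimal byte units listed:

```python
# Decimal (SI) byte units, as listed in the table above.
UNITS = {
    "KB": 10**3,
    "MB": 10**6,
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "EB": 10**18,
    "ZB": 10**21,
}

def to_bytes(value, unit):
    """Convert a value in the given unit to bytes."""
    return value * UNITS[unit]

def convert(value, from_unit, to_unit):
    """Convert between any two units, e.g. zettabytes to gigabytes."""
    return to_bytes(value, from_unit) / UNITS[to_unit]

# One zettabyte is a billion terabytes, or a trillion gigabytes:
print(convert(1, "ZB", "TB"))  # 1 billion TB
print(convert(1, "ZB", "GB"))  # 1 trillion GB
```

This confirms the figures quoted earlier: 175 ZB works out to 175 trillion GBs.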

Growth of Data (cont.)

To provide a frame of reference, the EDRM model was created in 2005. Do you think the creators of the EDRM model realized that the volume of data that the model would need to support would grow more than 1,630 times in 20 years? It’s no wonder that the Information Governance Reference Model (IGRM) has evolved to become such a large presence within the EDRM model over the years!

And the cost to store all this data adds up considerably. According to Gartner, the annual cost per raw TB of storage is $2,520 and the “fully loaded” cost (including hardware, maintenance, FTE(s), power & HVAC and connectivity) per TB is $3,351. Some estimates are as large as $9,555 per TB fully loaded! Those are just the storage costs per TB – not including any costs associated with business functions involving that data.
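As a quick sanity check on those figures, here is a short Python sketch; the per-TB rates are the Gartner numbers quoted above, and the 500 TB volume is a hypothetical example:

```python
RAW_COST_PER_TB = 2_520       # Gartner: annual raw storage cost per TB (USD)
LOADED_COST_PER_TB = 3_351    # Gartner: "fully loaded" annual cost per TB (USD)
HIGH_ESTIMATE_PER_TB = 9_555  # high-end fully loaded estimate (USD)

def annual_storage_cost(terabytes, rate=LOADED_COST_PER_TB):
    """Annual storage cost in USD for a given volume in TB."""
    return terabytes * rate

# For example, storing 500 TB for a year at the fully loaded rate:
print(f"${annual_storage_cost(500):,}")  # $1,675,500
```

Even at the raw rate, 500 TB runs over $1.2 million per year before any business use of the data is factored in.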

Variety of Data Sources

It’s no wonder why data is growing so much in organizations – it’s coming from a larger variety of data sources than ever and a lot of it is happening every minute of every day. The Internet Minute infographic for 2021 illustrates what happens in a typical minute just on the internet, including emails, text messages, social media activity and chat messages.

The variety of data sources significantly contributes to the Big Data challenge within organizations, putting pressure on them to address these data sources, most of which are unstructured, in information governance and eDiscovery.

What Happened in a Typical Internet Minute in 2021 (Source: Lori Lewis)

Redundant, Obsolete and Trivial (ROT) and Dark Data

Unfortunately, much of that significant increase of data in organizations is either data that the organization doesn’t need or isn’t using effectively to derive insights or make decisions that are business critical. This data falls into two categories: Redundant, Obsolete and Trivial (ROT) Data and Dark Data.

ROT data in an enterprise comprises:

  • Redundant data: duplicate data stored in multiple places within the same system or across multiple systems.
  • Obsolete data: data that has outlived its useful purpose.
  • Trivial data: information that does not need to be kept because it doesn't contribute to important business objectives, corporate knowledge, business insight or record-keeping requirements.

Dark data, by contrast, is data acquired through various computer network operations but not used in any manner to derive insights or for decision making.

According to a recent report from Veritas, 85% of stored data is either ROT or dark data. That means only 15% of data in organizations is classified as business-critical information! The ability for an organization to identify ROT and dark data quickly and effectively is vital to its ability to address today’s Big Data challenge, but it’s increasingly difficult to do so without leveraging technology as part of the solution.

How can the storage costs for ROT and Dark Data add up?

Here’s a scenario involving a healthcare company of 20,000 employees, with an average of 27.5 GB of stored data per employee (a reasonably modest amount). The total data stored for the organization in year 1 would be 537 TBs (20,000 × 27.5 GB, converted at 1 TB = 1,024 GB). Assume a ratio of 30% ROT data, 50% dark data and 20% useful business data, and a data growth rate of 60 to 65 percent per year. Here is what the total data stored for the company could look like over five years if no data is purged:
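The projection can be sketched in a few lines of Python. This is a model of the hypothetical scenario above, using the 62.5% midpoint of the stated growth range and decimal units (1 TB = 1,000 GB, which gives a year-1 total of 550 TB rather than the binary-converted 537 TB):

```python
EMPLOYEES = 20_000
GB_PER_EMPLOYEE = 27.5
GROWTH_RATE = 0.625   # midpoint of the 60-65% annual growth assumption
ROT_SHARE = 0.30
DARK_SHARE = 0.50

# Year-1 volume in TB (decimal: 1 TB = 1,000 GB)
year1_tb = EMPLOYEES * GB_PER_EMPLOYEE / 1_000

volumes = [year1_tb * (1 + GROWTH_RATE) ** year for year in range(5)]
for year, total in enumerate(volumes, start=1):
    rot_dark = total * (ROT_SHARE + DARK_SHARE)
    print(f"Year {year}: {total:,.0f} TB total, {rot_dark:,.0f} TB ROT/dark")
```

By year five the model reaches roughly 3,800 TB of total data, of which about 80% is ROT or dark.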

Data Growth in Terabytes Over Five Years for Large Healthcare Company

By the fifth year, the company would be storing over three petabytes (3,000 TBs) of ROT and dark data! Based on the Gartner fully loaded cost of $3,351 per TB, here are the storage costs over those five years:

Annual Storage Costs Over Five Years for Large Healthcare Company

By year five, the storage costs alone for ROT and dark data would be over $10 million! And the total storage costs for ROT and dark data over five years would be $24 million! That doesn’t include additional costs associated with the use and processing of data to support eDiscovery, privacy requests and other business needs.
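Continuing the same hypothetical scenario, the cumulative cost of storing just the ROT and dark data can be estimated as follows (again using the 62.5% midpoint growth rate, a decimal 550 TB year-1 volume, and the Gartner fully loaded rate):

```python
LOADED_COST_PER_TB = 3_351  # Gartner fully loaded annual cost per TB (USD)
GROWTH_RATE = 0.625         # midpoint of the 60-65% growth assumption
ROT_DARK_SHARE = 0.80       # 30% ROT + 50% dark
YEAR1_TB = 550              # from the 20,000-employee scenario above

total_cost = 0
for year in range(5):
    volume = YEAR1_TB * (1 + GROWTH_RATE) ** year
    rot_dark_cost = volume * ROT_DARK_SHARE * LOADED_COST_PER_TB
    total_cost += rot_dark_cost
    print(f"Year {year + 1}: ${rot_dark_cost:,.0f} for ROT/dark storage")
print(f"Five-year total: ${total_cost:,.0f}")
```

The model lands on roughly $10 million in year five and about $24 million cumulative, consistent with the figures above.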


An effective information governance program that leverages best practices and technology such as Innovative Driven’s proprietary IDentify File Analysis tool to locate and remove ROT can keep data minimized, enabling organizations to reduce costs and reduce the risk of exposing sensitive data. IDentify puts the power in your hands, allowing the data reviewer to determine what should be retained and what should be deleted.

Download the Full Guide: IDentifying and Eliminating Data Hoarding in Your Organization
