Covering Scientific & Technical AI | Sunday, January 19, 2025

GreenBytes Cuts Data Down to Size 

We've all heard the forecasts. IDC estimates that the digital universe will reach 40 zettabytes by 2020. According to IBM, 90 percent of all the data in the world has been generated in the last two years. Yet much of this data is in duplicate form. A 2010 IDC study found that nearly 75 percent of stored enterprise data is a copy.

As data growth continues its exponential climb, companies are looking for ways of reducing the storage burden. Why not start by eliminating the duplicates? That's the message of GreenBytes, the Providence, RI-based company that specializes in energy-efficient inline deduplication.

GreenBytes is on a mission to reduce the amount of data in the world. Founded in 2007, the company has since developed and patented its unique approach to eliminating duplicate data from a file system.

"What we enable companies to do is to dramatically reduce the amount of data they store without actually reducing the amount of data they store," says Steve O'Donnell, Chairman and CEO.

GreenBytes' patented smart software analyses storage blocks as they go from the computer to the storage device. In one machine cycle, the software determines whether the block is already stored or not. If it is stored, it does not get replicated. If it's not stored, it gets written to disk.

The technology relies on metadata and pointers. As customers store data to the storage devices, GreenBytes' software looks for duplicate data in real time. If there are five copies, what gets stored is one copy with five pointers to the original. This has a dramatic effect on the amount of storage that is ultimately consumed, notes O'Donnell.
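
The pointer scheme can be illustrated with a minimal Python sketch. This is not GreenBytes' patented implementation, whose internals the article does not describe; the class name `InlineDedupStore`, the use of SHA-256 fingerprints, and the tiny 4-byte block size are all assumptions chosen for illustration.

```python
import hashlib


class InlineDedupStore:
    """Toy block store: each unique block is kept once, files hold pointers.

    Hypothetical illustration only; names and hashing choice are assumptions.
    """

    def __init__(self, block_size=4096):
        self.block_size = block_size
        self.blocks = {}        # fingerprint -> the single stored copy of a block
        self.files = {}         # file name -> list of fingerprints (the pointers)
        self.logical_bytes = 0  # bytes the clients believe they have written

    def write(self, name, data):
        """Split data into blocks and store only blocks not seen before."""
        pointers = []
        for start in range(0, len(data), self.block_size):
            block = data[start:start + self.block_size]
            fingerprint = hashlib.sha256(block).hexdigest()
            if fingerprint not in self.blocks:
                # New content: this is the only case that consumes storage.
                self.blocks[fingerprint] = block
            pointers.append(fingerprint)
        self.files[name] = pointers
        self.logical_bytes += len(data)

    def read(self, name):
        """Reassemble a file by following its pointers."""
        return b"".join(self.blocks[fp] for fp in self.files[name])

    def physical_bytes(self):
        """Bytes actually held after deduplication."""
        return sum(len(block) for block in self.blocks.values())

    def ratio(self):
        """Logical-to-physical reduction ratio (assumes something was written)."""
        return self.logical_bytes / self.physical_bytes()


# Five identical copies are written, but only one set of blocks is kept.
store = InlineDedupStore(block_size=4)
payload = b"AAAABBBBCCCC"
for n in range(5):
    store.write(f"copy{n}", payload)

print(store.logical_bytes, store.physical_bytes(), store.ratio())
```

In this toy run the five copies amount to 60 logical bytes but occupy 12 physical bytes, a 5 to 1 reduction, and every copy can still be read back in full. The check happens at write time, before a block is stored, which is what distinguishes an inline approach from one that cleans up duplicates afterward.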

For some kinds of applications, a 50 to 1 reduction is possible, says O'Donnell – only 2 percent of the total data would be stored. Even with use cases of 4 to 1 or 5 to 1, the amount of storage required is reduced significantly.

Most competitors use post-processing, notes O'Donnell. The data is first stored by the storage controller, and a background program then searches for duplicates. This approach uses a lot of electricity and compute power. The GreenBytes method is real-time; deduplication is performed before the data lands on the storage device.

The second focus of GreenBytes is moving away from magnetic media, i.e., spinning disks, which consume large amounts of electricity, toward solid state drives (SSDs). A typical disk might draw as much as 18 watts. Over a five-year lifetime of continuous operation (18 watts * 24 hours * 365 days * 5 years), that comes out to 788,400 watt-hours, or roughly 788 kilowatt-hours – a hefty power draw.
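
For readers who want to check the arithmetic, here is a short Python sketch of the lifetime energy of a drive that draws a steady 18 watts around the clock. The function name `lifetime_energy_kwh` is hypothetical, continuous operation is an assumption, and the 1.8-watt SSD figure is simply one tenth of the disk figure, used as an illustrative upper bound rather than a measured value.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365


def lifetime_energy_kwh(power_watts, years):
    """Energy in kilowatt-hours for a device drawing constant power.

    Energy (Wh) = power (W) * time (h); dividing by 1000 gives kWh.
    """
    hours = years * DAYS_PER_YEAR * HOURS_PER_DAY
    return power_watts * hours / 1000


disk_kwh = lifetime_energy_kwh(18, 5)   # spinning disk, continuous operation
ssd_kwh = lifetime_energy_kwh(1.8, 5)   # assumed: one tenth of the disk draw

print(f"disk: {disk_kwh:.1f} kWh, ssd: {ssd_kwh:.1f} kWh")
```

The key point is that watts measure power, not energy, so the hours of operation have to enter the calculation before the result can be stated in watt-hours.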

SSDs use significantly less power, less than a tenth that of magnetic media, reducing the overall power footprint of the storage system by a considerable measure.

Flash memory is desirable for being much faster and more energy-efficient than spinning disk, but those advantages come with a substantial price premium. At enterprise scale, when terabytes or petabytes of data are involved, the cost differential can be extreme. O'Donnell explains that the GreenBytes software changes these economics by reducing the amount of storage that is required by up to an order of magnitude. He claims that it's quite possible to use less energy and achieve better performance at a cost that is equal to or less than spinning disk without deduplication.

O'Donnell explains that GreenBytes' patented zero-latency inline deduplication technology works especially well with virtual desktop computing, which is a counter-measure of sorts to the bring-your-own-device (BYOD) boom. Desktop virtualization creates a unified and secure desktop experience over multiple platforms. The drawback is added complexity – instead of only having to manage a limited set of internal PCs, the CIO now has to orchestrate a large-scale virtual infrastructure. The amount of storage that is required to support all these devices is often more than a company is prepared to spend. A business of about 1,000 employees with normal PC workloads requires somewhere on the order of 40 terabytes of storage, which is actually a modest estimate: it assumes the 1,000 users will average 40 GB apiece, while most laptops today come with 500 GB.

O'Donnell compares the typical PC data footprint to a human DNA profile. From person to person, human DNA has large overlaps, areas that are exactly the same. The segments where DNA strands diverge to express eye color, hair color, etc., are only a small percentage of the overall dataset. A pool of PC users will generate similarly overlapping data sets. O'Donnell claims that deduplication in the average enterprise can reduce the data load by 98 percent. Now the CIO only needs two percent of the original storage allotment, bringing that original 40 TB requirement down to less than a TB. This means huge cost savings and huge energy savings.
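
The figures quoted here can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not a sizing tool: the function names are hypothetical, and it assumes decimal units (1 TB = 1,000 GB) and a uniform reduction across all users.

```python
def stored_fraction(dedup_ratio):
    """Fraction of the data physically stored at a given N-to-1 ratio."""
    return 1 / dedup_ratio


def physical_tb(users, gb_per_user, reduction_percent):
    """Physical storage in TB after removing reduction_percent of the data."""
    logical_gb = users * gb_per_user
    remaining_gb = logical_gb * (100 - reduction_percent) / 100
    return remaining_gb / 1000  # decimal terabytes


# 1,000 users at 40 GB apiece, before and after a 98 percent reduction.
before_tb = physical_tb(1000, 40, 0)
after_tb = physical_tb(1000, 40, 98)

print(f"50:1 stores {stored_fraction(50):.0%} of the data")
print(f"{before_tb:.0f} TB before, {after_tb:.1f} TB after")
```

A 50 to 1 ratio and a 98 percent reduction are two ways of stating the same thing, while a more modest 4 to 1 or 5 to 1 ratio still leaves only 25 or 20 percent of the data on disk.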

While server consolidation via virtualization and other best practices has dramatically reduced the energy footprint of datacenters, data deduplication holds the same potential for the storage space. It's exactly the same kind of technology, says O'Donnell. Considering that the data generated by the Internet and corporate datacenters is growing at 40 percent a year, any technology that is capable of minimizing or even reversing this trend deserves serious consideration.

The benefits to customers are straightforward, says O'Donnell. Customers are able to manage the dramatic increase in demand for storage by turning data trends around to contain costs. Furthermore, the combination of solid-state storage and deduplication is becoming less expensive, particularly in the virtual desktop area.

Is there an area where deduplication doesn't make sense?

While data that inherently contains no duplicates will not see any benefit, the use cases that support deduplication abound. Relational databases, files that are stored on user home drives (such as documents, spreadsheets and PowerPoint presentations), email and virtual desktops – these are all good candidates for deduplication, according to the GreenBytes CEO. Even big data applications, such as information coming from various sensors and traffic monitoring systems, can be deduplicated.

"Frankly the world is full of duplicate data," O'Donnell remarks. "There are examples where it doesn't work, but in the vast majority of cases, we'll find it and we'll deduplicate it."

AIwire