Covering Scientific & Technical AI | Sunday, December 1, 2024

Hybrid Cloud File Storage: Bringing File Data into the AI Age 


Today, there are more vendors and solutions than ever for IT infrastructure buyers to choose from. Despite the confusion this causes, decision makers are certain of two things:

  1. In the next five years, their infrastructure will resemble a hybrid cloud.
  2. They need to start building tools, processes and entire businesses powered by AI and machine learning before a digital-native competitor disrupts them.

Until recently, enterprise-ready file storage in the cloud simply didn't exist. Customers have been held hostage on-prem because their legacy file storage vendors lack the business models and technological know-how to build cloud file storage. (The good news is that a new generation of file storage is becoming available that embraces the best features of object storage while continuing to provide the speed and versatility of file [more on this below].) But as things generally stand, legacy vendors have tricked the market into thinking AI/ML is only about ultra-high-performance model training use cases, where massive amounts of data need to be processed by a converged solution that combines compute, storage and GPUs in a single product.

While these use cases certainly exist, only a small minority of AI/ML practitioners, such as data scientists and engineers, need to worry about them. Instead, most people building and training models do so in the public cloud with object storage where they can take advantage of the following attributes:

  • Global scale and distribution

Organizations are global by default now, and their data lives all around the world. Cloud-based object stores like Amazon S3 make it trivially easy for a company's business units to access and manage data from a single, easy-to-use interface. Data scientists love this because S3 acts as a single source of truth.

  • Custom metadata search

Object stores are also well suited to building AI/ML-powered tools and businesses because objects can be tagged with custom, searchable metadata. A data scientist can "ask" object storage a question and get an answer back almost instantly, as if it were a database.

  • World class SDKs and APIs

Querying an object store for custom metadata isn't done through a file protocol like SMB or NFS. Instead, most object stores have implemented some version of AWS's S3 protocol. The S3 API and SDKs make it trivially easy for applications to securely store machine-generated data in object stores.
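To make the metadata idea concrete, here is a minimal Python sketch of tagging objects with custom metadata and then filtering on it. It is an in-memory stand-in, not a real S3 client: the `InMemoryObjectStore` class, its `find_by_metadata` method, and the keys and tags are all hypothetical names chosen for illustration. With a real SDK such as boto3, the tagging step would correspond to passing a `Metadata` dictionary to `put_object`; how the search itself is served varies by product, so a simple linear scan stands in for it here.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class StoredObject:
    key: str
    body: bytes
    metadata: Dict[str, str] = field(default_factory=dict)


class InMemoryObjectStore:
    """Toy stand-in for an object store (hypothetical, not a real S3 client)."""

    def __init__(self) -> None:
        self._objects: Dict[str, StoredObject] = {}

    def put_object(
        self, key: str, body: bytes, metadata: Optional[Dict[str, str]] = None
    ) -> None:
        # Store the payload together with its custom key/value metadata.
        self._objects[key] = StoredObject(key, body, dict(metadata or {}))

    def find_by_metadata(self, **criteria: str) -> List[str]:
        # Return the sorted keys of objects whose metadata matches every criterion.
        return sorted(
            obj.key
            for obj in self._objects.values()
            if all(obj.metadata.get(k) == v for k, v in criteria.items())
        )


store = InMemoryObjectStore()
store.put_object("claims/2019/0001.json", b"{}", {"doc_type": "claim", "region": "emea"})
store.put_object("claims/2019/0002.json", b"{}", {"doc_type": "claim", "region": "apac"})
store.put_object("payouts/2019/0001.json", b"{}", {"doc_type": "payout", "region": "emea"})

# "Ask" the store a question: which claims came from EMEA?
emea_claims = store.find_by_metadata(doc_type="claim", region="emea")
print(emea_claims)
```

The point of the sketch is that the question is answered from the tags alone, without opening any object body or knowing which directory a file happens to live in.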

For all of these reasons, data scientists have viewed traditional file storage as the albatross around their neck, preventing their organization from moving fully into the cloud. To them, file storage is a black box, which cannot be quickly queried and hides important information.

Take, for example, an insurance company looking to use AI to help underwrite insurance claims. To do this, a data scientist would first have to find all the historical claims data as well as the payout data for each claim. In an ideal world, a data scientist at an insurance company could easily find the claims data stored on a NAS and start comparing it with the payout data stored on another NAS in a different data center. But all too often what should be a one-hour task takes weeks. Data scientists have difficulties finding where data lives, and then even more difficulties getting permission to access it once its location is found. Furthermore, once the data is finally obtained, the data scientist might spend additional days moving the data to the cloud where it’ll be processed and analyzed using a public cloud vendor’s tightly integrated AI/ML toolkit.

This needs to change. Organizations committed to innovating through AI-powered tools, products, and businesses need file storage which benefits from and extends the innovative features of object storage while continuing to provide the speed and versatility of file storage. To the practitioners building these new AI-powered experiences, file storage needs to have the following attributes:

  • Runs in the cloud

The success of GitLab, InVision, and other remote-only unicorn companies signals a shift in how work gets done, and it's only a matter of time before remote-only work goes mainstream in corporate America. When organizations no longer own massive physical buildings, they won't own massive physical data centers either. Instead, all their compute and storage will reside in the public cloud. With most of the data these globally distributed teams generate being files, enterprise-ready file storage in the public cloud becomes critical to the success of these new organizations.

By putting all file data into the public cloud, an organization solves its data silo problems for the data scientists and engineers building new AI-powered experiences. Instead of on-prem data silos scattered around the world, data scientists need an easy-to-use interface that shows them where all the organization's file data lives. File storage running in the public cloud gives them exactly that.

  • Full S3 protocol support - including custom metadata

The reality is that the S3 protocol has become a new standard and should be viewed like SMB or NFS. Multiple object stores support it, and as web applications become more popular than legacy OS-native ones (à la Google Docs vs. Office), the amount of object data will only grow. Next-generation file storage not only needs to support S3 as a first-class protocol, but also the custom metadata features it provides developers. With full S3 protocol support, cloud-based file storage can serve all unstructured data workloads, whether they're file- or object-based. The result is more data living in the same storage for data scientists, and better, more accurate models.

  • Fully programmable

With infrastructure and workloads being lifted and shifted to the cloud, physical interaction with hardware is being minimized. Instead, infrastructure is managed and configured via terminals and scripts. Next-generation file storage needs to embrace the move from on-prem into the cloud by building API-first experiences and integrating with cloud management platforms.
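As a rough illustration of what "fully programmable" can mean in practice, the Python sketch below compares the file shares a script wants to exist against the ones that currently exist and works out the steps needed to reconcile them. Everything in it is an assumption made for illustration: the `plan_changes` function, the share names, and the `protocols` and `quota_gb` settings do not come from any particular vendor's API. A real tool would pass each resulting step to the storage system's management API.

```python
from typing import Dict, List, Tuple

# Hypothetical share settings, keyed by share name.
ShareConfig = Dict[str, dict]


def plan_changes(current: ShareConfig, desired: ShareConfig) -> List[Tuple[str, str]]:
    """Return (action, share_name) steps that move `current` to `desired`."""
    steps: List[Tuple[str, str]] = []
    for name in sorted(desired):
        if name not in current:
            steps.append(("create", name))
        elif current[name] != desired[name]:
            steps.append(("update", name))
    for name in sorted(current):
        if name not in desired:
            steps.append(("delete", name))
    return steps


# What exists today (hypothetical values).
current = {
    "claims": {"protocols": ["nfs"], "quota_gb": 500},
    "scratch": {"protocols": ["smb"], "quota_gb": 100},
}

# What the script declares should exist (hypothetical values).
desired = {
    "claims": {"protocols": ["nfs", "s3"], "quota_gb": 500},
    "payouts": {"protocols": ["nfs", "s3"], "quota_gb": 250},
}

steps = plan_changes(current, desired)
print(steps)
# [('update', 'claims'), ('create', 'payouts'), ('delete', 'scratch')]
```

Because the plan is computed from declared state rather than typed by hand, running it a second time against an already-reconciled system produces no steps, which is what makes this style of management safe to automate.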

File storage companies have an obligation to help their customers navigate hybrid complexity. Too many vendors don't have the business models or products to reduce complexity for the overwhelmed storage buyer. It's time for the industry to embrace this once-in-a-generation change, to stop holding customers hostage and start cheering them on in their journey to the cloud.

Grant Gumina is product development manager at Qumulo.

 

AIwire