Covering Scientific & Technical AI | Monday, December 2, 2024

Data Wrangling ‘Decoder Ring’ Homogenizes Polyglot Data Lakes 

As Hadoop data lakes grow – fed by multiple data sources in multiple data formats – the task of pulling insight from the lake becomes increasingly challenging. In fact, the tedious and time-consuming chore of cleansing, distilling, blending and standardizing disparate data formats – called data preparation or data wrangling – sometimes requires the expertise of a data scientist long before the data ever gets to the data analyst.

Solving this problem is the mission of Trifacta, a San Francisco-based company that announced $35 million in growth-stage financing, bringing the company’s total amount raised to more than $76 million. The company reports rapid market uptake during 2015, with users at more than 3,000 companies and a sales jump of more than 700 percent as it competes with other data wrangling vendors, such as Paxata and RapidMiner.

Adam Wilson of Trifacta

The company’s Wrangler product line, which consists of a free sampler product and an enterprise version, ranked first in a study of data preparation products last year by Dresner Advisory Services. It was built for today’s polyglot world of big data analytics. “How do we take really big, messy, complicated data sets coming at us at scale in ever increasing variety of formats?” Trifacta CEO Adam Wilson, who calls Wrangler a data “decoder ring,” told EnterpriseTech. Developed over the past three years, Wrangler is built from proprietary Java and C++ code and applies machine learning techniques to improve the consistency, conformity and completeness of data. The resulting transformations are then compiled down to MapReduce and Spark for execution across Hadoop cluster environments.

Wilson said the objective is to take data wrangling out of the hands of data scientists “so that business analysts who understand the data best can do a lot of the work in structuring, shaping and cleansing the information.”

A typical use case, Wilson said, is Royal Bank of Scotland, which is building a data lake that lets RBS look at data across different financial products and different customer touch points. Every RBS financial product has its own separate repository with its own data structures and data formats. RBS also captures customer interactions that take place over chat, voice, email, social media and other channels. Using Wrangler, all this data is combined to provide RBS with a holistic view of individual customers.

“Their ability to understand and interact with their customers is a much more complex problem than it used to be when you went into the branch office of your bank and sat down with a bank officer,” Wilson said. “So they’re using data to provide that level of intimacy.”

RBS can view the entirety of a customer’s activities with the bank, from semi-structured voice-to-text transcripts to information about the customer’s mortgage, checking and savings accounts. The result: preparation time for conferences with customers has been cut by a factor of 15.

“They were spending more time analyzing what was happening with their clients as opposed to trying to knit the data together and make sure it gave you something clean and consistent that you could actually operate on,” Wilson said.

Another customer: Pfizer. Wilson said the pharmaceutical company outsources much of its clinical trial work to third parties, and high volumes of data come back in a variety of formats. Pfizer uses Wrangler to rationalize the data so that it can do its internal analysis on the efficacy of the trials and comply with FDA regulatory reporting requirements.

Zurich Insurance Group uses Wrangler to form pools of data to construct optimized risk data models incorporating multiple data feeds, using their algorithms to tune risk profiles and better understand how to price their insurance products. “We help with the onboarding of new data sets and the wrangling of existing data sets that are there,” Wilson said.

Finally, there is GoPro, the maker of action cameras whose device sensors generate high volumes of telemetry data, such as velocity, temperature, barometric pressure and location. Wilson said that as GoPro evolves from a camera company into a media company as well, it is using Wrangler to combine telemetry and CRM data with social feed commentary generated by viewers of GoPro videos. “If you’re a snowboarder, GoPro will make sure you’re offered the latest and greatest snowboarding videos created by the GoPro user base.”

“In each case, volumes tend to be huge, the data tends to be a combination of structured and unstructured, and in all examples they are using Hadoop under the covers to do the storage and the processing,” Wilson said. “Trifacta provides the data wrangling that helps them take very raw data and turn it into something that is highly refined and ready to feed either a downstream algorithm or feed a statistical package, standard BI and reporting solutions or other types of analytics.”

Wilson said Wrangler is composed of three core components:

Interactive Profiling: allows analysts to understand the shape and size of the data in a data lake, the inherent structure of the information and the level of data quality. “Data analysts tell us that until they get their eyes on data and see what they’re working with across multiple datasets, they’re not sure what questions they have or how best to wrangle the information. But once they use profiling to understand the outliers in the data sets, the holes, understand the groupings, that allows them in an agile way to understand what’s there.”

Predictive Transformation: based on the user’s interaction with the data, the system recommends fixes for consistency problems and empowers the user to change the shape of the data and to blend it with other data sources.

Intelligent Execution: enables all the rules and steps the user has defined to clean up the data to be compiled down to MapReduce or Spark and executed across the entire cluster, leveraging the power of the computing environment to do the work of transforming, cleansing and standardizing the data.
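
To make the three components more concrete, here is a minimal Python sketch of the general idea, not Trifacta’s implementation. It profiles a small sample for holes, groupings and outliers, then replays a user-defined cleanup step over every row. The sample rows, field names, the `profile_column`, `normalize_channel` and `apply_recipe` helpers and the outlier threshold are all assumptions invented for illustration. In a product such as Wrangler, the steps would be compiled to MapReduce or Spark rather than run in a local loop.

```python
from collections import Counter
from statistics import median

# Hypothetical sample rows; the field names are invented for illustration.
SAMPLE = [
    {"customer_id": "C1", "channel": "chat", "balance": "1200.50"},
    {"customer_id": "C2", "channel": "Email ", "balance": "980"},
    {"customer_id": "C3", "channel": "voice", "balance": ""},
    {"customer_id": "C4", "channel": "email", "balance": "1010.25"},
    {"customer_id": "C5", "channel": "CHAT", "balance": "250000"},
]


def profile_column(rows, column):
    """Summarize one column: missing values ("holes"), groupings, outliers."""
    values = [row.get(column, "") for row in rows]
    present = [v for v in values if v.strip()]
    summary = {
        "missing": len(values) - len(present),
        "groups": Counter(present),
        "outliers": [],
    }
    numbers = []
    for v in present:
        try:
            numbers.append(float(v))
        except ValueError:
            return summary  # non-numeric column: skip the outlier check
    if numbers:
        mid = median(numbers)
        # Median absolute deviation: a robust spread measure for small samples.
        mad = median(abs(n - mid) for n in numbers)
        if mad > 0:
            # The factor of 5 is an arbitrary illustrative threshold.
            summary["outliers"] = [n for n in numbers if abs(n - mid) > 5 * mad]
    return summary


def normalize_channel(row):
    """One wrangling step: trim whitespace and lowercase the channel label."""
    return {**row, "channel": row["channel"].strip().lower()}


def apply_recipe(rows, steps):
    """Replay the recorded steps, in order, over every row."""
    for step in steps:
        rows = [step(row) for row in rows]
    return rows


if __name__ == "__main__":
    print(profile_column(SAMPLE, "balance"))
    cleaned = apply_recipe(SAMPLE, [normalize_channel])
    print(profile_column(cleaned, "channel")["groups"])
```

Profiling the raw `channel` column shows five distinct labels. After the normalization step is replayed, they collapse into three groups, which is the kind of consistency fix a recommendation engine might suggest.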

Trifacta originated eight years ago when the company’s co-founders, Joe Hellerstein, Sean Kandel and Jeff Heer, computer science professors at UC Berkeley, Stanford and the University of Washington, respectively, worked together on data wrangling after realizing that data preparation consumed 80 percent of the time spent on analytics projects. They developed a prototype, called Stanford Data Wrangler, that within six months had tens of thousands of users.

“They realized they had struck a nerve,” Wilson said.

AIwire