DAX Data Wrangling: Deutsche Börse Lifts Data Scientists’ ‘80% Problem’
At the German Stock Exchange – aka Deutsche Börse, home of the DAX – data scientists are being relieved of “janitorial” tasks that can consume up to 80 percent of their time and keep them from focusing on work worthy of their hard-to-hire, expensive skills. We’re talking about data prep: the niggling mindlessness, the funereal procession of yawn-enforcing* chores required to get data ready for analytics and machine learning workloads.
“We faced the same issue as a lot of organizations: we wanted to exploit big data, but doing it is difficult,” said Konrad Sippel, head of the Deutsche Börse’s Content Lab, an R&D center for big data and analytics, AI and machine learning. The team of about 20 data scientists and data engineers develops prototypes and proofs of concept designed to help the organization wring more value from its data.
“The big issue we’ve had is getting hold of all those great data sets and getting them into a place where data scientists can actually work with them in any form or shape,” Sippel told EnterpriseTech. “Even though we’re a very digital firm in the way we run digital trading places, all the data from the different areas is in very different places, it’s used in different formats, it’s sometimes dirty, some of it isn’t in the greatest shape.”
The Content Lab decided to look at data wrangling software, which holds the promise of automating data prep. To test it, they began by hiring a consultant for what must have been a data prep project from hell: cleaning up a large, complex, multi-format, multi-source data set. It took the poor data scientist nine months to complete the job, and the result became the benchmark for measuring how efficiently and thoroughly Trifacta’s data wrangling software could structure, rationalize, combine and enrich the same raw data.
According to Sippel, Trifacta got it done in three weeks.
The results convinced Deutsche Börse not only to purchase the software but also, later, to invest in the company. “Prior to using Trifacta, the data scientists spent a large part of their time not doing data science, doing nothing exciting at all,” Sippel said. “They were reformatting Excel tables, uploading and gathering data.”
A stock exchange, or “market infrastructure” as Deutsche Börse refers to itself, generates enormous amounts of data. Sippel said a single day of trading in a single instrument can generate 500 million lines of data, “and we look at this over many hundreds of instruments and many thousands of days, so it gets very big.” He said his group works with roughly 1 to 1.5 PB of data at any given time.
“We get contacted by people in the firm who need to solve data science problems,” Sippel said. “A business owner will say they’ve got some trading data and they’re trying to gain some insight into how a customer is behaving, or the behavior of customers group-wide.”
Naturally, his group is under pressure to turn around results ASAP – he said the Content Lab tries to complete projects within two to three months. They also want to reassure their internal customers, called “use case owners,” that their projects and their data are well handled. One advantage of Trifacta, Sippel said, is its ability to quickly ingest data and produce an initial analysis.
Using Cloudera Data Science Workbench on top of the company’s Hadoop cluster, the team loads the data with Trifacta, which generates a statistical evaluation of the data set. “It shows how the data is distributed,” Sippel said. “It shows some of the peaks and some of the oddities, and we use that to get back to the use case owner right away. It tells them something about their data they didn’t know within the first 24 hours of having access to it. This typically impresses people quite a lot; it gives them the feeling their data is in good hands and that we’re doing something useful with it.”
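Trifacta generates that profile itself, but the flavor of the first-day report is easy to sketch. The short Python example below is purely illustrative – the file name and columns are invented, and it stands in for Trifacta’s profiling rather than reproducing it – showing the kind of distributions, missing-value counts and activity peaks such an evaluation surfaces:

import pandas as pd

# Illustrative stand-in for an automated profile; the file and column names
# are hypothetical, not Deutsche Börse's actual schema.
trades = pd.read_csv("trades_sample.csv", parse_dates=["trade_time"])

# Per-column summary: counts, min/max and quartiles for numeric fields,
# cardinality for coded or categorical fields.
print(trades.describe(include="all").transpose())

# Oddities worth flagging to the use case owner: columns with missing values.
null_share = trades.isna().mean().sort_values(ascending=False)
print(null_share[null_share > 0])

# Peaks in activity, e.g. number of trades per hour of the day.
print(trades["trade_time"].dt.hour.value_counts().sort_index())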
From there, projects can become highly complex, particularly those involving different data sources from different areas of the company in different formats.
“In those cases we use Trifacta a bit more in the research process as well, to prepare the data set to later work on it,” he said. To accelerate analysis, the Content Lab uses Impala for bigger database queries, which requires “putting all the data into one big table, and construction of those big tables is something we do in Trifacta.”
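The wide table itself is assembled in Trifacta, but the querying pattern is straightforward to illustrate. The sketch below is hypothetical – the host, database and column names are invented – and uses the impyla client to run an analytical query against such a denormalized table in Impala:

from impala.dbapi import connect  # impyla client for Impala

# Hypothetical connection details; at Deutsche Börse the wide table is built
# in Trifacta, and Impala handles the heavier analytical queries against it.
conn = connect(host="impala.example.internal", port=21050)
cur = conn.cursor()

# One scan over the "one big table" instead of re-joining raw, coded sources.
cur.execute("""
    SELECT instrument_name,
           COUNT(*)      AS trade_count,
           SUM(quantity) AS total_volume
    FROM   analytics.trades_wide
    GROUP  BY instrument_name
    ORDER  BY total_volume DESC
    LIMIT  10
""")

for row in cur.fetchall():
    print(row)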
Then Trifacta moves on to more heavy-duty wrangling, making “transformative changes” to the data.
“Depending on the nature of the data set and what needs to be done, we’ll use Trifacta to bring together data from two different data sources,” he said. “It can be simple things like different date formats being used when you’re doing the join over the date. When you do it in Trifacta, those issues get harmonized out without the scientists having to do any transformation. It seems to happen almost automatically.”
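A minimal sketch of that date problem, with two invented extracts that record the same dates in different formats; normalizing both sides first is what makes the join line up (Trifacta handles this inside its own tooling – pandas is used here only to make the idea concrete):

import pandas as pd

# Two hypothetical extracts storing the trade date in different formats,
# which would otherwise prevent a clean join on that column.
trades = pd.DataFrame({"trade_date": ["2018-03-01", "2018-03-02"],
                       "volume": [120, 95]})
reference = pd.DataFrame({"trade_date": ["01.03.2018", "02.03.2018"],
                          "closing_level": [100.4, 99.8]})

# Harmonize both sides to proper datetimes before joining.
trades["trade_date"] = pd.to_datetime(trades["trade_date"], format="%Y-%m-%d")
reference["trade_date"] = pd.to_datetime(reference["trade_date"], format="%d.%m.%Y")

merged = trades.merge(reference, on="trade_date", how="inner")
print(merged)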
He said data preparation at Deutsche Börse is notably complex due to the coded nature of the organization’s data.
“You have data about shares traded on the exchange, and quotes that member firms have submitted to buy and sell those shares,” said Sippel. “It’s a very structured data set, typically with a large amount of data and…it’s heavily normalized in order to be saved in our data warehouses.” This means “pretty much everything is a code, including the time of the trade, what type of trade it was, where it was transacted. So the information to interpret the tables lies in hundreds of other tables elsewhere, master data tables, that we need to bring in to make the data set legible again.”
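To picture what that decoding looks like, here is a deliberately toy example – all codes, names and tables are invented – in which a coded trade record only becomes legible after joining in the relevant master data tables:

import pandas as pd

# A heavily coded trade record: every field of interest is a key into
# some master data table held elsewhere. All values here are invented.
trades = pd.DataFrame({"instr_code": [101, 102],
                       "trade_type_code": [1, 2],
                       "venue_code": [10, 10],
                       "price": [99.5, 101.2]})

instrument_master = pd.DataFrame({"instr_code": [101, 102],
                                  "instrument": ["Share A", "Share B"]})
trade_type_master = pd.DataFrame({"trade_type_code": [1, 2],
                                  "trade_type": ["auction", "continuous trading"]})
venue_master = pd.DataFrame({"venue_code": [10], "venue": ["Venue A"]})

# Each join swaps an opaque code for something a human (or a model) can read.
legible = (trades
           .merge(instrument_master, on="instr_code")
           .merge(trade_type_master, on="trade_type_code")
           .merge(venue_master, on="venue_code"))
print(legible)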
This complexity is heightened when the Content Lab combines internal data with data from external sources, such as Thomson Reuters or Bloomberg, which use entirely different codes and data formats.
“When we try to learn some things about the analysis of our own data combined with external data, that’s when we need to bring them together, and that’s a pretty good use case for Trifacta,” he said.
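Bringing the two worlds together usually hinges on some identifier both sides recognize. The final sketch is again hypothetical – the identifiers, column names and vendor fields are invented, not the actual Thomson Reuters or Bloomberg schemas – and shows an internal feed being mapped onto an external one via a simple crosswalk table:

import pandas as pd

# Invented internal data keyed by an in-house instrument code.
internal = pd.DataFrame({"instr_code": [101, 102],
                         "daily_volume": [1_200_000, 450_000]})

# Invented crosswalk from the in-house code to a shared identifier.
crosswalk = pd.DataFrame({"instr_code": [101, 102],
                          "isin": ["DE000EXAMPL1", "DE000EXAMPL2"]})

# Invented external vendor data keyed by that shared identifier.
vendor = pd.DataFrame({"isin": ["DE000EXAMPL1", "DE000EXAMPL2"],
                       "vendor_sector": ["Industrials", "Financials"]})

combined = internal.merge(crosswalk, on="instr_code").merge(vendor, on="isin")
print(combined)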
*See Lucky Jim, by Kingsley Amis.