
Dremio’s Dart Initiative Accelerates the Obsolescence of Cloud Data Warehouses 

June 3, 2021 -- Dremio, a leader in data lake transformation, today took a major step forward in obsoleting the cloud data warehouse. Today's release marks the first delivery in the company's Dart Initiative, which enables customers to run all mission-critical SQL workloads directly on the data lake.

Dremio is a service that sits between data lake storage and end users who want to query that data directly for high-performing dashboards and interactive analytics, without copying data into data warehouses or creating aggregation tables, extracts, cubes and other derivatives. Dremio drastically simplifies the data architecture, accelerates query performance, and enables data democratization without the vendor lock-in of cloud data warehouses.

While many of the world’s largest companies already use Dremio to power their mission-critical SQL workloads, Dremio embarked on the Dart Initiative to help companies run an even greater range of SQL workloads while improving performance over previous versions by over 2x and drastically reducing resource consumption.

"Enabling truly interactive query performance on cloud data lakes has been our mission from day one, but we’re always looking to push the boundaries and help our customers move faster. We launched the Dart Initiative to deliver just that," said Tomer Shiran, founder and chief product officer at Dremio. "Not only are we dramatically increasing speed and creating efficiencies, we’re also reducing costs for companies by eliminating the data warehouse tax without trade-offs between cost and performance."

Dremio’s Dart Initiative is a continuous-improvement, high-customer-impact approach that leapfrogs the cloud data warehouse. "Delivering warehouse-grade performance directly on data lakes is a key element in enabling companies to adopt an open data architecture," Shiran continued. "We’re also very excited about the innovations within the data tier itself. We have table formats like Apache Iceberg, which enable multiple engines to work together on the same data in a transactionally consistent manner within the data lake. We also have projects like Project Nessie, which brings Git-like semantics to the data lake, dramatically accelerating the agility of data engineering, data science and analytics."

The following sections detail some of the innovations of this initial Dart Initiative release:

Fast and Optimal Query Planning

Database engines can choose from a wide range of strategies to plan queries, and the ability to generate an optimal query plan in any given situation can make a significant impact on performance. In this first release of the Dart Initiative, Dremio now gathers deep statistics about the underlying data, which helps Dremio’s query optimizer choose the optimal execution path for any given query.

The Dart Initiative also introduces query plan caching, which eliminates planning overhead and latency for repeated queries. This is particularly impactful for BI dashboarding use cases, where many users are simultaneously firing similar queries against the SQL engine as they navigate through dashboards. In these scenarios, the planning phase of queries often consumes a large proportion of the total query runtime, so eliminating this repeated planning workload yields a significant improvement in application response time.
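
The idea behind plan caching can be illustrated with a minimal Python sketch. It is not Dremio's implementation: the PlanCache class, the toy_planner stand-in and the sample query are hypothetical names chosen for illustration, and the cache key is simply the trimmed SQL text, whereas a real engine would also have to account for things like schema changes and user context.

```python
from typing import Callable, Dict, Tuple

Plan = Tuple[str, str]  # hypothetical plan representation for this sketch


class PlanCache:
    """Caches plans by SQL text so repeated queries skip the planner."""

    def __init__(self, planner: Callable[[str], Plan]) -> None:
        self._planner = planner
        self._plans: Dict[str, Plan] = {}
        self.hits = 0
        self.misses = 0

    def plan(self, sql: str) -> Plan:
        key = sql.strip()  # assumption: identical text means identical plan
        if key in self._plans:
            self.hits += 1
            return self._plans[key]
        # Cache miss: run the (expensive) planner once and remember the result.
        self.misses += 1
        plan = self._planner(key)
        self._plans[key] = plan
        return plan


def toy_planner(sql: str) -> Plan:
    # Stand-in for a real optimizer, which would parse, validate and cost the query.
    return ("PLAN", sql)


cache = PlanCache(toy_planner)
dashboard_query = "SELECT region, SUM(sales) FROM orders GROUP BY region"

# Five dashboard users fire the same query; only the first one pays for planning.
for _ in range(5):
    cache.plan(dashboard_query)

print(f"planner runs: {cache.misses}, cache hits: {cache.hits}")
```

In this sketch the planner runs once and the remaining four requests are served from the cache, which is the effect the release describes for dashboards that issue the same queries repeatedly.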

Further, the Dart Initiative includes a high-performance compiler that enables much larger and more complex SQL statements with reduced resource requirements.

Comprehensive and ANSI-Standard SQL Coverage

The Dart Initiative empowers companies to run an even broader set of enterprise SQL workloads on Dremio by extending SQL coverage with additional functions, operators, and SQL grammar constructs, including additional window and aggregate functions, grouping sets, intersect, except/minus and more.
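
For readers unfamiliar with the set operators mentioned above, the following sketch shows what INTERSECT and EXCEPT return. It runs against SQLite through Python's standard sqlite3 module purely for convenience, not against Dremio; the table names and values are made up. Some SQL dialects spell EXCEPT as MINUS, and SQLite does not accept that spelling.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE web_customers (id INTEGER);
    CREATE TABLE store_customers (id INTEGER);
    INSERT INTO web_customers VALUES (1), (2), (3);
    INSERT INTO store_customers VALUES (2), (3), (4);
    """
)

# INTERSECT keeps only the rows that appear in both result sets.
both = [row[0] for row in conn.execute(
    "SELECT id FROM web_customers "
    "INTERSECT SELECT id FROM store_customers ORDER BY id"
)]

# EXCEPT keeps the rows of the first result set that are absent from the second.
web_only = [row[0] for row in conn.execute(
    "SELECT id FROM web_customers "
    "EXCEPT SELECT id FROM store_customers ORDER BY id"
)]

print(both)      # [2, 3]
print(web_only)  # [1]
```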

Faster Query Execution

Dremio is an in-memory engine powered by Apache Arrow, an open source columnar standard for in-memory computing that was co-created by Dremio. Gandiva, a component of Arrow, is an LLVM-based toolkit that enables vectorized execution directly on in-memory Arrow buffers, by generating code to evaluate SQL expressions that fully leverages the pipelining and SIMD capabilities of modern CPUs. The Dart Initiative provides a significant boost in performance of end-user queries with complex expressions by greatly extending Gandiva coverage to nearly all SQL functions, operators, and casts.

The Dart Initiative also reduces the I/O required to run a query. Services like Amazon S3 and Azure Data Lake Storage (ADLS) make it extremely simple and cheap for companies to store their corporate data, but companies are charged every time they read data from these services. Dremio estimates cloud storage read operations (e.g., via the S3 API) constitute up to 30% of query execution costs in some workloads; other sources estimate over 60%. With the Dart Initiative, Dremio reduces the amount of data read from cloud object storage through extensive enhancements in scan filter pushdown (now supporting multi-column pushdown into source reads, the ability to push filters across joins, and more).
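
One common way filter pushdown reduces reads is by letting the scan skip files whose column statistics rule out any match. The sketch below illustrates that general technique under stated assumptions; it is not taken from Dremio's code. The DataFile class, the file names, the sizes and the per-column minimum/maximum statistics are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Range = Tuple[int, int]  # inclusive (low, high) bounds


@dataclass
class DataFile:
    name: str
    size_bytes: int
    # Per-column (min, max) values, as a file footer or manifest might record them.
    stats: Dict[str, Range]


def may_match(data_file: DataFile, filters: Dict[str, Range]) -> bool:
    """Return False only when the statistics prove that no row can match."""
    for column, (low, high) in filters.items():
        col_min, col_max = data_file.stats[column]
        if col_max < low or col_min > high:
            return False
    return True


def scan(files: List[DataFile], filters: Dict[str, Range]) -> Tuple[List[str], int]:
    """Apply the pushed-down filters before reading; report files and bytes read."""
    selected = [f for f in files if may_match(f, filters)]
    return [f.name for f in selected], sum(f.size_bytes for f in selected)


files = [
    DataFile("part-0", 100, {"year": (2018, 2019), "store_id": (1, 50)}),
    DataFile("part-1", 100, {"year": (2020, 2021), "store_id": (1, 50)}),
    DataFile("part-2", 100, {"year": (2020, 2021), "store_id": (51, 99)}),
]

# Equivalent to: WHERE year = 2021 AND store_id BETWEEN 60 AND 70
names, bytes_read = scan(files, {"year": (2021, 2021), "store_id": (60, 70)})
print(names, bytes_read)
```

With both column filters pushed into the scan, only part-2 is read (100 of 300 bytes in this toy data set). Pushing the year filter alone would still read part-1, which is why multi-column pushdown matters when every read from object storage is billed.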

Distributed and Real-Time Metadata Management

Through the Dart Initiative, Dremio now supports unlimited table sizes with an unlimited number of partitions and files, as well as near-instantaneous availability of new data and datasets as they are persisted on the lake. This is now possible with the introduction of manifest-based metadata and version management, supporting the largest datasets in enterprises with the most demanding data freshness SLAs.

Enhanced Acceleration Management

A key feature of the Dremio engine, which helps companies run mission-critical BI workloads directly on their cloud data lakes, is automated management of transparent query acceleration data structures (Data Reflections). With the Dart Initiative, Dremio greatly enhances the ability to support the orchestrated refresh of hundreds of these reflections within multi-tenant environments. Future Dart Initiative phases will continue to push acceleration management through improved refresh granularity, consistency across related reflections, and improved refresh monitoring and restartability.

The Dart Initiative aims to help companies run mission-critical SQL workloads faster and more efficiently than ever by optimizing every dimension of query execution in the Dremio engine. These enhancements complement Dremio's existing innovations, including Apache Arrow*, a columnar memory format and kernel with over 20M monthly downloads; C3, a columnar cloud cache; and Data Reflections, data structures that transparently accelerate common query patterns.

Today's release, which encompasses multiple facets of Dremio’s service and provides dramatic performance improvements of 2x over previous versions, is just the first phase in the Dart Initiative. Dremio will be delivering many more innovations this year both within the Dart Initiative and beyond as it moves closer to obsoleting the data warehouse.

*Originally Dremio’s internal memory format, Apache Arrow is now one of the most popular open-source projects with over 20M monthly downloads.

About Dremio

Dremio reimagines the data lake service to deliver faster time to analytics, eliminating the need for expensive proprietary systems. Dremio eliminates the need to copy and move data to proprietary data warehouses or create cubes, aggregation tables and BI extracts, providing flexibility and control for data architects, and self-service for data consumers. Founded in 2015, Dremio is headquartered in Santa Clara, CA. Investors include Cisco Investments, Insight Partners, Lightspeed Venture Partners, Norwest Venture Partners, Redpoint Ventures, and Sapphire Ventures. For more information, visit www.dremio.com.


Source: Dremio

AIwire