
Databricks is Taking the Ultimate Risk of Building ‘USB for AI’ 

Databricks envisions bringing both Delta Lake and Iceberg formats closer in the future to a point where their differences won’t matter.


Databricks acquiring Tabular was the talk of the Bay Area at the recent Data + AI Summit.

Whether by coincidence or not, the announcement was made during Snowflake’s Data Cloud Summit, held last week.

Databricks chief Ali Ghodsi certainly has some answers. Or does he?

“Now, at Databricks, we have employees from both of these projects, Delta and Iceberg. We really want to double down on ensuring that Delta Lake UniForm has full 100% compatibility and interoperability for both of those,” Ghodsi said, admitting they don’t understand all the intricacies of the Iceberg format, but the original creators of Apache Iceberg do. 

Talking about Databricks’ mission to democratise data and AI, Ghodsi opened his keynote by saying that every company today wants GenAI, but at the same time everyone is worried about the security and privacy of their highly fragmented data estate.

He pointed out that every company’s data estate is spread across several data warehouses, with data siloed everywhere. This adds significant complexity and cost, and ultimately locks companies into proprietary silos.

Databricks’ Delta Lake Project (+ Apache Iceberg) to the Rescue!

With a focus on addressing these issues, Databricks announced the open-source Delta Lake Project a few years back. 

Ghodsi explained that the idea was to let users own their data and store it in data lakes, where any vendor can plug its data platform into that data, allowing users to decide which platform suits them best. This removes lock-in, reduces costs, and opens up many more use cases by giving users the choice of different engines for different purposes.

“This was our vision and we almost succeeded but unfortunately there are now two camps. At Databricks we have Delta Lake, but a lot of other vendors are using this other format called Apache Iceberg,” said Ghodsi.

Delta Lake and Apache Iceberg emerged as the two leading open-source standards for data lakehouse formats. Despite sharing similar goals and designs, they became incompatible due to their independent development.

Over time, various open-source and proprietary engines adopted these formats. However, they typically adopted only one of the standards, and frequently, only aspects of it. This selective adoption effectively fragmented and siloed enterprise data, undermining the value of the lakehouse architecture.

Now, with the Tabular acquisition, Databricks intends to work closely with the Iceberg and Delta Lake communities to bring interoperability to the formats themselves, as Ghodsi highlighted.

Tabular, Inc., a data management company, was founded by Ryan Blue, Daniel Weeks, and Jason Reid. Blue and Weeks had developed the Iceberg project at Netflix and donated it to the Apache Software Foundation.

As the largest contributor to Iceberg, Tabular is seen as the company driving the project, playing a key role in advancing Iceberg across data management frameworks.

“I’ve known Ryan for a long time. We worked closely with him when he was back at Netflix, and some of the team members were working with him even before that when he was at Cloudera. So it’s been a very close collaboration,” Ghodsi said. 

Databricks’ UniForm, now generally available, offers interoperability among Delta Lake, Iceberg, and Hudi. It supports the Iceberg REST catalogue interface so that companies can use their existing analytics engines and tools across all their data.
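
To make the mechanics concrete, here is a minimal sketch of how UniForm is typically switched on when a Delta table is created, so that Iceberg-aware clients can read the same data. It assumes a PySpark session on a platform that supports UniForm; the catalogue, schema, and column names (sales.orders, order_id and so on) are placeholders, and the table properties follow Databricks’ published UniForm settings.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a Delta table with UniForm enabled so Iceberg clients can read it.
# Table, schema, and column names here are hypothetical placeholders.
spark.sql("""
    CREATE TABLE IF NOT EXISTS sales.orders (
        order_id   BIGINT,
        amount     DOUBLE,
        order_date DATE
    )
    USING DELTA
    TBLPROPERTIES (
        'delta.enableIcebergCompatV2' = 'true',
        'delta.universalFormat.enabledFormats' = 'iceberg'
    )
""")

# Data is written once as Delta; UniForm generates Iceberg metadata alongside
# it, so an Iceberg engine can read the same Parquet files through the
# Iceberg REST catalogue endpoint instead of needing a second copy.
```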

Furthermore, with the inclusion of the original Iceberg team, the company plans to expand the scope and ambitions of Delta Lake UniForm. It envisions bringing both the Delta Lake and Iceberg formats closer in the future to a point where their differences won’t matter, according to Ghodsi.

So, with the Tabular acquisition, Databricks appears to be building a pendrive, or USB port of sorts, that can be plugged into AI systems in the future, achieving 100% interoperability.

“It will simplify the developer experience and allow them to move up the stack in the value chain. Instead of worrying about which version of Iceberg or Delta they are using, developers can rest assured that all of that is solved through the UniForm format,” said Nick Eayrs, Databricks’ vice president of field engineering for APJ, in an exclusive interview with AIM.

Eayrs explained that with this, developers will now be able to spend more time on analysis, enrichment, and transformation of the data rather than worrying about version conflicts. 

“This reinforces our commitment to open source. We have open-sourced our Unity Catalog, and we continue to build the default open standard when it comes to the data format,” he added.

The Other Side of the Acquisition 

The Tabular acquisition came just after Databricks’ competitor Snowflake announced it had adopted Apache Iceberg tables as a native format and introduced Polaris, a catalogue for Iceberg tables accessible by any data processing engine that can read the format, such as Spark, Dremio, and even Snowflake itself.
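
For context on what “accessible by any engine” means in practice, the sketch below shows how an external Spark session might be pointed at an Iceberg REST catalogue such as Polaris. The catalogue name (lake), endpoint URI, credentials, table name, and runtime package version are illustrative assumptions; the configuration keys are the standard Apache Iceberg Spark catalogue options.

```python
from pyspark.sql import SparkSession

# Configure a Spark session to talk to an Iceberg REST catalogue.
# Endpoint, credentials, and names below are placeholders, not real values.
spark = (
    SparkSession.builder
    .config("spark.jars.packages",
            "org.apache.iceberg:iceberg-spark-runtime-3.5_2.12:1.5.2")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "rest")
    .config("spark.sql.catalog.lake.uri", "https://catalog.example.com/api/catalog")
    .config("spark.sql.catalog.lake.credential", "<client-id>:<client-secret>")
    .getOrCreate()
)

# Any engine that speaks the Iceberg REST protocol can query the same tables,
# regardless of which platform originally wrote them.
spark.sql("SELECT * FROM lake.sales.orders LIMIT 10").show()
```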

In addition, Microsoft announced an expanded partnership with Snowflake. As part of this, Microsoft Fabric’s OneLake will now support Snowflake’s Iceberg tables and facilitate bi-directional data access between Snowflake and Fabric.

Databricks’ decision to acquire Tabular was spurred by customer demand for better interoperability among data lake formats. Industry observers have also weighed in on the significance of the Tabular purchase in light of Snowflake’s moves.

While the acquisition of Tabular indicates that both Databricks and Snowflake are positioning themselves for AI’s influence on data infrastructure, the purchase has clearly put new pressure on Databricks’ competitors, including Snowflake.

When asked if Databricks is planning to work closely with Snowflake to bring Iceberg and Delta Lake together, Ghodsi added, “Governance in open-source projects involves working closely with the community and those that have committers in the project. And if some of these committers happen to be employed by Snowflake, we’ll work with them as well.”



Sukriti Gupta

Having done her undergrad in engineering and her master’s in journalism, Sukriti likes combining her technical know-how and storytelling to simplify seemingly complicated tech topics in a way everyone can understand.