While AI has hogged much of the limelight in the past few years, Databricks’ products SVP Adam Conway rightly points out that big data has finally made a comeback.
The influx of AI systems has highlighted the importance of data within businesses, with many now putting the data they generate day to day to actual use. Conway has highlighted just how “big” big data has become compared with what the term meant in the 2010s.
“I talk to enterprises regularly that have data lakes in the 10s to 100s of petabytes (PB), even a few enterprises in the over one exabyte (EB) range. Organisations often have single tables in the multi-PB range – in the heyday of ‘big data’ in 2010, 1 PB was huge, now, I would consider it on the smaller side,” he said.
Every Bit Counts
Data has always been a big deal, but the rise of AI has pushed enterprises to take active notice of how they actually handle the data they have.
This is echoed by MotherDuck co-founder and CEO Jordan Tigani, who headed Google’s BigQuery team until 2020. In a post titled ‘Big Data is Dead’, Tigani stated that the often-repeated complaint of big data being “too big” has all but disappeared, thanks to better tools for querying massive amounts of data.
Essentially, data is no longer considered “big”.
“Data sizes may have gotten marginally larger, but hardware has gotten bigger at an even faster rate. It (big data) had a good run, but now we can stop worrying about data size and focus on how we’re going to use it to make better decisions,” he said.
With many businesses adopting AI in how they work, data has become a big factor in how they assess the effectiveness of their products and gain basic insights into their inner workings.
Conway also points out that the ‘big’ in ‘big data’ is deliberate, because with AI, aggregates no longer fulfil the needs of businesses. As noted, every little bit of data counts. “Aggregates are great for BI and reporting, like if you want to look at revenue by customer or by product, but terrible if you want to do AI.
“For example, if you want to predict that a customer will buy an iced drink on a hot day you need all the individual transactions to train that model, you need to join that with weather data and train a model,” he said.
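To make Conway's distinction concrete, here is a minimal Python sketch. It is not Databricks' method: the transactions, weather readings, threshold and all function names are hypothetical, and a simple purchase-rate estimate stands in for a trained model. It shows that a revenue-by-product aggregate carries no link to the weather, while row-level transactions joined with weather data do.

```python
from collections import defaultdict

# Hypothetical raw transactions: (date, customer_id, product, amount).
transactions = [
    ("2024-06-01", "c1", "iced_latte", 4.5),
    ("2024-06-01", "c2", "hot_coffee", 3.0),
    ("2024-06-02", "c1", "hot_coffee", 3.0),
    ("2024-06-02", "c3", "hot_coffee", 3.0),
    ("2024-06-03", "c2", "iced_latte", 4.5),
    ("2024-06-03", "c3", "iced_latte", 4.5),
]

# Hypothetical daily weather: date -> maximum temperature in Celsius.
weather = {"2024-06-01": 31.0, "2024-06-02": 18.0, "2024-06-03": 33.0}

HOT_DAY_C = 28.0                 # assumed cut-off for a "hot day"
ICED_PRODUCTS = {"iced_latte"}   # assumed set of iced drinks


def revenue_by_product(rows):
    """BI-style aggregate: total revenue per product, dates discarded."""
    totals = defaultdict(float)
    for _date, _customer, product, amount in rows:
        totals[product] += amount
    return dict(totals)


def join_with_weather(rows, weather_by_date):
    """Inner join on date: one (is_hot_day, bought_iced) pair per transaction."""
    examples = []
    for date, _customer, product, _amount in rows:
        if date in weather_by_date:
            is_hot = weather_by_date[date] >= HOT_DAY_C
            examples.append((is_hot, product in ICED_PRODUCTS))
    return examples


def iced_rate_by_weather(examples):
    """Share of transactions that were iced drinks, split by hot vs. cool days."""
    counts = {True: [0, 0], False: [0, 0]}  # is_hot -> [iced, total]
    for is_hot, bought_iced in examples:
        counts[is_hot][0] += int(bought_iced)
        counts[is_hot][1] += 1
    return {
        is_hot: (iced / total if total else 0.0)
        for is_hot, (iced, total) in counts.items()
    }


if __name__ == "__main__":
    print(revenue_by_product(transactions))
    print(iced_rate_by_weather(join_with_weather(transactions, weather)))
```

The aggregate only says how much each product earned overall. The joined rows are what allow any estimate of buying behaviour on hot days, which is the kind of input a real model would need at far larger scale.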
So while AI is the reason data has made such a big comeback, at least in how businesses use their own data, AI in turn needs to feed on raw data to work effectively.
Does This Mean Data Wasn’t Relevant Before?
Not really. Data obviously informed a lot of the decisions that businesses made, and it continues to do so today. But with AI, data is now a massive resource, and everyone knows that.
This is borne out by the volume of data acquisition over the past couple of years. In the past year alone, OpenAI has signed deals with several media organisations to make use of their data.
However, as one user points out, “Data needs to be curated!”
AI companies have put a lot of focus on properly annotating and structuring data for training purposes. With this tech more widely available, companies have also been restructuring their own data so that it can easily be used to train the AI they intend to use internally.
“Many of the most important revenue generating or cost saving AI workloads depend on massive data sets. In many cases, there is no AI without big data,” Conway said. And, of course, there is no resurgence of big data without AI.
With Big Data Comes Data Engineering
This has also brought a massive change in data and analyst roles, with data engineering becoming a much sought-after career.
Databricks knows this, which is why it’s currently betting big on data engineering rather than focusing on AI itself. It has also launched tools like LakeFlow to enhance the workflow of data engineers.
This doesn’t come as a surprise. As AIM reported previously, Databricks’ CEO Ali Ghodsi admitted that customers had asked for a focus on data over anything else. “Two years ago, at the CIO Forum, we asked our customers what they wanted most from Databricks, and the majority expressed a need for easier data integration,” he said.
Additionally, as Nick Eayrs, Databricks’ vice president of field engineering for APJ, told AIM, the focus on data engineering gives the building and implementation of AI a solid foundation, given AI’s heavy reliance on data.
This is why Conway advises, “So next time you see the data engineer in your company that works in Spark or Hadoop, ask them what they do, ask what kind of data your company has, and ask what is done with it. You will probably be pleasantly surprised. Big Data is probably quietly transforming your company.”