Exploring the implications of OpenAI’s acquisition of Rockset for the realtime data analytics space.
OpenAI’s acquisition of Rockset is major news in the realtime data arena. But what does this move really signify? Let’s delve into what Rockset solved, why OpenAI bought the company, and what this means for the broader market and for Rockset’s customers.
Rockset is known for powering “in-app search and analytics.” They compete with platforms like ClickHouse, Imply (Apache Druid), StarTree (Apache Pinot), and Elasticsearch. Interestingly, ClickHouse and Tinybird have launched aggressive campaigns encouraging users to migrate from Rockset.
This places Rockset firmly in the realtime search and analytics space. Considering how much data large language models (LLMs) need to recall, Rockset’s value to OpenAI becomes clearer. My own experience with Rockset includes impressive demos of massive JSON ingestion and querying, reminiscent of what Solr and Elasticsearch once offered.
There are several theories about why OpenAI bought Rockset, but it primarily boils down to cost efficiency. Using Rockset’s product as-is would likely cost OpenAI over 20% of the purchase price every year, and I estimate OpenAI spends well north of $100 million a year on data services alone. Sunsetting Rockset’s cloud service likely reflects operational costs that outweigh the revenue it generates. Most startups struggle with margins, and maintaining revenue in the tens of millions makes little sense when the team could be redirected to fuel OpenAI’s ambitious projects. Rockset also solved hard problems: building robust multi-tenant data isolation is challenging, and Rockset had some of the best low-effort ingestion capabilities available.
This acquisition is a significant validation of investment in the realtime analytics and search space. Many large players are vying for dominance in the LLM sector, and with Rockset and Tabular (now part of Databricks) exiting, certain investor portfolios are likely to see a jump in value.
However, this development also highlights why open-source tools might be the more strategic choice moving forward. Transitioning away from a proprietary tool like Rockset is difficult, and customers now have to do it on the vendor’s timeline rather than their own. Rockset built an excellent product, and its disappearance will be felt.
Personally, I hope to see more of this work migrate to Apache Flink and Apache Paimon. For recall, enrichment, and training pipelines, having exactly-once stream and batch semantics on a realtime data lake can enhance user experience significantly.