The Key Tech That Enables Cloudera’s New Lakehouse

Cloudera today introduced CDP One, its newest software-as-a-service (SaaS) lakehouse offering. For the first time, Cloudera is taking over management of the data platform on behalf of its customers. It is also Cloudera’s first official foray into the world of data lakehouses, and it is enabled by support for a key technology component.

It’s been three years since Cloudera launched the Cloudera Data Platform (CDP), which marked the company’s transition from its past as a Hadoop distributor to its future as a cloud-based platform-as-a-service (PaaS) provider.

As a merger of the Cloudera and Hortonworks Hadoop distributions, CDP resembled earlier Hadoop clusters. Data processing engines like Hive, Impala, Spark, and MapReduce were still there. But CDP gave users the option to use interesting new components in the public cloud, such as Kubernetes instead of YARN for resource scheduling, and S3 for the storage layer.

With CDP One, Cloudera is now taking the final step to offer the system as a managed service in the cloud, simplifying the day-to-day management of the platform, said Cloudera CTO Ram Venkatesh.

“We had a PaaS offering in place for over two years, not SaaS,” says Venkatesh. “Cloudera ran the control plane, but the actual workloads ran in the customers’ accounts. Now with SaaS, everything runs inside Cloudera and it’s zero ops for the customer, completely managed by Cloudera.”

CDP One is now available on AWS, with beta on Microsoft Azure. Google Cloud support will follow, says Venkatesh.

As far as the lakehouse goes, it’s a branding effort on Cloudera’s part. While Cloudera’s competitor Databricks popularized the term, it has since been adopted by many other cloud platform providers (including AWS, Google Cloud, and Snowflake) to refer to the integration of a data lake and a data warehouse for analytics processing.

“We’re an open source company, so we embrace innovation wherever we see it,” Venkatesh says of the data lakehouse name and concept. “It’s a great way to frame it in a way that our customers can understand.”

Cloudera can claim to have been the first vendor with a lakehouse, dating back to the introduction of Apache Hive in 2012, according to Venkatesh. Exabytes of data are still stored in Hive-organized lakes, which are supported by all the hyperscalers, he said.

However, at this point, he says, the Hive Metastore is not the ideal logical underpinning for a modern lakehouse architecture. Other table formats have appeared that overcome Hive’s technical limitations, including Databricks’ own Delta Lake and, more recently, Apache Iceberg.

“The problem is that this mapping between warehouse and lake has always been tightly coupled or biased towards one execution engine,” says Venkatesh. “So when Hive does it, it works great for Hive. And with Spark, if you look hard enough, you can make it work.”

“Now with Spark and Delta Lake, it works perfectly if your whole world is monochromatic Spark,” he continues. “But what we realized is that there’s a piece in the middle, this glue between the warehouse and the lake, [which] we call an open table format, and it is a fundamental concept in its own right.”

Cloudera’s preferred open table format is Apache Iceberg. In fact, Cloudera announced support for Iceberg back in June (at the Databricks annual conference, naturally). Iceberg support is now included in CDP One, which allows customers to query their data with whichever query engine they want to use, without worrying about data loss, which was a common occurrence when the Hive Metastore was in charge of the metadata.

“With Apache Iceberg, this is the first time that this layer is not a slave to an engine,” says Venkatesh. “So at the top end, Iceberg works with Hive, it works with Spark, it works with Impala, it works with Presto. It works with things that we don’t even support.”
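The idea of a table layer that belongs to no single engine can be illustrated with a toy model. The sketch below is not Iceberg’s API; `Snapshot`, `ToyTable`, and `append_with_retry` are hypothetical names invented for this illustration. It assumes a design in which each table version is an immutable list of data files and a writer may only commit on top of the version it actually read, which is one common way to keep two engines from silently overwriting each other’s work.

```python
from dataclasses import dataclass
from typing import Iterable, Tuple


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: an id plus its data files."""
    snapshot_id: int
    data_files: Tuple[str, ...]


class ToyTable:
    """Hypothetical stand-in for an engine-neutral table metadata layer."""

    def __init__(self) -> None:
        self._current = Snapshot(snapshot_id=0, data_files=())

    def current_snapshot(self) -> Snapshot:
        return self._current

    def commit(self, base: Snapshot, new_files: Tuple[str, ...]) -> bool:
        """Append files only if nobody has committed since `base` was read."""
        if base.snapshot_id != self._current.snapshot_id:
            return False  # stale base: caller must re-read and try again
        self._current = Snapshot(base.snapshot_id + 1,
                                 base.data_files + new_files)
        return True


def append_with_retry(table: ToyTable, engine: str,
                      new_files: Iterable[str],
                      max_attempts: int = 3) -> Snapshot:
    """Re-read the latest snapshot and retry until the commit is accepted."""
    files = tuple(new_files)
    for _ in range(max_attempts):
        base = table.current_snapshot()
        if table.commit(base, files):
            return table.current_snapshot()
    raise RuntimeError(f"{engine} could not commit after {max_attempts} attempts")


def demo() -> Snapshot:
    """Two pretend engines write to the same table without losing data."""
    table = ToyTable()
    seen_by_spark = table.current_snapshot()
    seen_by_hive = table.current_snapshot()

    # The first writer wins; the second is rejected because its base is stale.
    assert table.commit(seen_by_spark, ("part-spark-0.parquet",))
    assert not table.commit(seen_by_hive, ("part-hive-0.orc",))

    # After re-reading, the second writer's files land on top of the first's.
    return append_with_retry(table, "hive", ["part-hive-0.orc"])
```

In this model neither writer needs to know which engine produced the other’s files; the only shared contract is the snapshot itself. A real table format adds durable storage, schemas, and much more, none of which is modeled here.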

At the bottom end, Iceberg allows CDP customers to store their data in the on-disk file format of their choice (CSV, Parquet, ORC, or Avro) on any file system, including HDFS, S3, Azure Data Lake Storage (ADLS), or Google Cloud Storage (GCS). Support for ADLS and GCS is coming.
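To make the “bottom end” concrete, here is a minimal sketch of a table whose file listing mixes formats and storage systems. It is an assumption-laden illustration, not Iceberg’s actual manifest layout: the `DataFile` record, the `MANIFEST` contents, the bucket and host names, and the `summarize` helper are all made up for the example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, List, Tuple
from urllib.parse import urlparse


@dataclass(frozen=True)
class DataFile:
    """Hypothetical per-file entry: where it lives, its format, its size."""
    uri: str
    file_format: str
    record_count: int


# Made-up listing for one logical table spread across two storage systems.
MANIFEST: List[DataFile] = [
    DataFile("hdfs://namenode/warehouse/sales/part-0.parquet", "parquet", 1000),
    DataFile("s3://example-bucket/sales/part-1.orc", "orc", 500),
    DataFile("s3://example-bucket/sales/part-2.avro", "avro", 250),
]


def summarize(manifest: List[DataFile]) -> Tuple[Dict[str, int], Dict[str, int]]:
    """Count records per storage scheme and per file format."""
    by_store: Counter = Counter()
    by_format: Counter = Counter()
    for data_file in manifest:
        # The URI scheme (hdfs, s3, ...) identifies the storage system.
        by_store[urlparse(data_file.uri).scheme] += data_file.record_count
        by_format[data_file.file_format] += data_file.record_count
    return dict(by_store), dict(by_format)
```

Because every file carries its own location and format, a reader in this sketch can plan a scan over the whole table without assuming a single storage system or a single file format.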

Iceberg checks all the boxes that Cloudera wants in an open-source software product designed to enable enterprise-scale analytics, Venkatesh said. It’s open source, has an active community around it, and isn’t tied to a single vendor. “So how could we not be part of that innovation?” he says.

But Iceberg’s ability to support multiple use cases with a lakehouse design, and most importantly its seamless support for multiple data engines, is what really sealed the deal and made Cloudera throw its weight behind it and add it as a feature in its Shared Data Experience (SDX) layer.

“We do really well when customers need to run more than one kind of analytics on a data set,” says the CTO. “Typically, if they have a single use case, a single dataset, or only SQL, we might not be the best fit for them. But if they have a lot of data prep, real-time and batch data, they have SQL, they have some machine learning, they have some time-series analytics, some streaming analytics (and this is what big enterprise data platforms look like), they’re combining data in ways you never expected when it first came in.”

Hybrid cloud is a strength for Cloudera CDP, says CTO Ram Venkatesh (Nattapol_Sritongcom/Shutterstock)

“When customers do this multi-function analytics, the connections between these engines become very apparent,” he continues. “Hive, Impala, and Spark didn’t work together as well as customers expected. This was a real pain point for our customers. Now with Iceberg, you’ll see us embrace making this layer open.”

Another advantage Cloudera hopes to exploit in the future is the ability to run on-premises. The Santa Clara, Calif.-based vendor’s ability to run its lakehouse under an on-premises, public cloud, or SaaS delivery model gives it an edge over its cloud-only competitors.

“It’s crucial,” says Venkatesh. “For our customers, one size doesn’t fit all. Even Amazon says in its own research that the cloud is really gaining adoption [and that] by 2025, half of the world’s data will be in the public cloud. It’s a great story. I love that story. But what about the other half?”

According to Venkatesh, many customers do not run their lakehouse in the cloud. Whether it’s a matter of scale, geography, or regulations, there are enterprise accounts that need to keep their data on-premises.

“We’re in a unique position with this dynamic, which we think is the superpower of the cloud,” he says. “When that’s what our customers want, we’re hybrid.”

