Every company aims to be data driven, but bringing accurate data in front of the right stakeholders in a timely manner can be complex. The challenge grows when the source data resides in different technologies, each with its own access interfaces and data formats.
This is where the combination of Aiven for Apache Kafka® and Google BigQuery excels: it provides the ability to source data from a wide ecosystem of tools and, in streaming mode, push it to BigQuery, where datasets can be organized, manipulated and queried.
From data sources to Apache Kafka with Kafka Connect
Aiven for Apache Kafka offers the ability to create a managed Kafka Connect cluster. The range of 30+ available connectors enables integrating Kafka with a wide set of technologies, as both source and sink, using a JSON configuration file. Moreover, if the connector for a particular technology isn't in the list, an integration with a self-managed Kafka Connect cluster provides complete freedom in connector selection, while keeping the benefit of the fully managed Apache Kafka cluster.
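As an illustration, a Kafka Connect connector is declared through a small JSON document along these lines; the name, class and topic below are hypothetical placeholders, and the full property set depends on the connector chosen:

```json
{
  "name": "my-connector",
  "connector.class": "<fully.qualified.ConnectorClass>",
  "tasks.max": "1",
  "topics": "my-topic"
}
```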
If the data source is a database, connectors like the Debezium source for PostgreSQL can enable a reliable and fast change data capture mechanism using the native database replication features, thereby adding minimal load on the source system.
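A Debezium PostgreSQL source configuration can be sketched as follows; the host, credentials and database names are placeholders, and the exact property names vary between Debezium versions:

```json
{
  "name": "pg-cdc-source",
  "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
  "database.hostname": "pg.example.com",
  "database.port": "5432",
  "database.user": "replicator",
  "database.password": "***",
  "database.dbname": "shop",
  "topic.prefix": "shop",
  "plugin.name": "pgoutput"
}
```

With `plugin.name` set to `pgoutput`, the connector relies on PostgreSQL's native logical replication rather than a separately installed decoding plugin.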
Data in Apache Kafka
During the ingestion phase, to optimize throughput, connectors can use the Avro data format and store the data’s schema in Karapace, Aiven’s open source tool for schema registry and REST API endpoints.
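In a connector configuration, this typically amounts to pointing the Avro converter at the Karapace endpoint, roughly as below; the URL and credentials are placeholders:

```json
{
  "value.converter": "io.confluent.connect.avro.AvroConverter",
  "value.converter.schema.registry.url": "https://karapace.example.com:443",
  "value.converter.basic.auth.credentials.source": "USER_INFO",
  "value.converter.basic.auth.user.info": "avnadmin:***"
}
```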
Data in Apache Kafka is stored in topics which can have an associated retention period defining the amount of time or space for which the data will be kept. The topics can be read by one or more consumers independently or in competition as part of the same application (“consumer group” in Apache Kafka terms).
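Both retention limits can be set with standard topic-level configuration; the values below are illustrative:

```properties
# keep data for 7 days or up to 1 GiB per partition, whichever limit is reached first
retention.ms=604800000
retention.bytes=1073741824
```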
If the data needs reshaping before it lands on the target datastore, Aiven for Apache Flink allows such transformations to be performed in streaming mode using SQL statements. Common examples include cleansing or enrichment projects with data coming from different technologies.
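Such a transformation can be sketched in Flink SQL; the table and column names are hypothetical, and the source and sink tables would be mapped to Kafka topics in the Flink application:

```sql
-- cleanse raw order events before they reach the sink topic
INSERT INTO orders_clean
SELECT order_id, UPPER(country) AS country, amount, event_ts
FROM orders_raw
WHERE amount > 0;
```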
Push data to Google BigQuery
Once the data is in the right shape to be analyzed, the Apache Kafka topic can be pushed to BigQuery in streaming mode using the dedicated sink connector. The connector has a wide range of configuration options, including the timestamp to be used for partitioning and the thread pool size defining the number of concurrent writing threads.
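A minimal configuration for the open source BigQuery sink connector might look like the sketch below; the project, dataset and field names are placeholders, and option names should be checked against the connector version in use:

```json
{
  "name": "bigquery-sink",
  "connector.class": "com.wepay.kafka.connect.bigquery.BigQuerySinkConnector",
  "topics": "orders_clean",
  "project": "my-gcp-project",
  "defaultDataset": "analytics",
  "threadPoolSize": "10",
  "timestampPartitionFieldName": "event_ts"
}
```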
The data, arriving in streaming mode via Apache Kafka, now lands in one or more BigQuery tables, ready for further analysis and processing. BigQuery offers a rich set of SQL functions for parsing nested datasets, applying complex geographical transformations, and even training and using machine learning models, among others. The depth of BigQuery's SQL functions enables analysts, data engineers and data scientists to perform their work on a single platform using the common SQL language.
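For instance, repeated (array) fields landed from Kafka can be flattened directly in SQL with BigQuery's UNNEST; the project, dataset, table and column names below are hypothetical:

```sql
-- one output row per item inside each order's repeated 'items' field
SELECT o.order_id, i.sku, i.quantity
FROM `my-gcp-project.analytics.orders_clean` AS o,
     UNNEST(o.items) AS i
WHERE i.quantity > 0;
```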
A streaming solution for fast analytics
With the wide set of source connectors available and its streaming capabilities, Aiven for Apache Kafka is the perfect fit to enable the data to flow from a huge variety of data sources to BigQuery for analytics.
One example of a customer using this pattern is the retail media platform Streaem, part of Wunderman Thompson. Streaem provides a self-service retail media platform for retailers and their brands to monetise areas of their site and in-store digital assets by combining intelligence about their products and signals from their customers along with campaign information provided by advertisers. For example, a user might type "Coke" into a search box and, as well as seeing the regular results, they will also see some sponsored listings. Then, as they browse around the site, there could be promoted products based on their previous interactions.
Streaem are fully committed to using Google Cloud as their platform of choice, but their technology is event-driven and based around Kafka as a message broker, which is not natively available on Google Cloud. Using Aiven's Apache Kafka offering on top of Google Cloud lets Streaem get the best of both worlds: industry-standard event streaming on their preferred cloud, without the headache of managing Kafka themselves. With multiple microservices deployed, all of which need a consistent and up-to-date view of the world, Kafka is an obvious service to place at the center of their architecture, making sure everything has the latest information in a way that will scale effortlessly as Streaem itself reaches new heights.
“At Streaem we use Kafka as a core part of our platform where event-based data is a key enabler for the services we deliver to our clients” says Garry Turkington, CTO. “Using hosted data services on GCP allows us to focus on our core business logic while we rely on Aiven to deliver a high-performance and reliable platform we can trust.”
Analytics is still a challenge in a Kafka-only world, so Streaem uses a managed open source Kafka connector on the Aiven platform to stream the microservices data into Google BigQuery. This means that data about customer activity, keyword auctions, or anything else in the live platform is available with low latency in BigQuery, powering Streaem's reporting dashboards and providing up-to-date aggregations for live decisions to be made. By using Google Cloud Platform, Aiven for Apache Kafka, and BigQuery, Streaem can be confident that their production systems are running smoothly whilst they concentrate their efforts on growing their business.
Other use cases
Aiven for Apache Kafka along with Google Cloud BigQuery is driving crucial insights across a range of industry verticals and use cases. For example:
- Retail: Demand Planning with BQML, Recommendation Engines, Product Search
  - Aiven is leveraged at a large European retail chain for open source database and event streaming infrastructure (Aiven for Apache Kafka, Aiven for OpenSearch, Aiven for PostgreSQL, Aiven for Redis). The data is then fed to trained models in BigQuery ML to recommend products to purchase. These models can be exposed as APIs managed in Vertex AI for production applications.
- E-commerce: Real-Time Dynamic Pricing
  - A global travel booking site uses Aiven for data streaming infrastructure (Aiven for Apache Kafka), handling global pricing and demand data in near real time, and Aiven for OpenSearch for SIEM and application search use cases. Data then flows into BigQuery for analytics, giving the customer a best-in-class enterprise data warehouse.
- Gaming: Player Analytics
  - Aiven powers data streaming (via Aiven for Apache Kafka) for a Fortune 500 gaming company, supporting multiple gaming titles and more than 100 million players globally. Analytics in BigQuery drives critical insights using player metadata.
Conclusion / Next Steps
The combination of Aiven for Apache Kafka and Google BigQuery drives analytics on the latest data in near real time, minimizing the time to insight and maximizing the impact. Customers of Aiven and Google are already taking advantage of this powerful combination, and seeing the benefits to their business. If you would like to experience this for yourself, sign up for Aiven and use the following links to learn more:
- Aiven for Apache Kafka to discover the features, plans and options available for a managed Apache Kafka service
- Apache Kafka BigQuery sink connector to review the settings and examples of pushing data from Apache Kafka to BigQuery
- To learn more about Google Cloud BigQuery, see the BigQuery documentation
- Ready to give it a try? Check out Aiven's listing on Google Cloud Marketplace, and let us know what you think
By: Kevin Bowman (Solution Architect, Aiven) and Ritika Suri (Technology Partnerships Director, Google Cloud)
Originally published at Google Cloud Blog