Unify streaming and analytical data with Amazon Data Firehose and Amazon SageMaker Lakehouse


Organizations increasingly need to derive insights from their data in real time, while maintaining the ability to perform analytics. This dual requirement presents a significant challenge: how to effectively bridge the gap between streaming and analytical workloads without creating complex, hard-to-maintain data pipelines. In this post, we demonstrate how to simplify this process using Amazon Data Firehose (Firehose) to deliver streaming data directly to Apache Iceberg tables in Amazon SageMaker Lakehouse, creating a modern pipeline that reduces complexity and maintenance overhead.

Streaming data is central to AI and machine learning (ML) applications that learn and adapt in real time, which is essential for workloads that require immediate insights or dynamic responses to changing conditions. This creates new opportunities for business agility and innovation. Key use cases include predicting equipment failures based on sensor data, monitoring supply chain processes in real time, and enabling AI applications to respond dynamically to changing conditions. Real-time streaming data helps customers make quick decisions and fundamentally changes how businesses compete in real-time markets.

Amazon Data Firehose seamlessly acquires, transforms, and delivers data streams to lakehouses, data lakes, data warehouses, and analytics services, with automatic scaling and delivery within seconds. For analytical workloads, the lakehouse architecture has emerged as an effective solution, combining the best elements of data lakes and data warehouses. Apache Iceberg, an open table format, enables this transformation by providing transactional guarantees, schema evolution, and efficient metadata handling that were previously only available in traditional data warehouses. SageMaker Lakehouse unifies your data across Amazon Simple Storage Service (Amazon S3) data lakes, Amazon Redshift data warehouses, and other sources, and gives you the flexibility to access your data with Iceberg-compatible tools and engines. By using SageMaker Lakehouse, organizations can harness the power of Iceberg while benefiting from the scalability and flexibility of a cloud-based solution. This integration removes traditional barriers between data storage and ML processes, so data workers can work directly with Iceberg tables in their preferred tools and notebooks.

In this post, we show you how to create Iceberg tables in Amazon SageMaker Unified Studio and stream data to these tables using Firehose. With this integration, data engineers, analysts, and data scientists can work together seamlessly to build end-to-end analytics and ML workflows using SageMaker Unified Studio, removing traditional silos and accelerating the journey from raw data to production ML models.

Solution overview

The following diagram illustrates the architecture of how Firehose delivers real-time data to SageMaker Lakehouse.

This post includes an AWS CloudFormation template to set up the supporting resources so that Firehose can deliver streaming data to Iceberg tables. You can review and customize it to suit your needs. The template performs the following operations:

Prerequisites

For this walkthrough, you should have the following prerequisites:

After creating the prerequisites, verify that you can log in to SageMaker Unified Studio and that the project was created successfully. Each project created in SageMaker Unified Studio gets a project location and a project IAM role, as highlighted in the following screenshot.

Create an Iceberg table

For this solution, we use Amazon Athena as the engine for the query editor. Complete the following steps to create an Iceberg table:

  1. In SageMaker Unified Studio, on the Build menu, choose Query Editor.

  2. Choose Athena as the query engine and choose the project's AWS Glue database.

  3. Use the following SQL statement to create the Iceberg table. Be sure to provide your project's AWS Glue database and your project's Amazon S3 location (you can find them on the project overview page):

CREATE TABLE firehose_events (
  type struct<device: string, event: string, action: string>,
  customer_id string,
  event_timestamp timestamp,
  region string)
LOCATION '/iceberg/events'  -- prefix this path with your project's S3 location
TBLPROPERTIES (
  'table_type'='iceberg',
  'write_compression'='zstd'
);
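If you prefer to script this step rather than use the query editor, you can submit the same DDL through the Athena API. The following is a minimal boto3 sketch, not the exact procedure from this post; the database name, bucket, and result location are hypothetical placeholders that you would replace with your project's values.

import boto3

athena = boto3.client("athena")

# Hypothetical placeholders: substitute your project's AWS Glue database
# and an S3 location for Athena query results.
DATABASE = "your_project_database"
OUTPUT_LOCATION = "s3://your-project-bucket/athena-results/"

ddl = """
CREATE TABLE firehose_events (
  type struct<device: string, event: string, action: string>,
  customer_id string,
  event_timestamp timestamp,
  region string)
LOCATION 's3://your-project-bucket/iceberg/events'
TBLPROPERTIES ('table_type'='iceberg', 'write_compression'='zstd')
"""

# Athena runs the statement asynchronously and returns an execution ID.
response = athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": DATABASE},
    ResultConfiguration={"OutputLocation": OUTPUT_LOCATION},
)
print(response["QueryExecutionId"])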

Deploy the supporting resources

The next step is to deploy the required resources into your AWS environment by using a CloudFormation template. Complete the following steps:

  1. Choose Launch Stack.
  2. Choose Next.
  3. Leave the stack name as firehose-lakehouse.
  4. Provide the user name and password that you want to use to access the Amazon Kinesis Data Generator.
  5. For DatabaseName, enter the name of the AWS Glue database.
  6. For ProjectBucketName, enter the name of the project bucket (located on the SageMaker Unified Studio project details page).
  7. For TableName, enter the name of the table created in SageMaker Unified Studio.
  8. Choose Next.
  9. Select I acknowledge that AWS CloudFormation might create IAM resources and choose Next.
  10. Complete the stack creation.
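If you prefer the AWS CLI or an SDK over the console wizard, you can create the stack programmatically. The following boto3 sketch is illustrative only: the template file name and parameter values are placeholders, and the exact parameter keys may differ, so verify them against the template you downloaded.

import boto3

cfn = boto3.client("cloudformation")

# Placeholder file name for the template downloaded from this post.
with open("firehose-lakehouse.yaml") as f:
    template_body = f.read()

cfn.create_stack(
    StackName="firehose-lakehouse",
    TemplateBody=template_body,
    # Parameter keys mirror the console prompts above; confirm them
    # against the template before running.
    Parameters=[
        {"ParameterKey": "Username", "ParameterValue": "kdg-user"},
        {"ParameterKey": "Password", "ParameterValue": "choose-a-strong-password"},
        {"ParameterKey": "DatabaseName", "ParameterValue": "your_project_database"},
        {"ParameterKey": "ProjectBucketName", "ParameterValue": "your-project-bucket"},
        {"ParameterKey": "TableName", "ParameterValue": "firehose_events"},
    ],
    Capabilities=["CAPABILITY_IAM"],  # the template creates IAM resources
)

# Block until stack creation finishes.
cfn.get_waiter("stack_create_complete").wait(StackName="firehose-lakehouse")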

Create a Firehose stream

Complete the following steps to create a Firehose stream to deliver data to the Iceberg table:

  1. On the Firehose console, choose Create Firehose stream.

  2. For Source, choose Direct PUT.
  3. For Destination, choose Apache Iceberg Tables.

This example uses Direct PUT as the source, but you can follow the same steps for other Firehose sources, such as Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). A minimal producer sketch for Direct PUT appears after these steps.

  4. For Firehose stream name, enter firehose-iceberg-events.

  5. Collect the database name and table name from the SageMaker Unified Studio project to use in the next step.

  6. In the Destination settings section, enable Inline parsing for routing information and provide the database name and table name from the previous step.

Because we want to deliver data to a single database and table, we provide fixed database and table names. Amazon Data Firehose can also route records to different tables based on the content of the record. For more information, refer to Route incoming records to different Iceberg tables.

  7. Under Buffer hints, reduce the buffer size to 1 MiB and the interval to 60 seconds. You can fine-tune these settings based on your latency needs.

  8. In the Backup settings section, enter the S3 bucket created by the CloudFormation template (s3://firehose-demo-iceberg--) and an error output prefix (error/events-1/).

  9. In the Advanced settings section, enable Amazon CloudWatch error logging for troubleshooting, and for Existing IAM roles, choose the role starting with Firehose-Iceberg-Stack-FirehoseIamRole-*, created by the CloudFormation template.
  10. Choose Create Firehose stream.
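Because the stream uses Direct PUT, any producer that can call the Firehose PutRecord API can write to it. Here is a minimal boto3 sketch that sends a single event shaped like the Kinesis Data Generator template shown in the next section; the stream name matches the one created above, and the field values are sample data.

import json
from datetime import datetime, timezone

import boto3

firehose = boto3.client("firehose")

# One sample event matching the firehose_events table schema.
event = {
    "type": {"device": "mobile", "event": "firehose_events_1", "action": "update"},
    "customer_id": "42",
    "event_timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S.%f")[:-3],
    "region": "pdx",
}

# Firehose treats the record as an opaque blob; newline-delimited JSON
# keeps records separable downstream.
firehose.put_record(
    DeliveryStreamName="firehose-iceberg-events",
    Record={"Data": (json.dumps(event) + "\n").encode("utf-8")},
)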

Generate streaming data

Use the Amazon Kinesis Data Generator to publish data records to your Firehose stream:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane and open your stack.
  2. Select the nested stack for the data generator and go to the Outputs tab.
  3. Choose the Amazon Kinesis Data Generator URL.

  4. Enter the credentials that you defined when you deployed the CloudFormation stack.

  5. Choose the AWS Region where you deployed the CloudFormation stack and choose your Firehose stream.
  6. For the record template, replace the default values with the following code:
{
  "type": {
    "device": "{{random.arrayElement(["mobile", "desktop", "tablet"])}}",
    "event": "{{random.arrayElement(["firehose_events_1", "firehose_events_2"])}}",
    "action": "update"
  },
  "customer_id": "{{random.number({"min": 1, "max": 1500})}}",
  "event_timestamp": "{{date.now("YYYY-MM-DDTHH:mm:ss.SSS")}}",
  "region": "{{random.arrayElement(["pdx", "nyc"])}}"
}

  7. Before sending data, choose Test template to see an example of the payload.
  8. Choose Send data.

You can monitor the health of the Firehose stream while the data is being sent.
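One way to verify that records are arriving is to read the stream's IncomingRecords metric from Amazon CloudWatch. A minimal sketch, assuming the stream name created earlier:

from datetime import datetime, timedelta, timezone

import boto3

cloudwatch = boto3.client("cloudwatch")

now = datetime.now(timezone.utc)
stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/Firehose",
    MetricName="IncomingRecords",
    Dimensions=[{"Name": "DeliveryStreamName", "Value": "firehose-iceberg-events"}],
    StartTime=now - timedelta(minutes=15),
    EndTime=now,
    Period=300,  # 5-minute buckets
    Statistics=["Sum"],
)

# Print datapoints oldest first.
for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], point["Sum"])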

Query the table in SageMaker Unified Studio

Now that Firehose is delivering data to SageMaker Lakehouse, you can analyze this data in SageMaker Unified Studio using various AWS analytics services.
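For example, you can run a quick aggregation from the query editor, or reuse the start_query_execution pattern shown earlier and fetch the results programmatically. A minimal sketch, again with placeholder database and result-location values:

import time

import boto3

athena = boto3.client("athena")

# Placeholders: substitute your project's Glue database and an S3
# location for Athena query results.
execution = athena.start_query_execution(
    QueryString="SELECT region, count(*) AS events FROM firehose_events GROUP BY region",
    QueryExecutionContext={"Database": "your_project_database"},
    ResultConfiguration={"OutputLocation": "s3://your-project-bucket/athena-results/"},
)
query_id = execution["QueryExecutionId"]

# Poll until the query reaches a terminal state.
while True:
    status = athena.get_query_execution(QueryExecutionId=query_id)
    state = status["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    rows = athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
    for row in rows:  # the first row contains the column headers
        print([col.get("VarCharValue") for col in row["Data"]])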

Clean up

It's a good practice to clean up the resources created as part of this post to avoid additional costs. Complete the following steps:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select the firehose-lakehouse* stack, and on the Actions menu, choose Delete.
  3. In SageMaker Unified Studio, delete the domain created for this post.

Conclusion

Streaming data enables models to make predictions and decisions based on the latest information, which is essential for time-sensitive applications. By incorporating real-time data, models can make more accurate predictions and decisions. Streaming can also help organizations avoid the costs associated with storing and processing large historical data sets, because it focuses on the most relevant data. Amazon Data Firehose makes it straightforward to bring real-time streaming data into Iceberg data lakes and unify it with other data assets in SageMaker Lakehouse, making streaming data available to the analytics and AI services in SageMaker Unified Studio for real-time use cases. Try this solution for your own use case, and share your feedback and questions in the comments.


About the authors

Kalyan Janaki is a Senior Big Data & Analytics Specialist with Amazon Web Services. He helps customers architect and build highly scalable, performant, and secure cloud-based solutions on AWS.

Phaneendra Vuliyaragoli is a Product Management Lead for Amazon Data Firehose at AWS. In this role, Phaneendra leads the product and go-to-market strategy for Amazon Data Firehose.

Maria Ho is a Product Marketing Manager for streaming and messaging services at AWS. She works on services including Amazon Managed Streaming for Apache Kafka (Amazon MSK), Amazon Managed Service for Apache Flink, Amazon Data Firehose, Amazon Kinesis Data Streams, Amazon MQ, Amazon Simple Queue Service (Amazon SQS), and Amazon Simple Notification Service (Amazon SNS).
