Redshift via Firehose
What is Redshift?
Amazon Redshift is a fast, fully managed, petabyte-scale data warehouse solution provided by AWS. It allows businesses to analyze and query large datasets stored in scalable storage systems. With its high performance and integration capabilities, Redshift is ideal for data warehousing and big data analytics.
Why Use the Kinesis Firehose-Redshift Path?
Amazon Kinesis Data Firehose provides a direct, reliable, and efficient method to load streaming data into Redshift. Here are the reasons why this integration is recommended:
- Direct Integration: Firehose can accept data directly from MetaRouter’s forwarder, bypassing the need for an intermediary service like Kinesis Data Streams, simplifying the pipeline.
- Intermediary S3 Bucket: Firehose temporarily stores data in an S3 bucket before using the Redshift COPY command to load the data in bulk, ensuring reliable delivery.
- Error Handling: Data that cannot be delivered to Redshift is stored in the S3 bucket for up to 7 days, allowing multiple retry attempts before being discarded.
- Scalability: Supports size-based or time-based buffering, so records are batched before each transfer to Redshift rather than loaded one at a time.
- Ease of Setup: AWS handles most complexities, leaving minimal configurations for the user.
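As a rough illustration of how the points above map onto Firehose settings, the sketch below builds the dictionary that boto3's `create_delivery_stream` accepts as `RedshiftDestinationConfiguration`. The ARNs, JDBC URL, table name, and credentials are placeholders, the helper names `build_redshift_destination` and `create_stream` are hypothetical, and the `CopyOptions` value assumes JSON-formatted records.

```python
def build_redshift_destination(
    role_arn: str,
    jdbc_url: str,
    table: str,
    username: str,
    password: str,
    bucket_arn: str,
    buffer_mib: int = 1,
    buffer_seconds: int = 60,
) -> dict:
    """Build a RedshiftDestinationConfiguration for a Firehose delivery stream."""
    return {
        "RoleARN": role_arn,
        "ClusterJDBCURL": jdbc_url,
        # Firehose issues a Redshift COPY from the intermediary S3 bucket.
        "CopyCommand": {
            "DataTableName": table,
            "CopyOptions": "FORMAT AS JSON 'auto'",  # assumption: JSON records
        },
        "Username": username,
        "Password": password,  # placeholder; load from a secret store in practice
        # The intermediary bucket and the size/time buffering thresholds.
        "S3Configuration": {
            "RoleARN": role_arn,
            "BucketARN": bucket_arn,
            "BufferingHints": {
                "SizeInMBs": buffer_mib,
                "IntervalInSeconds": buffer_seconds,
            },
        },
    }


def create_stream(name: str, config: dict) -> dict:
    """Create a Direct PUT delivery stream (needs boto3 and AWS credentials)."""
    import boto3  # imported lazily so the builder above runs without the SDK

    client = boto3.client("firehose")
    return client.create_delivery_stream(
        DeliveryStreamName=name,
        DeliveryStreamType="DirectPut",
        RedshiftDestinationConfiguration=config,
    )
```

`DeliveryStreamType="DirectPut"` corresponds to the direct integration described above, where the forwarder writes to Firehose without a Kinesis Data Stream in between.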
Requirements
- AWS Services:
  - Amazon Redshift (Serverless or Provisioned Cluster).
  - Amazon Kinesis Data Firehose.
  - Amazon S3 bucket for intermediary data storage.
- IAM Configuration:
  - Roles with the following permissions:
    - AmazonKinesisFirehoseFullAccess
    - AmazonS3FullAccess
    - AmazonRedshiftAllCommandsFullAccess
    - AmazonRedshiftDataFullAccess
    - AmazonRedshiftFullAccess
    - CloudWatchLogsFullAccess
- Network Configuration:
  - A properly configured Virtual Private Cloud (VPC) with appropriate security group rules to allow Firehose access to Redshift.
- Products and services used in this path:
  - Kinesis Data Firehose
  - S3 Bucket
  - Redshift Serverless or Provisioned Cluster
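For the IAM requirement, a role that Firehose can use needs a trust policy naming the Firehose service as its principal, plus the permissions listed above. The following is a minimal sketch of that setup as plain data. The helper names are hypothetical, and the ARN format assumes these are AWS managed policies on the default policy path; verify each ARN in the IAM console before attaching it.

```python
import json

# Managed policy names taken from the requirements list above.
REQUIRED_POLICIES = [
    "AmazonKinesisFirehoseFullAccess",
    "AmazonS3FullAccess",
    "AmazonRedshiftAllCommandsFullAccess",
    "AmazonRedshiftDataFullAccess",
    "AmazonRedshiftFullAccess",
    "CloudWatchLogsFullAccess",
]


def firehose_trust_policy() -> dict:
    """Trust policy that lets the Firehose service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "firehose.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def managed_policy_arn(name: str) -> str:
    """Assumed ARN format for an AWS managed policy on the default path."""
    return f"arn:aws:iam::aws:policy/{name}"


if __name__ == "__main__":
    # Print the documents so they can be pasted into the console or a template.
    print(json.dumps(firehose_trust_policy(), indent=2))
    for policy in REQUIRED_POLICIES:
        print(managed_policy_arn(policy))
```

The full-access policies mirror the list above; a production role would typically be narrowed to the specific bucket, delivery stream, and Redshift resources involved.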
Considerations
- Redshift Provisioned Cluster: Requires upfront capacity planning and manual cluster management, with pricing based on reserved or on-demand instances.
- Redshift Serverless: Automatically scales based on workload demands, eliminating the need for manual cluster management, with pay-as-you-go pricing.
- Pricing Resources:
Limitations and Caveats
- Single Table Output: Each Firehose delivery stream writes to a single Redshift table. To populate multiple tables, consider:
  - Staging Table Approach: Use a staging table in Redshift to receive all incoming data, then distribute it to multiple tables via scheduled queries. This approach offers flexibility, lower cost, and reduced complexity.
- Buffering Delays: Firehose buffers data until the buffer reaches 1 MiB or 60 seconds have elapsed, whichever comes first, so data arrives in Redshift with a slight delay.
- Error Reporting: Firehose's error logs are inconsistent, making it difficult to diagnose transfer failures.
- Dependency on Scheduled Queries: Data in the staging table must be processed with periodic queries, which introduces additional latency.
- Security and Network Requirements: Proper configuration of IAM roles, VPCs, and security groups is essential to allow Firehose access to Redshift.
- Public Endpoint Security Risk: Using a publicly accessible Redshift endpoint can reduce costs but may increase security risks.
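The staging table approach mentioned above can be sketched as a small generator for the SQL a scheduled query would run. Everything named here is hypothetical: the `event_type` and `received_at` columns, the table names, and the assumption that each destination table has the same columns as the staging table. The cutoff timestamp keeps rows that Firehose loads while the job is running from being deleted before they are distributed.

```python
from __future__ import annotations


def distribution_statements(
    staging_table: str,
    routes: dict[str, str],
    cutoff: str,
) -> list[str]:
    """Build SQL that moves staged rows into per-event-type tables.

    routes maps an event type to its destination table. Table names and
    event types are interpolated directly, so they must come from trusted
    configuration rather than user input.
    """
    window = f"received_at <= '{cutoff}'"
    statements = ["BEGIN;"]
    for event_type, target in sorted(routes.items()):
        statements.append(
            f"INSERT INTO {target} SELECT * FROM {staging_table} "
            f"WHERE event_type = '{event_type}' AND {window};"
        )
    # Clear only the rows covered by this run's cutoff.
    statements.append(f"DELETE FROM {staging_table} WHERE {window};")
    statements.append("COMMIT;")
    return statements


if __name__ == "__main__":
    for sql in distribution_statements(
        "events_staging",
        {"page": "pages", "track": "tracks"},
        "2024-01-01 00:00:00",
    ):
        print(sql)
```

Note that the final `DELETE` also removes rows whose event type has no route, so a real job would either route every type or archive unmatched rows first.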
Getting Started