
Efficiently Load Millions of Rows Daily from S3 into AWS Redshift

Nasrul Hasan

A few months ago I built a pipeline for a logistics analytics team that collects package events—delivery scans, route status updates, warehouse entries, etc. The events come from 11 distributed warehouses across India, aggregating to ~40M records/day.

This system is now running in production, costing under ₹20K/month (about $250–300) and scaling well.

This is how it works.

High-Level Requirements

Warehouse          Avg Records/Day
Chennai            4M
Pune               6M
Delhi              5M
Bangalore          9M
Misc others (7)    16M

Total ≈ 40–45 million/day

Query Use Cases

  • Number of packages delivered per state

  • Delivery lag between warehouse entry → successful delivery (sample query sketched below)

  • Per-user delivery throughput

  • Route delays based on timestamp ranges
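
To make the delivery-lag use case concrete, here is a minimal PySpark sketch. It assumes the event columns introduced later in the post and illustrative status values ("WAREHOUSE_ENTRY", "DELIVERED"); the real encoding may differ.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: per-package lag between warehouse entry and successful delivery.
# The status values below are assumptions, not necessarily the production codes.
events = spark.read.parquet("s3://logistics/events/processed/")

delivery_lag = (
    events.groupBy("package_id")
    .agg(
        F.min(F.when(F.col("status") == "WAREHOUSE_ENTRY", F.col("event_ts"))).alias("entered_at"),
        F.max(F.when(F.col("status") == "DELIVERED", F.col("event_ts"))).alias("delivered_at"),
    )
    .withColumn(
        "delivery_lag_hours",
        (F.col("delivered_at").cast("long") - F.col("entered_at").cast("long")) / 3600.0,
    )
)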

SLA

Data must be available in Redshift every morning before 7:30 AM.

Final Architecture

Warehouse Systems → Kafka → S3 Raw Zone (/dt=YYYY-MM-DD)

Glue Streaming Job  
        ↓
S3 Processed Zone (Parquet, partitioned)

Glue Batch ETL (Spark)  
        ↓
Redshift Staging  
        ↓
Redshift MERGE into Fact Tables

Why Glue + Spark?

Because:

✔ Data volume increases month-on-month
✔ Schema evolves weekly
✔ Redshift COPY is not cost-efficient or flexible for streaming
✔ We needed parallel transformations
✔ We needed schema enforcement, dedup, and watermarking

Yes, COPY is good, but only if your files are clean, partitioned, and validated.

Our raw data was not.
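
Watermarking and dedup, for example, are only a couple of lines in Spark Structured Streaming. A minimal sketch, not the production job; the 2-hour bound and the trimmed schema are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: schema-enforced streaming read, watermark, then dedup on event_id.
# The watermark bounds the dedup state so very late duplicates don't grow it forever;
# the 2-hour bound is an assumed value.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", StringType()),
])

deduped = (
    spark.readStream.schema(schema).json("s3://logistics/events/raw/")
    .withColumn("event_ts", to_timestamp(col("event_ts")))
    .withWatermark("event_ts", "2 hours")
    .dropDuplicates(["event_id", "event_ts"])
)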

Actual S3 Layout

s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000530.json
s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000531.json
...
s3://logistics/events/processed/dt=2025-12-01/hh=08/file-00212.snappy.parquet

Partitioning by date + hour was critical: queries run mostly daily/hourly aggregations.
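
As a rough illustration of why that matters (an assumed ad-hoc query, not production code), an hourly rollup only has to scan one partition's files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: the dt/hh filter maps onto the partition layout, so Spark lists
# and scans a single hour's Parquet files instead of the full history.
hourly_counts = (
    spark.read.parquet("s3://logistics/events/processed/")
    .where("dt = '2025-12-01' AND hh = 8")
    .groupBy("warehouse", "status")
    .count()
)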

Glue Streaming Job (this actually runs)

The Glue streaming job runs on a 5-minute trigger, reading the Kafka topic output dumped into S3.

Key Logic:

  • Convert JSON to structured DataFrame

  • Validate schema

  • Drop duplicates using event_id

  • Write Parquet by partition

Job snippet:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import col, to_timestamp, hour
from pyspark.sql.types import StructType, StructField, StringType

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Streaming JSON sources need an explicit schema; event_ts arrives as a string
# and is parsed below. payload is kept as a raw string here.
raw_schema = StructType([
    StructField("event_id", StringType()),
    StructField("package_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", StringType()),
    StructField("warehouse", StringType()),
    StructField("payload", StringType()),
])

df = spark.readStream.format("json") \
        .schema(raw_schema) \
        .load("s3://logistics/events/raw/")

cleanDF = df.select(
    col("event_id"),
    col("package_id"),
    col("status"),
    to_timestamp(col("event_ts")).alias("event_ts"),
    col("warehouse"),
    col("payload")
).dropDuplicates(["event_id"])  # drop retry duplicates on event_id

finalDF = cleanDF \
    .withColumn("dt", col("event_ts").cast("date")) \
    .withColumn("hh", hour(col("event_ts")))

# Write each 5-minute micro-batch as partitioned Parquet into the processed zone.
# The checkpoint path is illustrative.
query = finalDF.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .partitionBy("dt", "hh") \
    .option("path", "s3://logistics/events/processed/") \
    .option("checkpointLocation", "s3://logistics/checkpoints/events-streaming/") \
    .trigger(processingTime="5 minutes") \
    .start()

query.awaitTermination()

This single decision (partitioning + parquet) reduced Redshift load time by 65%.

Batch Job for Redshift Load

Every day at 6:40 AM, Glue triggers a Spark job:

Tasks:

  1. Read yesterday’s partition

  2. Create surrogate keys

  3. Compute delivery SLA metrics

  4. Insert into Redshift staging

  5. Run stored procedure for merge

Spark Code Excerpt:

# Read only yesterday's partition from the processed zone.
processedDF = spark.read.parquet(
    "s3://logistics/events/processed/dt=2025-12-01/"
)

# redshift_jdbc and iam_role come from job parameters; tempdir is the S3 staging
# path the connector requires (the exact path here is illustrative).
processedDF.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", redshift_jdbc) \
    .option("dbtable", "staging_delivery_events") \
    .option("tempdir", "s3://logistics/tmp/redshift/") \
    .option("aws_iam_role", iam_role) \
    .mode("overwrite") \
    .save()

This loads ~45M rows into staging in ~6 minutes. A plain COPY from the raw files couldn't match this because of the dedupe and schema-validation logic.
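
The excerpt above covers steps 1 and 4. A minimal sketch of step 2, assuming an md5-over-natural-key scheme (not necessarily the production logic); step 3 follows the same groupBy pattern as the delivery-lag sketch earlier:

from pyspark.sql import functions as F

# Hedged sketch of step 2: a deterministic surrogate key per event.
# The md5(natural key) choice is an assumption, not the production implementation.
keyedDF = processedDF.withColumn(
    "event_sk",
    F.md5(F.concat_ws("|", F.col("event_id"), F.col("event_ts").cast("string"))),
)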

Redshift Merge Logic

We did NOT do a naive delete-then-insert.

We used Redshift MERGE, which is highly efficient on RA3 clusters.

MERGE INTO fact_delivery_events
USING staging_delivery_events AS source
ON fact_delivery_events.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET
    status = source.status,
    event_ts = source.event_ts,
    warehouse = source.warehouse
WHEN NOT MATCHED THEN INSERT (
    event_id, package_id, status, warehouse, event_ts
)
VALUES (
    source.event_id,
    source.package_id,
    source.status,
    source.warehouse,
    source.event_ts
);

Runtime: ~8 minutes. The previous naive batch INSERT approach took ~25 minutes.
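
Step 5 of the batch job (run the merge stored procedure) can be driven from Python once the staging load finishes. A hedged sketch using the Redshift Data API; the cluster, database, secret, and procedure names are placeholders:

import boto3

# Hedged sketch: call the merge procedure after the staging write completes.
# All identifiers below are placeholders, not real values from this pipeline.
redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="logistics-ra3",
    Database="analytics",
    SecretArn="arn:aws:secretsmanager:ap-south-1:123456789012:secret:redshift-creds",
    Sql="CALL merge_delivery_events();",
)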

Real Problems We Faced

Problem #1

Late-arriving events (sometimes 2 days old)

Fix: Glue triggers an incremental backfill job → partition-scoped reread → efficient rerun
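
A hedged sketch of that rerun trigger; the job name, argument key, and date are placeholders:

import boto3

# Hedged sketch: re-run the daily batch load for an older partition.
# JobName and the argument key are placeholders, not the real configuration.
glue = boto3.client("glue")

glue.start_job_run(
    JobName="logistics-redshift-batch-load",
    Arguments={"--partition_date": "2025-11-29"},
)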

Problem #2

Same event_id arriving twice due to upstream retries

Fix: .dropDuplicates(["event_id"]) in the Spark streaming job (shown above).

Problem #3

Redshift VACUUM used to run for 4+ hours

Fix:

-- Redshift auto vacuum already runs in the background; there is no per-table
-- "autovacuum" property. One valid way to get the intended behaviour is automatic
-- table optimization, which lets Redshift manage the sort and distribution keys:
ALTER TABLE fact_delivery_events ALTER SORTKEY AUTO;
ALTER TABLE fact_delivery_events ALTER DISTSTYLE AUTO;

Final Performance

Stage                                   Time
Streaming ingestion + partitioning      Continuous
Glue Batch Transform (~45M rows)        ~7–8 min
Redshift Write to Staging               ~5–6 min
Merge to Fact Table                     ~8 min
Total SLA Achieved                      ~20–25 min

And completely scalable:

  • New warehouses? → Just add new partitions.

  • Higher daily volume? → Spark parallelizes automatically.

Monthly AWS Cost (real bill numbers - 2025)

Service            Cost
Glue Streaming     ₹11,200 (~$140)
Glue Batch         ₹3,800 (~$45)
Redshift RA3 x2    ₹4,800 (~$60, reserved instance)
S3 Storage         ₹1,200 (~$14)
TOTAL              ~₹21,000 ($250–260)

That is all for today. Thanks for reading. If you want to learn how I structured the project and handled scheduling and deployment, please follow and check out my next blog.