
Efficiently Load Millions of Rows Daily from S3 into AWS Redshift

Nasrul Hasan

A few months ago I built a pipeline for a logistics analytics team that collects package events—delivery scans, route status updates, warehouse entries, etc. The events come from 11 distributed warehouses across India, aggregating to ~40M records/day.

This system is now running in production, costing under ₹20K/month (about $250–300) and scaling well.

This is how it works.

High-Level Requirements

Warehouse          Avg Records/Day
Chennai            4M
Pune               6M
Delhi              5M
Bangalore          9M
Misc others (7)    16M

Total ≈ 40–45 million/day

Query Use Cases

  • Number of packages delivered per state

  • Delivery lag between warehouse entry → successful delivery (sample query sketched below)

  • Per-user delivery throughput

  • Route delays based on timestamp ranges
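
To make the delivery-lag use case concrete, here is a minimal PySpark sketch. It assumes the event columns introduced later in the post and illustrative status values ("WAREHOUSE_ENTRY", "DELIVERED"); the real encoding may differ.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: per-package lag between warehouse entry and successful delivery.
# The status values below are assumptions, not necessarily the production codes.
events = spark.read.parquet("s3://logistics/events/processed/")

delivery_lag = (
    events.groupBy("package_id")
    .agg(
        F.min(F.when(F.col("status") == "WAREHOUSE_ENTRY", F.col("event_ts"))).alias("entered_at"),
        F.max(F.when(F.col("status") == "DELIVERED", F.col("event_ts"))).alias("delivered_at"),
    )
    .withColumn(
        "delivery_lag_hours",
        (F.col("delivered_at").cast("long") - F.col("entered_at").cast("long")) / 3600.0,
    )
)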

SLA

Data must be available in Redshift every morning before 7:30 AM.

Final Architecture

Warehouse Systems → Kafka → S3 Raw Zone (/dt=YYYY-MM-DD)

Glue Streaming Job  
        ↓
S3 Processed Zone (Parquet, partitioned)

Glue Batch ETL (Spark)  
        ↓
Redshift Staging  
        ↓
Redshift MERGE into Fact Tables

Why Glue + Spark?

Because:

✔ Data volume increases month-on-month
✔ Schema evolves weekly
✔ Redshift COPY is not cost-efficient or flexible for streaming
✔ We needed parallel transformations
✔ We needed schema enforcement, dedup, and watermarking

Yes, COPY is good, but only if your files are clean, partitioned, and validated.

Our raw data was not.
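
Watermarking and dedup, for example, are only a couple of lines in Spark Structured Streaming. A minimal sketch, not the production job; the 2-hour bound and the trimmed schema are assumptions:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: schema-enforced streaming read, watermark, then dedup on event_id.
# The watermark bounds the dedup state so very late duplicates don't grow it forever;
# the 2-hour bound is an assumed value.
schema = StructType([
    StructField("event_id", StringType()),
    StructField("event_ts", StringType()),
])

deduped = (
    spark.readStream.schema(schema).json("s3://logistics/events/raw/")
    .withColumn("event_ts", to_timestamp(col("event_ts")))
    .withWatermark("event_ts", "2 hours")
    .dropDuplicates(["event_id", "event_ts"])
)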

Actual S3 Layout

s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000530.json
s3://logistics/events/raw/dt=2025-12-01/hour=08/batch-000531.json
...
s3://logistics/events/processed/dt=2025-12-01/hh=08/file-00212.snappy.parquet

Partitioning by date + hour was critical: queries run mostly daily/hourly aggregations.
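
As a rough illustration of why that matters (an assumed ad-hoc query, not production code), an hourly rollup only has to scan one partition's files:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hedged sketch: the dt/hh filter maps onto the partition layout, so Spark lists
# and scans a single hour's Parquet files instead of the full history.
hourly_counts = (
    spark.read.parquet("s3://logistics/events/processed/")
    .where("dt = '2025-12-01' AND hh = 8")
    .groupBy("warehouse", "status")
    .count()
)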

Glue Streaming Job (this actually runs)

The Glue streaming job runs on a 5-minute trigger, reading the Kafka topic output dumped into S3.

Key Logic:

  • Convert JSON to structured DataFrame

  • Validate schema

  • Drop duplicates using event_id

  • Write Parquet by partition

Job snippet:

from pyspark.context import SparkContext
from awsglue.context import GlueContext
from pyspark.sql.functions import col, to_timestamp, hour
from pyspark.sql.types import StructType, StructField, StringType

glueContext = GlueContext(SparkContext.getOrCreate())
spark = glueContext.spark_session

# Streaming JSON sources need an explicit schema; event_ts arrives as a string
# and is parsed below. payload is kept as a raw string here.
raw_schema = StructType([
    StructField("event_id", StringType()),
    StructField("package_id", StringType()),
    StructField("status", StringType()),
    StructField("event_ts", StringType()),
    StructField("warehouse", StringType()),
    StructField("payload", StringType()),
])

df = spark.readStream.format("json") \
        .schema(raw_schema) \
        .load("s3://logistics/events/raw/")

cleanDF = df.select(
    col("event_id"),
    col("package_id"),
    col("status"),
    to_timestamp(col("event_ts")).alias("event_ts"),
    col("warehouse"),
    col("payload")
).dropDuplicates(["event_id"])  # drop retry duplicates on event_id

finalDF = cleanDF \
    .withColumn("dt", col("event_ts").cast("date")) \
    .withColumn("hh", hour(col("event_ts")))

# Write each 5-minute micro-batch as partitioned Parquet into the processed zone.
# The checkpoint path is illustrative.
query = finalDF.writeStream \
    .format("parquet") \
    .outputMode("append") \
    .partitionBy("dt", "hh") \
    .option("path", "s3://logistics/events/processed/") \
    .option("checkpointLocation", "s3://logistics/checkpoints/events-streaming/") \
    .trigger(processingTime="5 minutes") \
    .start()

query.awaitTermination()

This single decision (partitioning + parquet) reduced Redshift load time by 65%.

Batch Job for Redshift Load

Every day at 6:40 AM, Glue triggers a Spark job:

Tasks:

  1. Read yesterday’s partition

  2. Create surrogate keys

  3. Compute delivery SLA metrics

  4. Insert into Redshift staging

  5. Run stored procedure for merge

Spark Code Excerpt:

# Read only yesterday's partition from the processed zone.
processedDF = spark.read.parquet(
    "s3://logistics/events/processed/dt=2025-12-01/"
)

# redshift_jdbc and iam_role come from job parameters; tempdir is the S3 staging
# path the connector requires (the exact path here is illustrative).
processedDF.write \
    .format("io.github.spark_redshift_community.spark.redshift") \
    .option("url", redshift_jdbc) \
    .option("dbtable", "staging_delivery_events") \
    .option("tempdir", "s3://logistics/tmp/redshift/") \
    .option("aws_iam_role", iam_role) \
    .mode("overwrite") \
    .save()

This loads ~45M rows into staging in ~6 minutes. A plain COPY from the raw files couldn't match this because of the dedupe and schema-validation logic.
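
The excerpt above covers steps 1 and 4. A minimal sketch of step 2, assuming an md5-over-natural-key scheme (not necessarily the production logic); step 3 follows the same groupBy pattern as the delivery-lag sketch earlier:

from pyspark.sql import functions as F

# Hedged sketch of step 2: a deterministic surrogate key per event.
# The md5(natural key) choice is an assumption, not the production implementation.
keyedDF = processedDF.withColumn(
    "event_sk",
    F.md5(F.concat_ws("|", F.col("event_id"), F.col("event_ts").cast("string"))),
)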

Redshift Merge Logic

We did NOT do a naive delete-then-insert.

We used Redshift MERGE, which is highly efficient on RA3 clusters.

MERGE INTO fact_delivery_events
USING staging_delivery_events AS source
ON fact_delivery_events.event_id = source.event_id
WHEN MATCHED THEN UPDATE SET
    status = source.status,
    event_ts = source.event_ts,
    warehouse = source.warehouse
WHEN NOT MATCHED THEN INSERT (
    event_id, package_id, status, warehouse, event_ts
)
VALUES (
    source.event_id,
    source.package_id,
    source.status,
    source.warehouse,
    source.event_ts
);

Runtime: ~8 minutes. The previous naive batch INSERT approach took ~25 minutes.
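
Step 5 of the batch job (run the merge stored procedure) can be driven from Python once the staging load finishes. A hedged sketch using the Redshift Data API; the cluster, database, secret, and procedure names are placeholders:

import boto3

# Hedged sketch: call the merge procedure after the staging write completes.
# All identifiers below are placeholders, not real values from this pipeline.
redshift_data = boto3.client("redshift-data")

redshift_data.execute_statement(
    ClusterIdentifier="logistics-ra3",
    Database="analytics",
    SecretArn="arn:aws:secretsmanager:ap-south-1:123456789012:secret:redshift-creds",
    Sql="CALL merge_delivery_events();",
)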

Real Problems We Faced

Problem #1

Late-arriving events (sometimes 2 days old)

Fix: Glue triggers an incremental backfill job → partition-scoped reread → efficient rerun
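
A hedged sketch of that rerun trigger; the job name, argument key, and date are placeholders:

import boto3

# Hedged sketch: re-run the daily batch load for an older partition.
# JobName and the argument key are placeholders, not the real configuration.
glue = boto3.client("glue")

glue.start_job_run(
    JobName="logistics-redshift-batch-load",
    Arguments={"--partition_date": "2025-11-29"},
)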

Problem #2

Same event_id arriving twice due to upstream retries

Fix: .dropDuplicates(["event_id"]) in the Spark streaming job (shown above).

Problem #3

Redshift VACUUM used to run for 4+ hours

Fix:

-- Redshift auto vacuum already runs in the background; there is no per-table
-- "autovacuum" property. One valid way to get the intended behaviour is automatic
-- table optimization, which lets Redshift manage the sort and distribution keys:
ALTER TABLE fact_delivery_events ALTER SORTKEY AUTO;
ALTER TABLE fact_delivery_events ALTER DISTSTYLE AUTO;

Final Performance

Stage                                   Time
Streaming ingestion + partitioning      Continuous
Glue Batch Transform (~45M rows)        ~7–8 min
Redshift Write to Staging               ~5–6 min
Merge to Fact Table                     ~8 min
Total SLA Achieved                      ~20–25 min

And completely scalable:

  • New warehouses? → Just add new partitions.

  • Higher daily volume? → Spark parallelizes automatically.

Monthly AWS Cost (real bill numbers - 2025)

Service            Cost
Glue Streaming     ₹11,200 (~$140)
Glue Batch         ₹3,800 (~$45)
Redshift RA3 x2    ₹4,800 (~$60, reserved instance)
S3 Storage         ₹1,200 (~$14)
TOTAL              ~₹21,000 ($250–260)

That is all for today. Thanks for reading. If you want to learn how I structured the project and handled scheduling and deployment, please follow and check out my next blog.