
Must-Know AWS Glue Interview Questions and Answers

Nasrul Hasan

What is an AWS Glue Crawler?

A Glue crawler is simply a service that scans your data source—mostly S3 in data lake setups—and automatically figures out the schema and creates tables inside the Glue Data Catalog. It can detect new partitions and even update the schema when new columns appear in files. Crawlers are most commonly used on raw S3 data where schema or file structure may change over time. Although crawlers can handle new columns well, they don’t handle type changes or removing old columns. For strict schema management you normally switch to formats like Iceberg or enforce schema in ETL instead of relying fully on crawlers.
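
For reference, here is a minimal sketch of defining and starting a crawler with boto3; the crawler name, role ARN, database, and S3 path are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-raw-crawler",
    Role="arn:aws:iam::123456789012:role/GlueCrawlerRole",
    DatabaseName="raw_db",
    Targets={"S3Targets": [{"Path": "s3://lake/raw/orders/"}]},
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",  # add newly detected columns to the table
        "DeleteBehavior": "LOG"                  # never drop columns automatically
    }
)

# Run on demand; in practice you would usually attach a schedule instead.
glue.start_crawler(Name="orders-raw-crawler")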

Follow up questions:

· How does Glue handle schema drift? → Crawler adds new columns but does not remove or change types.

· Why not run crawler after every file? → Expensive, unnecessary; run daily/hourly unless schema changes very frequently.

· When to avoid crawlers? → When schema is static or when you maintain schema manually in ETL.

What is the Glue Data Catalog?

The Glue Data Catalog acts as the metadata store for your entire data lake. Whenever you create tables, either manually or through crawlers, the catalog stores schema, partitions, table type, and file location. Tools like Athena, Redshift Spectrum, EMR, and Glue ETL all rely on this catalog to know how to read data. The catalog doesn't store actual data; it only stores metadata. Best practice is to organize it into databases like raw, processed, and curated. For example, a Glue ETL job reads a catalog table by database and table name:

source_dyf = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="orders_raw"
)

Describe ETL jobs in Glue.

A Glue ETL job is simply a Spark job managed by AWS. You write code in Python (PySpark plus the Glue libraries) and Glue handles cluster provisioning. Typically, the job reads data from S3 or a database, applies transformations, and writes back to S3 in another layer such as processed or curated. Glue uses DynamicFrames by default because they handle schema evolution more flexibly than DataFrames. Inside the script you follow a structured pattern: initialize job → read → transform → write → commit.

from pyspark.context import SparkContext
from pyspark.sql.functions import current_date
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

sc = SparkContext()
glueContext = GlueContext(sc)
job = Job(glueContext)

job.init("etl-job")

source = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="customer_raw"
)

# Convert to a Spark DataFrame to add an ingest date column
df = source.toDF()
df = df.withColumn("ingest_date", current_date())

# Convert back to a DynamicFrame and write Parquet to the processed layer
glueContext.write_dynamic_frame.from_options(
    frame=DynamicFrame.fromDF(df, glueContext, "processed"),
    connection_type="s3",
    format="parquet",
    connection_options={"path": "s3://lake/processed/customers/"}
)

job.commit()

Converting between a DynamicFrame (dyf) and a DataFrame (df)

# DynamicFrame -> DataFrame
df = dyf.toDF()

# DataFrame -> DynamicFrame (reuse the existing GlueContext instead of creating a new one)
dyf = DynamicFrame.fromDF(df, glueContext, "My_dynamic_frame")

What are connections in AWS Glue?

Glue connections help Glue jobs communicate with external systems like Redshift, RDS, Snowflake, or any JDBC-compliant database. The connection defines network details like VPC, subnet, JDBC URL, and security groups so the Glue job can reach the database. Once the connection is created, you refer to it inside your ETL script to read or write tables.

redshift_data = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options={
        "useConnectionProperties": "true",  # pull JDBC URL/credentials from the Glue connection
        "connectionName": "redshift_conn",
        "dbtable": "public.sales",
        "database": "analytics",
        "redshiftTmpDir": "s3://lake/tmp/redshift/",  # S3 staging dir Glue uses for Redshift reads
        "aws_iam_role": "arn:aws:iam::123:role/redshiftCopyRole"
    }
)
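
A connection itself is just stored metadata. Here is a minimal sketch of creating a JDBC connection with boto3; the URL, credentials, subnet, security group, and availability zone are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

glue.create_connection(
    ConnectionInput={
        "Name": "redshift_conn",
        "ConnectionType": "JDBC",
        "ConnectionProperties": {
            "JDBC_CONNECTION_URL": "jdbc:redshift://my-cluster:5439/analytics",
            "USERNAME": "etl_user",
            "PASSWORD": "********"
        },
        "PhysicalConnectionRequirements": {
            "SubnetId": "subnet-0abc1234",
            "SecurityGroupIdList": ["sg-0abc1234"],
            "AvailabilityZone": "us-east-1a"
        }
    }
)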

How do we implement incremental load in AWS Glue?

Glue bookmarks are a built-in feature that let a Glue ETL job remember what data it has already processed. This is extremely useful when doing incremental loads because Glue automatically keeps track of “last processed files” or “last processed rows” depending on the source. For example, when reading from S3, Glue records which files were read in the last job run and will ignore them on the next run. When reading from Redshift or any JDBC source, Glue can track a column (like updated_at) so that only rows newer than the previous run are processed. Bookmarks are stored internally inside Glue and don’t require you to build any custom watermark logic unless you want more control.
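
As a rough sketch (database, table, and context names are placeholders), bookmarks only take effect when the job runs with --job-bookmark-option set to job-bookmark-enable and each source has a transformation_ctx:

# The transformation_ctx is the key Glue uses to remember what this source already read.
orders = glueContext.create_dynamic_frame.from_catalog(
    database="raw_db",
    table_name="orders_raw",
    transformation_ctx="orders_src"
)

# job.commit() at the end of the script is what persists the bookmark state.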

Follow up questions:

· What happens if I delete and recreate the job? → Bookmarks reset.

· How do I force a reread of all data? → Reset the bookmark from the console ("Reset bookmark") or via the ResetJobBookmark API/CLI; alternatively run with job-bookmark-disable.

How to handle changing schemas in AWS Glue

Glue can handle mild schema drift using DynamicFrames; for example, it can detect new columns through crawlers and can convert conflicting types with resolveChoice(). However, Glue is not good at handling destructive or complex schema changes like removing columns or retyping from string to int. That’s why for modern pipelines, formats like Iceberg or Hudi are preferred because they support real schema evolution with ACID properties.

With DynamicFrames, you can use resolveChoice() to handle conflicting or changing column types.

resolved = source.resolveChoice(
    specs=[
        ("amount", "cast:double"),
        ("order_date", "cast:string")
    ]
)

Incremental load across databases and S3

Incremental load means processing only new or changed data in each ETL run instead of reprocessing the entire dataset. In Glue, you can achieve this in several ways. The simplest is to rely on Glue bookmarks, which automatically track processed files or rows. Another approach is a timestamp (watermark) column in databases like Redshift, where the query filters only rows with updated_at > last_processed_time. For S3, teams sometimes maintain their own watermark file to store the last processed timestamp for full control (a sketch of this approach follows the query example below). Incremental loading matters in production pipelines because it reduces compute cost and avoids duplicate data in S3.

query = "(SELECT * FROM orders WHERE updated_at > '{{ last_watermark }}') as t"
dyf = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_name="redshift_conn",
    connection_options={"query": query}
)
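
If you manage the watermark yourself, a small helper around S3 is usually enough. This is only a sketch; the bucket, key, and default timestamp are hypothetical.

import boto3

s3 = boto3.client("s3")

WATERMARK_BUCKET = "lake-config"
WATERMARK_KEY = "watermarks/orders_last_processed.txt"

def read_watermark(default="1970-01-01 00:00:00"):
    # Return the last processed timestamp, falling back to a default on the first run.
    try:
        obj = s3.get_object(Bucket=WATERMARK_BUCKET, Key=WATERMARK_KEY)
        return obj["Body"].read().decode("utf-8").strip()
    except s3.exceptions.NoSuchKey:
        return default

def write_watermark(value):
    # Persist the new watermark only after the run has succeeded.
    s3.put_object(Bucket=WATERMARK_BUCKET, Key=WATERMARK_KEY, Body=value.encode("utf-8"))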

How do we trigger a Glue job as soon as a file arrives in S3?

This pattern is used when you want event-driven, near-real-time ETL. When a new file lands in S3, an S3 event notification invokes a Lambda function. The Lambda extracts the bucket and file key from the event and starts a Glue job run. This approach is extremely common in ingestion pipelines where each incoming file needs to be processed individually. You can also pass file-specific arguments to Glue so the job knows exactly what to process.

import boto3
from urllib.parse import unquote_plus

glue = boto3.client("glue")


def handler(event, context):
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    # S3 event keys are URL-encoded (e.g. spaces become '+'), so decode before use
    key = unquote_plus(record["object"]["key"])

    glue.start_job_run(
        JobName="file-etl",
        Arguments={
            "--bucket": bucket,
            "--key": key
        }
    )
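
On the Glue side, the job reads the arguments the Lambda passed with getResolvedOptions (the argument names mirror the Lambda snippet above):

import sys
from awsglue.utils import getResolvedOptions

# Retrieve --bucket and --key passed via start_job_run.
args = getResolvedOptions(sys.argv, ["bucket", "key"])
input_path = f"s3://{args['bucket']}/{args['key']}"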

Follow up questions:

· Does Lambda check job success? → No, it only triggers. Monitoring is done via EventBridge.

· Why not let S3 trigger Glue directly? → S3 event notifications cannot target Glue directly; you need Lambda (or EventBridge) in between.

· Can Lambda trigger Step Functions instead? → Yes, for complex workflows.

Once a job is triggered, how do you monitor the Glue job run?

When a Glue job finishes—whether it succeeds, fails, or times out—Glue automatically emits events to EventBridge. You can create rules in EventBridge to filter for specific job states. Most commonly, teams configure a rule that triggers an SNS notification when a job fails or succeeds. This is essential for production monitoring because it sends alerts without adding logic into Lambda or Glue itself.

{
  "source": ["aws.glue"],
  "detail-type": ["Glue Job State Change"],
  "detail": {
    "jobName": ["file-etl"],
    "state": ["SUCCEEDED"]
  }
}

SNS target setup: EventBridge → Add target → SNS topic → Subscribers (email, Slack webhook, etc.)
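
The same rule and target can also be created programmatically. A minimal sketch with boto3 follows; the rule name and SNS topic ARN are hypothetical placeholders.

import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="glue-file-etl-state-change",
    EventPattern=json.dumps({
        "source": ["aws.glue"],
        "detail-type": ["Glue Job State Change"],
        "detail": {"jobName": ["file-etl"], "state": ["SUCCEEDED", "FAILED"]}
    })
)

# The SNS topic's access policy must allow events.amazonaws.com to publish to it.
events.put_targets(
    Rule="glue-file-etl-state-change",
    Targets=[{"Id": "sns-alert", "Arn": "arn:aws:sns:us-east-1:123456789012:glue-alerts"}]
)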

Follow up questions:

· Can EventBridge notify on job failure? → Yes, just change "state": ["FAILED"].

· How fast are notifications? → Usually 1–5 seconds.

· Can EventBridge trigger another Glue job? → Yes, you can chain dependent jobs.

That’s all for today

Hit follow if you're preparing for Data Engineer interviews.