Overview

AWS Glue is a fully managed, serverless ETL (extract, transform, and load) service for preparing and processing data for analytics and machine learning workloads. Launched in 2017, it is part of the broader Amazon Web Services ecosystem, which began in 2006. Glue simplifies data integration by providing a centralized metadata repository known as the AWS Glue Data Catalog, automated ETL job generation, and flexible execution environments.

The service is well-suited for organizations building data lakes on Amazon S3, as it can crawl various data stores, discover schemas, and populate the Data Catalog with metadata. This catalog then serves as a unified view across disparate data sources, enabling other AWS services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data efficiently. Glue supports a range of data sources, including Amazon S3, Amazon RDS, Amazon Redshift, and various JDBC-compliant data stores, allowing for comprehensive data integration scenarios.
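As a rough sketch of the crawl step described above, the following Python shows how a crawler over an S3 prefix might be registered and started through the Glue API. It assumes a boto3 Glue client is passed in by the caller; the crawler name, role ARN, database name, and S3 path in the usage comment are hypothetical placeholders, not values from this article.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateCrawler call for one S3 target."""
    return {
        "Name": name,
        "Role": role_arn,            # IAM role the crawler assumes (placeholder)
        "DatabaseName": database,    # Data Catalog database that receives the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(glue_client: Any, request: Dict[str, Any]) -> None:
    """Register the crawler, then start a run that populates the Data Catalog."""
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical usage (requires AWS credentials and the boto3 package):
#   import boto3
#   glue = boto3.client("glue")
#   request = build_crawler_request(
#       "raw-logs-crawler",
#       "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "example_db",
#       "s3://your-source-bucket/raw/",
#   )
#   create_and_start_crawler(glue, request)
```

Passing the client in as an argument keeps the request-building logic separate from AWS access, which makes the sketch easy to try against a stub before pointing it at a real account.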

Developers and data engineers use AWS Glue to create and manage ETL pipelines programmatically or through visual tools like AWS Glue Studio. It supports custom Python and Scala scripts on Apache Spark, and Python scripts on Ray, for distributed processing. This flexibility enables users to perform complex data transformations, enrich datasets, and move data between different storage systems. For instance, a common use case involves transforming raw log files stored in S3 into a columnar format like Parquet for optimized analytical queries.

AWS Glue also includes specialized components like AWS Glue DataBrew for visual data preparation by data analysts and business users, and AWS Glue Data Quality for monitoring and improving data reliability. These features extend Glue's utility beyond traditional ETL, addressing data governance and accessibility. Because Glue is serverless, users do not need to provision or manage underlying infrastructure; AWS automatically scales resources based on job requirements. Charges are based primarily on how long compute resources are used, measured in Data Processing Unit hours (DPU-hours), as detailed on the AWS Glue pricing page.

Key features

  • AWS Glue Data Catalog: A centralized metadata repository that stores schema, table definitions, and location of data from various sources, making it discoverable and queryable by other AWS services.
  • Serverless ETL Engine: Executes Spark and Ray jobs without requiring users to provision or manage servers, automatically scaling resources based on workload demands.
  • AWS Glue Studio: A visual interface for creating, running, and monitoring ETL jobs, allowing users to design data flows with drag-and-drop functionality.
  • AWS Glue DataBrew: A visual data preparation tool that enables data analysts and data scientists to clean, normalize, and transform data without writing code.
  • AWS Glue Data Quality: Provides tools to define, monitor, and manage data quality rules, automatically detecting and alerting on data quality issues within ETL pipelines.
  • Crawler: Automatically discovers schema and partitions from data sources and populates the AWS Glue Data Catalog, supporting a wide range of data formats and databases.
  • Job Bookmarks: Helps process new data by tracking previously processed data, allowing jobs to resume from the last known good state and avoid reprocessing.
  • Developer Endpoints: Provide a development environment for testing and debugging Glue scripts interactively using notebooks or preferred IDEs.
  • Integration with AWS Services: Seamlessly integrates with Amazon S3, Amazon Redshift, Amazon Athena, Amazon EMR, AWS Lambda, and other services for end-to-end data pipelines.
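To make the Data Catalog item above more concrete, here is a minimal sketch of reading a table's schema, partition keys, and storage location back out of the catalog. It assumes a boto3 Glue client is supplied by the caller and uses the GetTable call; the database and table names in the usage comment are hypothetical.

```python
from typing import Any, Dict


def summarize_catalog_table(glue_client: Any, database: str, table: str) -> Dict[str, Any]:
    """Return the location, column types, and partition keys of a catalog table."""
    response = glue_client.get_table(DatabaseName=database, Name=table)
    table_def = response["Table"]
    storage = table_def.get("StorageDescriptor", {})
    return {
        "location": storage.get("Location"),
        "columns": {c["Name"]: c.get("Type") for c in storage.get("Columns", [])},
        "partition_keys": [k["Name"] for k in table_def.get("PartitionKeys", [])],
    }


# Hypothetical usage (requires AWS credentials and the boto3 package):
#   import boto3
#   summary = summarize_catalog_table(boto3.client("glue"), "example_db", "raw_logs")
#   print(summary["location"], summary["columns"])
```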

Pricing

AWS Glue pricing follows a pay-as-you-go model, with costs that depend on the Glue component used. Core ETL jobs are billed per Data Processing Unit hour (DPU-hour), with different DPU configurations available for various job types (e.g., standard, G.1X, G.2X). Interactive sessions and development endpoints are also billed per DPU-hour. The AWS Glue Data Catalog charges for the number of objects stored and the number of requests made. AWS Glue DataBrew and Glue Data Quality have separate pricing structures based on recipe execution time and data scanned, respectively.

AWS Glue Pricing Overview (as of 2026-05-07). Source: AWS Glue Pricing

Component               | Pricing metric  | Details
ETL Jobs (Spark/Ray)    | DPU-hour        | Billed per second, with a 1-minute minimum. Varies by DPU type (e.g., Standard, G.1X, G.2X).
Interactive Sessions    | DPU-hour        | Billed per second, with a 1-minute minimum.
Development Endpoints   | DPU-hour        | Billed per second, with a 1-minute minimum.
Data Catalog (Storage)  | Objects stored  | First 1 million objects are free. Charges per 100,000 objects over 1 million per month.
Data Catalog (Requests) | Requests        | First 1 million requests are free. Charges per 1 million requests over 1 million per month.
AWS Glue DataBrew       | Recipe run time | Billed per hour of recipe execution.
AWS Glue Data Quality   | Data scanned    | Billed per GB of data scanned.
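
The DPU-hour arithmetic in the table can be sketched as follows. This is only an illustration of per-second billing with a 1-minute minimum; the rate passed in is an assumed figure for the example, so check the AWS Glue pricing page for the actual rate in your region.

```python
def estimate_dpu_hours(dpus: float, runtime_seconds: float, minimum_seconds: float = 60.0) -> float:
    """DPU-hours for one run: per-second billing with a minimum billable duration."""
    billable_seconds = max(runtime_seconds, minimum_seconds)
    return dpus * billable_seconds / 3600.0


def estimate_job_cost(dpus: float, runtime_seconds: float, price_per_dpu_hour: float) -> float:
    """Estimated charge for one job run at the given (assumed) DPU-hour rate."""
    return estimate_dpu_hours(dpus, runtime_seconds) * price_per_dpu_hour


# Illustrative only: 10 DPUs for 15 minutes is 2.5 DPU-hours.
# 0.44 is an assumed USD rate per DPU-hour, not a quoted price.
example_cost = estimate_job_cost(dpus=10, runtime_seconds=15 * 60, price_per_dpu_hour=0.44)
```

A run shorter than one minute is rounded up to the minimum, so a 30-second job on 2 DPUs is billed as 2 DPUs for 60 seconds.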

Common integrations

Alternatives

  • Databricks: Offers a unified data and AI platform built on Apache Spark, providing advanced analytics, machine learning, and data engineering capabilities.
  • Google Cloud Dataflow: A fully managed service for executing Apache Beam pipelines, designed for both batch and stream data processing on Google Cloud (see the Google Cloud Dataflow documentation).
  • Azure Data Factory: A cloud-based ETL and data integration service that allows users to create data-driven workflows for orchestrating data movement and transformation at scale (see the Azure Data Factory documentation).
  • Apache NiFi: An open-source system for automating the flow of data between software systems, offering a web-based UI for designing data flows.

Getting started

To begin using AWS Glue, you typically define a crawler to discover your data's schema and then create an ETL job to transform and load the data. Here's a basic Python script for an AWS Glue ETL job that reads data from an S3 bucket, transforms it (e.g., adding a timestamp), and writes it back to S3. This script assumes you have an S3 bucket named your-source-bucket with data in a folder named raw/ and an output bucket named your-target-bucket.

Before running this, ensure your Glue job's IAM role has permissions for S3 read/write operations and Glue service access. This example is a simplified representation; in practice, Glue jobs are often configured via the AWS Glue console or AWS Glue API.


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql.functions import current_timestamp
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job

## @params --JOB_NAME
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Define your S3 source path
source_s3_path = "s3://your-source-bucket/raw/"

# Read data from S3 using Glue's dynamic frame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": [source_s3_path],
        "recurse": True
    },
    format="json", # or "csv", "parquet", etc.
    transformation_ctx="datasource0"
)

# Convert to a Spark DataFrame for transformations
df = datasource0.toDF()

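# Example transformation: add a processing timestamp, then derive the
# date column that the write step uses as its partition key
df_transformed = df.withColumn("processed_timestamp", current_timestamp())
df_transformed = df_transformed.withColumn(
    "processed_timestamp_date", df_transformed["processed_timestamp"].cast("date")
)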

# Convert back to a DynamicFrame for writing with Glue
datasink0 = DynamicFrame.fromDF(df_transformed, glueContext, "datasink0")

# Define your S3 target path
target_s3_path = "s3://your-target-bucket/processed/"

# Write transformed data to S3
glueContext.write_dynamic_frame.from_options(
    frame=datasink0,
    connection_type="s3",
    connection_options={
        "path": target_s3_path,
        "partitionKeys": ["processed_timestamp_date"] # Example partition key
    },
    format="parquet", # or "json", "csv", etc.
    transformation_ctx="datasink0"
)

job.commit()
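
Since Glue jobs are often configured through the AWS Glue API, the following is a hedged sketch of registering the script above as a job and starting a run. It assumes a boto3 Glue client is passed in and that the script has already been uploaded to S3; the job name, role ARN, script location, Glue version, and worker settings are illustrative assumptions. The job bookmark argument is included because bookmarks rely on the transformation_ctx values set in the script.

```python
from typing import Any, Dict


def build_job_definition(name: str, role_arn: str, script_location: str) -> Dict[str, Any]:
    """Build keyword arguments for the Glue CreateJob call for a Spark ETL script."""
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",                  # Spark ETL job type
            "ScriptLocation": script_location,  # S3 path of the uploaded script
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",      # assumed version for this sketch
        "WorkerType": "G.1X",
        "NumberOfWorkers": 2,
        "DefaultArguments": {"--job-bookmark-option": "job-bookmark-enable"},
    }


def create_and_run_job(glue_client: Any, definition: Dict[str, Any]) -> str:
    """Create the job, start one run, and return the run identifier."""
    glue_client.create_job(**definition)
    response = glue_client.start_job_run(JobName=definition["Name"])
    return response["JobRunId"]


# Hypothetical usage (requires AWS credentials and the boto3 package):
#   import boto3
#   definition = build_job_definition(
#       "example-etl-job",
#       "arn:aws:iam::123456789012:role/ExampleGlueRole",
#       "s3://your-source-bucket/scripts/etl_job.py",
#   )
#   run_id = create_and_run_job(boto3.client("glue"), definition)
```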