AWS Glue is a serverless ETL (extract, transform, and load) service that helps users prepare and process data for analytics and machine learning. It provides a data catalog, automated job generation, and a flexible execution environment.

How does AWS Glue handle data cataloging?

AWS Glue uses crawlers to automatically discover schemas and partitions from various data sources. This metadata is then stored in the AWS Glue Data Catalog, which acts as a central repository for data definitions across your AWS environment.
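
As a rough sketch of the crawler workflow described above, the following uses the boto3 Glue client calls `create_crawler` and `start_crawler` to register and start a crawler over an S3 prefix. The crawler name, IAM role ARN, database name, and S3 path are placeholders chosen for illustration, not values from this page, and actually creating the crawler requires AWS credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path):
    """Build the keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(request, glue_client=None):
    """Create the crawler, then start it. Needs AWS credentials to run."""
    if glue_client is None:
        import boto3  # imported lazily so the builder works without boto3
        glue_client = boto3.client("glue")
    glue_client.create_crawler(**request)
    glue_client.start_crawler(Name=request["Name"])


# Hypothetical names, for illustration only.
request = build_crawler_request(
    name="raw-logs-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="example_db",
    s3_path="s3://your-source-bucket/raw/",
)
```

Calling `create_and_start_crawler(request)` would then submit the request; once the crawler finishes, the discovered tables appear in the named Data Catalog database.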

What programming languages does AWS Glue support for ETL jobs?

AWS Glue supports Python and Scala for writing ETL scripts. Spark jobs can be written in either language and run on Apache Spark for distributed data processing; Ray jobs are written in Python and run on Apache Ray.

Is AWS Glue serverless?

Yes, AWS Glue is a serverless service. Users do not need to provision, manage, or scale any infrastructure. AWS automatically handles resource allocation and scaling based on job requirements.

What is a DPU in AWS Glue pricing?

DPU stands for Data Processing Unit. It is a relative measure of processing capacity that combines compute and memory. AWS Glue charges for ETL jobs based on the number of DPU-hours consumed.
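
To make the DPU-hour arithmetic concrete, here is a small sketch of the calculation. It applies per-second billing with the one-minute minimum listed in the pricing table on this page. The default rate of 0.44 USD per DPU-hour is an assumption for illustration only; actual rates vary by region and job type, so check the AWS Glue pricing page.

```python
def glue_job_cost(dpus, runtime_seconds, rate_per_dpu_hour=0.44, minimum_seconds=60):
    """Estimate the cost of one job run in USD.

    rate_per_dpu_hour is an assumed example rate, not an official figure.
    """
    # Runs shorter than the minimum are billed as if they ran for the minimum.
    billed_seconds = max(runtime_seconds, minimum_seconds)
    dpu_hours = dpus * billed_seconds / 3600
    return dpu_hours * rate_per_dpu_hour


# 10 DPUs for 15 minutes = 2.5 DPU-hours.
print(round(glue_job_cost(dpus=10, runtime_seconds=15 * 60), 2))
```

Under the assumed rate, a 10-DPU job that runs for 15 minutes consumes 2.5 DPU-hours.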

Can AWS Glue process streaming data?

Yes, AWS Glue supports streaming ETL jobs that can consume data from sources like Amazon Kinesis and Apache Kafka, allowing for real-time data processing and analysis.
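
The following is a minimal sketch of what a streaming job body might look like, assuming a Kinesis source that carries JSON records. It relies on `GlueContext.create_data_frame.from_options` and `GlueContext.forEachBatch`, which are available inside the Glue job runtime; the stream ARN, output path, checkpoint path, and 100-second window are placeholder assumptions, and the function expects to be handed an already constructed `GlueContext`.

```python
def kinesis_source_options(stream_arn, starting_position="TRIM_HORIZON"):
    """Connection options for reading JSON records from a Kinesis stream."""
    return {
        "streamARN": stream_arn,
        "startingPosition": starting_position,
        "inferSchema": "true",
        "classification": "json",
    }


def run_streaming_job(glue_context, stream_arn, output_path, checkpoint_path):
    """Read micro-batches from Kinesis and append each one to S3 as Parquet."""
    stream_df = glue_context.create_data_frame.from_options(
        connection_type="kinesis",
        connection_options=kinesis_source_options(stream_arn),
    )

    def process_batch(batch_df, batch_id):
        # Skip empty micro-batches so no empty files are written.
        if batch_df.count() > 0:
            batch_df.write.mode("append").parquet(output_path)

    glue_context.forEachBatch(
        frame=stream_df,
        batch_function=process_batch,
        options={"windowSize": "100 seconds", "checkpointLocation": checkpoint_path},
    )
```

The checkpoint location lets the job resume from where it left off after a restart, so it should point to a stable S3 prefix that is not shared with other jobs.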

What is the difference between AWS Glue and AWS Glue DataBrew?

AWS Glue is a comprehensive ETL service for data engineers. AWS Glue DataBrew is a visual data preparation tool within the Glue family, designed for data analysts and business users to clean and transform data without writing code.

AWS Glue – Serverless ETL and Data Integration Service

Overview

AWS Glue is a fully managed, serverless ETL (extract, transform, and load) service that assists in preparing and processing data for analytics and machine learning workloads. Launched in 2017, it is part of the broader Amazon Web Services ecosystem, which began in 2006. Glue is designed to simplify the complexities of data integration by providing a centralized metadata repository known as the AWS Glue Data Catalog, automated ETL job generation, and flexible execution environments.

The service is well-suited for organizations building data lakes on Amazon S3, as it can crawl various data stores, discover schemas, and populate the Data Catalog with metadata. This catalog then serves as a unified view across disparate data sources, enabling other AWS services like Amazon Athena, Amazon Redshift Spectrum, and Amazon EMR to query the data efficiently. Glue supports a range of data sources, including Amazon S3, Amazon RDS, Amazon Redshift, and various JDBC-compliant data stores, allowing for comprehensive data integration scenarios.
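
Once crawlers have populated the Data Catalog, its contents can be read programmatically. The sketch below uses the boto3 Glue `get_tables` paginator to collect each table's column names and types; the database name in the usage comment is a placeholder assumption, and the client is passed in so the function can be exercised without AWS access.

```python
def list_table_schemas(glue_client, database):
    """Return {table_name: [(column_name, column_type), ...]} for a database."""
    schemas = {}
    paginator = glue_client.get_paginator("get_tables")
    for page in paginator.paginate(DatabaseName=database):
        for table in page["TableList"]:
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
    return schemas


# Usage (requires AWS credentials; "example_db" is a hypothetical database):
#   import boto3
#   print(list_table_schemas(boto3.client("glue"), "example_db"))
```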

Developers and data engineers use AWS Glue to create and manage ETL pipelines programmatically or through visual tools like AWS Glue Studio. It supports custom Python and Scala scripts on the Apache Spark engine, and Python scripts on the Apache Ray engine, for distributed processing. This flexibility enables users to perform complex data transformations, enrich datasets, and move data between different storage systems. For instance, a common use case involves transforming raw log files stored in S3 into a columnar format like Parquet for optimized analytical queries.

AWS Glue also includes specialized components: AWS Glue DataBrew, for visual data preparation by data analysts and business users, and AWS Glue Data Quality, for monitoring and improving data reliability. These features extend Glue's utility beyond traditional ETL to data governance and accessibility. Because Glue is serverless, users do not need to provision or manage underlying infrastructure; AWS scales resources automatically based on job requirements. Charges are based primarily on the compute consumed, measured in DPU-hours (Data Processing Unit hours), as detailed on the AWS Glue pricing page.

Key features

AWS Glue Data Catalog: A centralized metadata repository that stores schema, table definitions, and location of data from various sources, making it discoverable and queryable by other AWS services.
Serverless ETL Engine: Executes Spark and Ray jobs without requiring users to provision or manage servers, automatically scaling resources based on workload demands.
AWS Glue Studio: A visual interface for creating, running, and monitoring ETL jobs, allowing users to design data flows with drag-and-drop functionality.
AWS Glue DataBrew: A visual data preparation tool that enables data analysts and data scientists to clean, normalize, and transform data without writing code.
AWS Glue Data Quality: Provides tools to define, monitor, and manage data quality rules, automatically detecting and alerting on data quality issues within ETL pipelines.
Crawler: Automatically discovers schema and partitions from data sources and populates the AWS Glue Data Catalog, supporting a wide range of data formats and databases.
Job Bookmarks: Track which data a job has already processed, so that later runs pick up only new data and avoid reprocessing.
Development Endpoints: Provide a development environment for testing and debugging Glue scripts interactively using notebooks or preferred IDEs.
Integration with AWS Services: Seamlessly integrates with Amazon S3, Amazon Redshift, Amazon Athena, Amazon EMR, AWS Lambda, and other services for end-to-end data pipelines.
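
As an illustration of the Job Bookmarks feature listed above, bookmarks are switched on per run through the `--job-bookmark-option` job argument. The sketch below builds that argument and passes it to the boto3 Glue `start_job_run` call; the job name in the usage comment is a placeholder assumption, and the client is injected so the helper can be tried without AWS access.

```python
def bookmark_run_arguments(enabled=True, extra=None):
    """Build job-run arguments that enable or disable job bookmarks."""
    option = "job-bookmark-enable" if enabled else "job-bookmark-disable"
    arguments = {"--job-bookmark-option": option}
    arguments.update(extra or {})
    return arguments


def start_job_with_bookmarks(glue_client, job_name):
    """Start a run with bookmarks enabled and return its run ID."""
    response = glue_client.start_job_run(
        JobName=job_name,
        Arguments=bookmark_run_arguments(),
    )
    return response["JobRunId"]


# Usage (requires AWS credentials; "example-etl-job" is a hypothetical job):
#   import boto3
#   start_job_with_bookmarks(boto3.client("glue"), "example-etl-job")
```

For bookmarks to take effect, the job script also needs a `transformation_ctx` on its sources and a final `job.commit()` call, which is where the bookmark state is saved.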

Pricing

AWS Glue follows a pay-as-you-go pricing model, with costs that vary by component. Core ETL jobs are billed per DPU-hour (Data Processing Unit hour), with different DPU configurations available for various job types (e.g., Standard, G.1X, G.2X). Interactive sessions and development endpoints are also billed per DPU-hour. The AWS Glue Data Catalog charges for the number of objects stored and the number of requests made. AWS Glue DataBrew and AWS Glue Data Quality have separate pricing structures based on recipe execution time and data scanned, respectively.

AWS Glue Pricing Overview (As of 2026-05-07) Source: AWS Glue Pricing

Component	Pricing Metric	Details
ETL Jobs (Spark/Ray)	DPU-hour	Billed per second, with a 1-minute minimum. Varies by DPU type (e.g., Standard, G.1X, G.2X).
Interactive Sessions	DPU-hour	Billed per second, with a 1-minute minimum.
Development Endpoints	DPU-hour	Billed per second, with a 1-minute minimum.
Data Catalog (Storage)	Objects stored	First 1 million objects are free. Charges per 100,000 objects over 1 million per month.
Data Catalog (Requests)	Requests	First 1 million requests are free. Charges per 1 million requests over 1 million per month.
AWS Glue DataBrew	Recipe Run Time	Billed per hour of recipe execution.
AWS Glue Data Quality	Data Scanned	Billed per GB of data scanned.
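
The Data Catalog free tier in the table above can be worked through with a short calculation. This is a sketch under stated assumptions: the rates of 1.00 USD per 100,000 objects and 1.00 USD per million requests are illustrative placeholders, and usage above the free tier is charged proportionally here, without rounding up to whole billing units. Check the AWS Glue pricing page for actual figures.

```python
def data_catalog_monthly_cost(
    objects_stored,
    requests,
    storage_rate_per_100k=1.00,      # assumed example rate (USD)
    request_rate_per_million=1.00,   # assumed example rate (USD)
    free_objects=1_000_000,
    free_requests=1_000_000,
):
    """Estimate the monthly Data Catalog charge in USD."""
    # Only usage above the free tier is billable.
    billable_objects = max(objects_stored - free_objects, 0)
    billable_requests = max(requests - free_requests, 0)
    storage_cost = billable_objects / 100_000 * storage_rate_per_100k
    request_cost = billable_requests / 1_000_000 * request_rate_per_million
    return storage_cost + request_cost


# 1.5 million objects and 3 million requests in a month:
# 500,000 billable objects and 2,000,000 billable requests.
print(data_catalog_monthly_cost(1_500_000, 3_000_000))
```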

Common integrations

Amazon S3: Serves as a primary data lake storage for Glue, with crawlers discovering schema and ETL jobs reading/writing data. See: Integrating Glue with Amazon S3.
Amazon Redshift: Glue can extract data from Redshift, transform it, and load it back, or be used for initial data loading into Redshift. See: Connecting Glue to Amazon Redshift.
Amazon Athena: Queries data directly from S3 using schemas defined in the AWS Glue Data Catalog, enabling interactive analytics. See: Athena and Glue Data Catalog.
Amazon EMR: Glue Data Catalog can be used as a metastore for Apache Spark and Hive on EMR clusters. See: Using Glue Data Catalog with Amazon EMR.
AWS Lambda: Can trigger Glue jobs in response to events, such as new files arriving in an S3 bucket, forming event-driven ETL pipelines. See: Triggering Glue jobs with AWS Lambda.
Amazon Kinesis: Glue streaming ETL jobs can consume data directly from Kinesis data streams for real-time processing. See: Processing streaming data with Glue.
Amazon RDS: Glue can connect to various relational databases hosted on Amazon RDS (e.g., PostgreSQL, MySQL, SQL Server) to extract and process data. See: Connecting Glue to Amazon RDS.
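
The event-driven pattern mentioned for AWS Lambda in the list above can be sketched as a handler that starts a Glue job run for each object in an S3 event notification. The job name and the `--source_path` argument are hypothetical; the Glue job would need to read that argument itself (for example with `getResolvedOptions`). The optional `glue_client` parameter is there only so the handler can be exercised without AWS access.

```python
from urllib.parse import unquote_plus

GLUE_JOB_NAME = "example-etl-job"  # hypothetical job name


def handler(event, context, glue_client=None):
    """Start one Glue job run per S3 object in the incoming event."""
    if glue_client is None:
        import boto3  # available by default in the Lambda Python runtime
        glue_client = boto3.client("glue")

    run_ids = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys in S3 event notifications are URL-encoded.
        key = unquote_plus(record["s3"]["object"]["key"])
        response = glue_client.start_job_run(
            JobName=GLUE_JOB_NAME,
            Arguments={"--source_path": f"s3://{bucket}/{key}"},
        )
        run_ids.append(response["JobRunId"])
    return {"job_run_ids": run_ids}
```

Starting one run per object keeps the sketch simple; for buckets that receive many small files, batching several keys into a single run is usually the more economical design.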

Alternatives

Databricks: Offers a unified data and AI platform built on Apache Spark, providing advanced analytics, machine learning, and data engineering capabilities.
Google Cloud Dataflow: A fully managed service for executing Apache Beam pipelines, designed for both batch and stream data processing on Google Cloud. See: Google Cloud Dataflow documentation.
Azure Data Factory: A cloud-based ETL and data integration service that allows users to create data-driven workflows for orchestrating data movement and transformation at scale. See: Azure Data Factory documentation.
Apache NiFi: An open-source system for automating the flow of data between software systems, offering a web-based UI for designing data flows.

Getting started

To begin using AWS Glue, you typically define a crawler to discover your data's schema and then create an ETL job to transform and load the data. Here's a basic Python script for an AWS Glue ETL job that reads data from an S3 bucket, transforms it (e.g., adding a timestamp), and writes it back to S3. This script assumes you have an S3 bucket named your-source-bucket with data in a folder named raw/ and an output bucket named your-target-bucket.

Before running this, ensure your Glue job's IAM role has permissions for S3 read/write operations and Glue service access. This example is a simplified representation; in practice, Glue jobs are often configured via the AWS Glue console or AWS Glue API.


import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from pyspark.sql.functions import current_timestamp, to_date

## @params --JOB_NAME
args = getResolvedOptions(sys.argv, ['JOB_NAME'])

# Set up the Spark and Glue contexts, then initialize the job
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Define your S3 source path
source_s3_path = "s3://your-source-bucket/raw/"

# Read data from S3 using Glue's dynamic frame
datasource0 = glueContext.create_dynamic_frame.from_options(
    connection_type="s3",
    connection_options={
        "paths": [source_s3_path],
        "recurse": True
    },
    format="json", # or "csv", "parquet", etc.
    transformation_ctx="datasource0"
)

# Convert to a Spark DataFrame for transformations
df = datasource0.toDF()

# Example transformation: add a processing timestamp, plus a date column
# derived from it that the write step uses as its partition key
df_transformed = (
    df.withColumn("processed_timestamp", current_timestamp())
      .withColumn("processed_timestamp_date", to_date("processed_timestamp"))
)

# Convert back to a DynamicFrame for writing with Glue
datasink0 = DynamicFrame.fromDF(df_transformed, glueContext, "datasink0")

# Define your S3 target path
target_s3_path = "s3://your-target-bucket/processed/"

# Write transformed data to S3
glueContext.write_dynamic_frame.from_options(
    frame=datasink0,
    connection_type="s3",
    connection_options={
        "path": target_s3_path,
        "partitionKeys": ["processed_timestamp_date"] # Example partition key
    },
    format="parquet", # or "json", "csv", etc.
    transformation_ctx="datasink0"
)

job.commit()
