AWS Athena is a serverless interactive query service that allows you to analyze data directly in Amazon S3 and other data sources using standard SQL or Apache Spark, without managing any infrastructure.

How does Athena integrate with Amazon S3?

Athena directly queries data stored in Amazon S3. You define tables and schemas in the AWS Glue Data Catalog that point to your S3 data, and Athena uses these definitions to execute SQL queries on the files.
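
As a rough illustration of what such a table definition can look like, the sketch below assembles a `CREATE EXTERNAL TABLE` statement for Parquet files in S3. The helper name `build_external_table_ddl`, the database, table, and column names, and the bucket path are hypothetical placeholders chosen for this example.

```python
def build_external_table_ddl(database, table, columns, s3_location):
    """Return a CREATE EXTERNAL TABLE statement for Parquet data in S3.

    `columns` maps column names to Athena/Hive types, e.g. {"id": "bigint"}.
    """
    # One indented "`name` type" line per column, in insertion order
    column_lines = ",\n".join(
        f"  `{name}` {col_type}" for name, col_type in columns.items()
    )
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS `{database}`.`{table}` (\n"
        f"{column_lines}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{s3_location}'"
    )


# Hypothetical example values; replace with your own schema and bucket
ddl = build_external_table_ddl(
    database="weblogs",
    table="requests",
    columns={"request_id": "string", "status": "int", "bytes_sent": "bigint"},
    s3_location="s3://your-data-bucket/requests/",
)
print(ddl)
```

The resulting statement can be run like any other Athena query. The table is only metadata in the Data Catalog, and the files in S3 are left untouched.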

What are the core differences between Athena SQL and Athena for Apache Spark?

Athena SQL is ideal for ad-hoc, interactive SQL queries and reporting on structured and semi-structured data. Athena for Apache Spark provides a fully managed Spark runtime, enabling more complex data transformations, machine learning, and streaming analytics using PySpark, Scala, or Spark SQL.

How is AWS Athena priced?

For SQL queries, Athena is priced per terabyte (TB) of data scanned. For Athena for Apache Spark, pricing is based on Data Processing Unit (DPU) hours consumed, billed per second after a one-minute minimum.

Can Athena query data outside of Amazon S3?

Yes, through Athena Federated Query, you can use built-in or custom connectors to query data from a variety of sources beyond S3, including relational databases, NoSQL databases, and other cloud services.
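
Once a connector is registered as a data catalog, a federated source is addressed by catalog, database, and table. The sketch below builds the arguments for a `start_query_execution` call against such a catalog. The helper name and the catalog, database, and table names (`mysql_catalog`, `sales`, `orders`) are hypothetical and assume a connector has already been set up.

```python
def build_federated_query_request(catalog, database, table, output_location, limit=10):
    """Build keyword arguments for Athena's StartQueryExecution API call."""
    # Double-quoted, fully qualified name: "catalog"."database"."table"
    query = f'SELECT * FROM "{catalog}"."{database}"."{table}" LIMIT {int(limit)}'
    return {
        "QueryString": query,
        "QueryExecutionContext": {"Catalog": catalog, "Database": database},
        "ResultConfiguration": {"OutputLocation": output_location},
    }


# Hypothetical example values
request = build_federated_query_request(
    catalog="mysql_catalog",
    database="sales",
    table="orders",
    output_location="s3://your-s3-output-bucket/athena-query-results/",
)
print(request["QueryString"])
```

With Boto3 installed and credentials configured, the dictionary can be passed on as `boto3.client("athena").start_query_execution(**request)`.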

What data formats does Athena support?

Athena supports various data formats, including CSV, JSON, ORC, Parquet, Avro, and more. Using columnar formats like ORC and Parquet can significantly reduce query costs and improve performance by reducing the amount of data scanned.
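
One common way to move existing CSV or JSON data into a columnar format is a `CREATE TABLE AS SELECT` (CTAS) query. The sketch below only builds the SQL text. The helper name, the table names, and the S3 path are hypothetical, and the table names are assumed to be plain identifiers that need no quoting.

```python
def build_parquet_ctas(source_table, target_table, external_location, columns=None):
    """Return a CTAS statement that rewrites `source_table` as Parquet files."""
    # Select only the listed columns, or everything when none are given
    select_list = ", ".join(columns) if columns else "*"
    return (
        f"CREATE TABLE {target_table}\n"
        f"WITH (\n"
        f"  format = 'PARQUET',\n"
        f"  external_location = '{external_location}'\n"
        f")\n"
        f"AS SELECT {select_list} FROM {source_table}"
    )


# Hypothetical example values
ctas = build_parquet_ctas(
    source_table="requests_csv",
    target_table="requests_parquet",
    external_location="s3://your-data-bucket/requests-parquet/",
    columns=["request_id", "status"],
)
print(ctas)
```

Queries that read only a few columns from the Parquet copy then scan just those columns rather than whole rows.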

Is there a free tier for AWS Athena?

Yes, new AWS customers receive a free tier for Athena, covering the first 10 TB of data scanned per month for 12 months. Standard S3 storage costs still apply.

AWS Athena — Serverless SQL and Spark Query Service

Overview

AWS Athena is a serverless interactive query service that enables analysts and developers to analyze data directly in Amazon S3 using standard SQL. It eliminates the need to provision, manage, or scale servers, allowing users to focus on data analysis rather than infrastructure management. Athena supports a wide range of data formats, including CSV, JSON, ORC, Parquet, and Avro, and integrates with the AWS Glue Data Catalog for metadata management and schema discovery. This integration allows users to define schemas and table structures over their raw data stored in S3, making it queryable via SQL.

Beyond traditional SQL querying, Athena also offers Amazon Athena for Apache Spark, which provides a fully managed, serverless Spark environment. This expands Athena's capabilities to include more complex data transformations, machine learning workloads, and real-time analytics for those who prefer the Spark ecosystem. Users can run Spark applications without managing Spark clusters, benefiting from the same serverless operational model as the SQL engine.

Athena is frequently used for ad-hoc querying, exploratory data analysis, generating reports, and integrating with other AWS analytics and business intelligence services like Amazon QuickSight and AWS Glue. Its pay-per-query pricing model, based on the amount of data scanned, makes it a cost-effective solution for analyzing large datasets where queries are not continuous, such as infrequently accessed logs or historical archives. For new AWS customers, a free tier is available, covering the first 10 TB scanned per month for 12 months.

The service is aimed at both engineers and data analysts who are comfortable with SQL. For teams already operating within AWS, its integration with the rest of the ecosystem removes the overhead of running separate storage and query systems. A report on cloud costs by a16z identifies data analytics workloads as a key area for cost optimization, and for intermittent workloads a pay-per-query service like Athena can be cheaper than an always-on data warehouse.

Key features

Serverless Architecture: Users do not need to provision, manage, or scale any servers or data warehousing infrastructure, as AWS handles all operational aspects.
Standard SQL Support: Athena supports standard ANSI SQL, allowing developers and analysts to use a familiar query language for data analysis.
Direct S3 Integration: Queries run directly against data stored in Amazon S3, eliminating the need to move data into a separate data warehouse.
Apache Spark Support: Provides a serverless runtime for Apache Spark, supporting Python (PySpark), Scala, and SQL for advanced analytics and machine learning.
AWS Glue Data Catalog Integration: Natively integrates with AWS Glue Data Catalog for schema storage and retrieval, enabling table definitions over S3 data.
Support for Multiple Data Formats: Compatible with various data formats including CSV, JSON, ORC, Parquet, and Avro, enhancing flexibility in data ingestion.
Cost-Effective Pricing: Pay-per-query model for SQL based on data scanned, and per Data Processing Unit (DPU) hour for Spark, optimizing costs for intermittent workloads.
Security and Compliance: Supports encryption at rest and in transit and VPC integration, and is in scope for compliance programs such as HIPAA, PCI DSS, and ISO 27001, as detailed in AWS compliance documentation.
Integration with AWS Services: Seamlessly integrates with other AWS services like Amazon QuickSight for visualization, AWS Glue for ETL, and AWS Lake Formation for data lake governance.
Query Federation: Enables querying data from sources beyond S3, such as relational databases, NoSQL databases, and other data stores, through Athena Federated Query.

Pricing

AWS Athena offers two primary pricing models, depending on whether you are using the SQL query engine or Apache Spark. Both models are consumption-based.

Service Component	Pricing Metric	Cost (as of 2026-05-07)
SQL Query Engine	Data Scanned per Query	$5.00 per TB scanned
SQL Query Engine	Failed Queries	No charge
Amazon Athena for Apache Spark	DPU-hours	Starts at $0.30 per DPU-hour (varies by region)
Amazon Athena for Apache Spark	Minimum Charge	1 minute (initial charge), then billed in 1-second increments

For detailed and region-specific pricing, refer to the official AWS Athena pricing page.
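
To make the two models concrete, here is a back-of-the-envelope estimator that uses the rates from the table above. It is a sketch under stated assumptions, not a billing calculator: it treats 1 TB as 2**40 bytes and applies a 10 MB per-query minimum for SQL, neither of which appears in the table, so both should be checked against the pricing page. It also ignores regional rate differences and special cases such as DDL statements.

```python
TB = 1024 ** 4  # assumption: 1 TB counted as 2**40 bytes
MB = 1024 ** 2

SQL_PRICE_PER_TB = 5.00           # USD, from the pricing table
SQL_MIN_BYTES = 10 * MB           # assumption: 10 MB minimum per query
SPARK_PRICE_PER_DPU_HOUR = 0.30   # USD, starting rate from the pricing table
SPARK_MIN_SECONDS = 60            # one-minute minimum from the pricing table


def estimate_sql_cost(bytes_scanned):
    """Estimated USD cost of one successful SQL query."""
    billable_bytes = max(bytes_scanned, SQL_MIN_BYTES)
    return billable_bytes / TB * SQL_PRICE_PER_TB


def estimate_spark_cost(dpus, seconds):
    """Estimated USD cost of a Spark run using `dpus` DPUs for `seconds`."""
    billable_seconds = max(seconds, SPARK_MIN_SECONDS)
    return dpus * billable_seconds / 3600 * SPARK_PRICE_PER_DPU_HOUR


# A query scanning 200 GB, and a 30-second Spark run on 4 DPUs
print(f"SQL:   ${estimate_sql_cost(200 * 1024 ** 3):.4f}")
print(f"Spark: ${estimate_spark_cost(dpus=4, seconds=30):.4f}")
```

The number of bytes a finished query actually scanned is reported under `Statistics` → `DataScannedInBytes` in the `get_query_execution` response, which makes it possible to apply the same arithmetic to real queries.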

Common integrations

Amazon S3: Primary data lake storage for Athena, where most queried data resides (Amazon S3 integration guide).
AWS Glue Data Catalog: Used to store metadata and table definitions for data in S3 that Athena queries (AWS Glue Data Catalog with Athena).
Amazon QuickSight: Business intelligence service for creating dashboards and visualizations from Athena query results (Connecting QuickSight to Athena).
AWS Lake Formation: Centralized security, governance, and auditing for data lakes built on S3, integrated with Athena for fine-grained access control (Lake Formation and Athena).
AWS Lambda: Used for custom logic, data processing, or triggering Athena queries based on events (Lambda connectors for federated queries).
AWS IAM: Manages access and permissions for users and services interacting with Athena (IAM for Athena access control).
Apache Spark: Athena provides a fully managed Spark runtime for advanced data processing tasks (Athena for Apache Spark documentation).

Alternatives

Google BigQuery: A serverless, highly scalable, and cost-effective cloud data warehouse designed for machine learning, business intelligence, and geospatial analysis.
Databricks SQL: A serverless data warehousing solution built on the Databricks Lakehouse Platform, optimized for data scientists and analysts with strong SQL capabilities.
Snowflake: A cloud data platform that offers a data warehouse as a service, providing separate storage and compute for scalable analytics.
DigitalOcean Managed Databases for PostgreSQL: For smaller-scale data analysis where data is already in a relational database, offering a managed PostgreSQL service.

Getting started

This Python example demonstrates how to use the Boto3 AWS SDK for Python to execute a simple SQL query on AWS Athena. Before running, ensure you have AWS credentials configured and an S3 bucket with some data and a corresponding table defined in the AWS Glue Data Catalog. Replace 'your-s3-output-bucket', 'your-database-name', and 'your-table-name' with your actual values.


import boto3
import time

# Configure AWS region
AWS_REGION = 'us-east-1'

# Initialize Athena client
athena_client = boto3.client('athena', region_name=AWS_REGION)

# S3 bucket for query results
S3_OUTPUT_LOCATION = 's3://your-s3-output-bucket/athena-query-results/'

# Athena database and table
DATABASE_NAME = 'your-database-name'
TABLE_NAME = 'your-table-name'

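# SQL query to execute (the table name is double-quoted so that names
# containing hyphens or other special characters still parse)
QUERY = f'SELECT * FROM "{TABLE_NAME}" LIMIT 10;'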

def execute_athena_query(query_string, database, output_location):
    try:
        response = athena_client.start_query_execution(
            QueryString=query_string,
            QueryExecutionContext={
                'Database': database
            },
            ResultConfiguration={
                'OutputLocation': output_location
            }
        )
        query_execution_id = response['QueryExecutionId']
        print(f"Started query with ID: {query_execution_id}")
        return query_execution_id
    except Exception as e:
        print(f"Error starting query: {e}")
        return None

def get_query_results(query_execution_id):
    while True:
        query_status = athena_client.get_query_execution(
            QueryExecutionId=query_execution_id
        )
        status = query_status['QueryExecution']['Status']['State']

        if status == 'SUCCEEDED':
            print("Query SUCCEEDED.")
            results = athena_client.get_query_results(
                QueryExecutionId=query_execution_id
            )
            # Process results (example: print column names and first few rows)
            column_info = results['ResultSet']['ResultSetMetadata']['ColumnInfo']
            print("Column Names:", [col['Name'] for col in column_info])
            
            rows = results['ResultSet']['Rows']
            column_names = [col['Name'] for col in column_info]
            for i, row in enumerate(rows):
                # NULL cells have no 'VarCharValue' key, so use .get()
                values = [col.get('VarCharValue', 'NULL') for col in row['Data']]
                # For SELECT queries the first row repeats the column names
                if i == 0 and values == column_names:
                    continue
                print(values)

            return results
        elif status == 'FAILED' or status == 'CANCELLED':
            reason = query_status['QueryExecution']['Status'].get('StateChangeReason', 'N/A')
            print(f"Query {status}. Reason: {reason}")
            return None
        else:
            print(f"Query is still {status}. Waiting...")
            time.sleep(5)

if __name__ == "__main__":
    query_id = execute_athena_query(QUERY, DATABASE_NAME, S3_OUTPUT_LOCATION)
    if query_id:
        get_query_results(query_id)
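
The script above reads a single page of results, which is enough for a `LIMIT 10` query. For larger result sets, one option is Boto3's paginator for `get_query_results`. The following is a minimal sketch: the helper name `iter_result_rows` is hypothetical, and it assumes the query has already reached the `SUCCEEDED` state.

```python
def iter_result_rows(athena_client, query_execution_id, page_size=1000):
    """Yield each result row as a list of strings (None for NULL cells).

    The header row that SELECT queries return on the first page is skipped.
    """
    paginator = athena_client.get_paginator("get_query_results")
    pages = paginator.paginate(
        QueryExecutionId=query_execution_id,
        PaginationConfig={"PageSize": page_size},
    )
    header = None
    first_row = True
    for page in pages:
        result_set = page["ResultSet"]
        if header is None:
            # Column names come from the metadata of the first page
            column_info = result_set["ResultSetMetadata"]["ColumnInfo"]
            header = [col["Name"] for col in column_info]
        for row in result_set["Rows"]:
            values = [cell.get("VarCharValue") for cell in row["Data"]]
            if first_row:
                first_row = False
                if values == header:
                    continue  # skip the header row
            yield values
```

With the client from the script above, usage would look like `for row in iter_result_rows(athena_client, query_id): print(row)`.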
