Docs / Athena

Athena

LocalEmu Athena implements all 70 operations. The control plane (workgroups, named queries, prepared statements, data catalogs, capacity reservations, sessions, calculations) is served by the moto-ext metadata backend. The five custom operations (StartQueryExecution, GetQueryExecution, GetQueryResults, StopQueryExecution, ListTagsForResource) ride on a real DuckDB query engine. SELECT queries read Parquet, CSV, JSON, and ORC files from S3 through the Glue Catalog; CTAS and INSERT INTO write back to S3 and update the Glue Catalog. Athena is the only LocalEmu service that produces real Glue Catalog mutations as a side effect of running a query.

Operation-level coverage: see the Athena coverage matrix.

Quick start

# 1. Land some Parquet under an S3 prefix and register it in Glue.
$ awsemu s3 mb s3://lake
$ python3 - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq, boto3
t = pa.table({"order_id":[1,2,3],"amount":[10.0,20.0,30.5],"region":["us","eu","us"]})
pq.write_table(t, "/tmp/orders.parquet")
boto3.client("s3", endpoint_url="http://localhost:4566", region_name="us-east-1") \
     .upload_file("/tmp/orders.parquet", "lake", "orders/region=us/orders.parquet")
PY

$ awsemu glue create-database --database-input '{"Name":"sales"}'
$ awsemu glue create-table --database-name sales --table-input '{
      "Name":"orders",
      "StorageDescriptor":{
        "Location":"s3://lake/orders/",
        "Columns":[
          {"Name":"order_id","Type":"int"},
          {"Name":"amount","Type":"double"}
        ],
        "SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"}
      },
      "PartitionKeys":[{"Name":"region","Type":"string"}]
    }'

# 2. Query through Athena. DuckDB does the actual read.
$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT region, SUM(amount) AS total FROM sales.orders GROUP BY region" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)

$ awsemu athena get-query-execution --query-execution-id $EXEC \
    --query 'QueryExecution.Status.State' --output text
SUCCEEDED

$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows[].Data[].VarCharValue'
[
  "region","total",
  "eu","20.0",
  "us","40.5"
]

No persistent compute cluster, no separate awsemu athena start-session: a DuckDB connection is opened per query and closed after results are written. Queries execute synchronously inside StartQueryExecution, so GetQueryExecution already reports SUCCEEDED on the first poll.

Architecture

Code lives at services/athena/. The entry point is create_athena_service() at provider.py:567, wired into services/providers.py:471 and plux.ini as athena:default. The module is split into single-responsibility files:

File	Role
`provider.py`	Op handlers, SQL parsing + rewriting, DML dispatch, moto fallthrough.
`engine.py`	DuckDB connection management, httpfs/S3 endpoint config, dialect translation.
`glue_resolver.py`	Map a Glue table (StorageDescriptor + SerdeInfo) to a DuckDB read function.
`dml_handler.py`	CTAS + INSERT INTO planning, schema probing, COPY-to-S3, Glue table registration.
`result_writer.py`	Write rows as CSV + JSON metadata sidecar to `OutputLocation/<exec_id>.csv`.
`result_reader.py`	Rehydrate `QueryResult` from S3 after a process restart.
`registry.py`	Thread-safe in-memory `exec_id → QueryResult` map.

Query execution flow

Every StartQueryExecution walks this synchronous pipeline (provider.py:170-315):

Parse the SQL with sqlglot; classify as SELECT, CTAS, or INSERT INTO.
Resolve referenced tables against moto's glue_backends[account][region].
Translate each table reference into a DuckDB read function (see the format table below) using the Glue StorageDescriptor + SerdeInfo.
Rewrite the SQL: FROM db.table becomes FROM read_parquet('s3://...', hive_partitioning=1) (or the equivalent CSV/JSON/ORC reader).
Execute on a fresh in-memory DuckDB connection. The connection is configured with the httpfs extension pointed at LocalEmu's S3 endpoint (engine.py:44 reads ATHENA_S3_ENDPOINT_HOST and uses GATEWAY_LISTEN[0].port).
Persist results: result_writer writes a CSV plus a JSON metadata sidecar to ResultConfiguration.OutputLocation on S3.
Update state: the moto execution record flips directly to SUCCEEDED (or FAILED with a captured stderr) with the recorded execution_time_ms.

Glue resolver supported formats

The mapping table at glue_resolver.py:45-114:

Glue SerDe	DuckDB read function
`org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe`	`read_csv_auto()`
`org.apache.hadoop.hive.serde2.OpenCSVSerde`	`read_csv_auto()` with quote/escape options
`org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe`	`read_parquet()` with `hive_partitioning=1`
`org.openx.data.jsonserde.JsonSerDe` / HCatalog variant	`read_json_auto()`
`org.apache.hadoop.hive.ql.io.orc.OrcSerde`	`read_orc()`
Unknown SerDe	Falls back to `read_csv_auto()` with a warning.

Partition pruning is automatic: when the table has PartitionKeys and the data follows the key=value/ S3 layout, DuckDB's hive_partitioning=1 flag pushes predicate filters down to the file scanner so partitions outside the WHERE clause are never read.

CREATE TABLE AS (CTAS)

dml_handler.execute_ctas at dml_handler.py:217-270 turns a CTAS statement into a real S3 write plus a Glue catalog registration:

Probe the SELECT's output schema with SELECT * FROM (...) WHERE FALSE to capture column names and DuckDB types without running the query.
Execute COPY (...) TO 's3://...' (FORMAT PARQUET, PARTITION_BY (cols)) against the configured output location.
If the target Glue database does not exist, create it.
Register a new Glue table with the inferred StorageDescriptor (S3 location, columns, SerDe matching the chosen format) and PartitionKeys.

$ EXEC=$(awsemu athena start-query-execution \
    --query-string "CREATE TABLE sales.high_value WITH (
      external_location = 's3://lake/high_value/',
      format            = 'PARQUET',
      partitioned_by    = ARRAY['region']
    ) AS
    SELECT order_id, amount, region
    FROM   sales.orders
    WHERE  amount > 15" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)

$ awsemu athena get-query-execution --query-execution-id $EXEC \
    --query 'QueryExecution.Status.State' --output text
SUCCEEDED

# Real Parquet files were written under the partitioned prefix.
$ awsemu s3 ls s3://lake/high_value/ --recursive
2026-05-20 19:01:42      482 high_value/region=eu/data_0.parquet
2026-05-20 19:01:42      491 high_value/region=us/data_0.parquet

# And the new table is registered in the Glue catalog with partition keys.
$ awsemu glue get-table --database-name sales --name high_value \
    --query 'Table.{Location:StorageDescriptor.Location,Parts:PartitionKeys[].Name}'
{"Location": "s3://lake/high_value/", "Parts": ["region"]}

After CTAS, awsemu glue get-table returns the registered table and subsequent Athena queries can SELECT FROM sales.high_value like any catalog-registered dataset.

INSERT INTO

dml_handler.execute_insert at dml_handler.py:275-324 appends new data to an existing table:

•
Resolves the target table from Glue (same SerDe/format detection as SELECT).
•
Writes a new UUID-named object under the existing S3 prefix, in the same format as the target.
•
Does not rewrite existing files (append-only).

$ awsemu athena start-query-execution \
    --query-string "INSERT INTO sales.orders VALUES (4, 99.0, 'eu')" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text >/dev/null

# A new UUID-named Parquet object lands under the existing prefix.
$ awsemu s3 ls s3://lake/orders/ --recursive
2026-05-20 19:02:11      491 orders/region=us/orders.parquet
2026-05-20 19:02:18      398 orders/region=eu/<uuid>.parquet

# A follow-up SELECT sees the new row through Glue + DuckDB.
$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT COUNT(*) FROM sales.orders WHERE region='eu'" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows[1].Data[0].VarCharValue'
2

Result storage and pagination

•
Results live both in memory (AthenaResultRegistry) and on S3 (OutputLocation/<exec_id>.csv plus a .csv.metadata JSON sidecar).
•
After a LocalEmu process restart, GetQueryResults rehydrates the result set from S3 via result_reader. Old executions are still queryable as long as the result bucket survives.
•
GetQueryResults pagination: row 0 is the header row, MaxResults is clamped to 1-1000, NextToken is an opaque integer offset.

Configuration

Variable	Default	Purpose
`ATHENA_BACKEND`	`duckdb`	`duckdb` enables the real query engine. `off` / `none` / `disabled` / `moto` fall back to moto and return empty `Rows[]` on every query. Read at `provider.py:49`.
`ATHENA_S3_ENDPOINT_HOST`	`localhost`	Hostname used by DuckDB's `httpfs` extension to reach LocalEmu's S3. The port comes from `GATEWAY_LISTEN[0]`. Override when running inside Docker (`host.docker.internal`). Read at `engine.py:44`.

# If you do not need real query execution, fall back to the moto path
# which keeps the control plane intact and returns empty Rows[].
$ ATHENA_BACKEND=off localemu start

$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT 1" \
    --result-configuration "OutputLocation=s3://x/r/" \
    --query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows'
[]

Integration points

Service	How it touches Athena
S3	Reads source data via DuckDB `httpfs`. Writes results via boto3 + the LocalEmu S3 endpoint. CTAS writes Parquet/JSON files to the configured `external_location`.
Glue Catalog	Reads tables from moto's `glue_backends`; CTAS writes a new table back through `glue_backend.create_table()`, auto-creating the database when missing. This is the only Glue mutation in LocalEmu that is driven by another service.
CloudFormation	`AWS::Athena::WorkGroup`, `AWS::Athena::NamedQuery` resource providers wire through the moto control plane.

Known limitations

•
Execution is synchronous, not queued. Queries run inside StartQueryExecution and the response already shows SUCCEEDED. Code that depends on a QUEUED → RUNNING → SUCCEEDED transition (a poll loop with sleep) will see the terminal state on the first call. That is faster than AWS in practice but not what the AWS state machine documents.
•
Workgroups are metadata-only. CreateWorkGroup / UpdateWorkGroup store engine version, result encryption, bytes-scanned cutoff, and result-location overrides, but the query path does not enforce any of them. All queries run on DuckDB regardless of the workgroup's pinned engine version.
•
Named queries and prepared statements are metadata-only. They round-trip through CRUD but are not bound to the engine.
•
No federated data sources. Only AwsDataCatalog (Glue) is resolved. Connector-style data sources backed by Lambda functions are not invoked.
•
No User-Defined Functions. Lambda-backed UDFs are not invoked; USING FUNCTION ... falls through to DuckDB's parser and errors as unknown function.
•
No Iceberg or Hive ACID tables. The engine is positioned for read-mostly catalog-and-files workflows; ACID semantics, MERGE, and DELETE are not supported.
•
No bucketed tables. CLUSTERED BY ... INTO N BUCKETS is parsed but ignored on CTAS.
•
INSERT INTO does not register new partitions in Glue. Rows with a new partition-key value land in the right S3 prefix, but the Glue Partitions list is not extended. Subsequent queries still see the data (DuckDB walks the S3 prefix), but external tools that read partitions from Glue will miss them. Call glue:CreatePartition or MSCK REPAIR TABLE equivalent yourself.
•
No SparkSQL on Athena, no Notebooks, no Calculations. The session and calculation APIs round-trip metadata; no Spark runtime is started.
•
No capacity reservations. The reservation APIs return success but provision nothing.
•
No geospatial functions. DuckDB does support its own spatial extension, but it is not auto-loaded by LocalEmu's engine layer.
•
Glue Catalog itself is moto-metadata. Athena reads from it correctly, but crawlers and ETL jobs do not produce data; see the Known limitations page for the full Glue caveat.

← Amazon MSK ECS →