Athena
LocalEmu Athena implements all 70 operations. The control plane (workgroups, named queries, prepared statements, data catalogs, capacity reservations, sessions, calculations) is served by the moto-ext metadata backend. The five custom operations (StartQueryExecution, GetQueryExecution, GetQueryResults, StopQueryExecution, ListTagsForResource) ride on a real DuckDB query engine. SELECT queries read Parquet, CSV, JSON, and ORC files from S3 through the Glue Catalog; CTAS and INSERT INTO write back to S3 and update the Glue Catalog. Athena is the only LocalEmu service that produces real Glue Catalog mutations as a side effect of running a query.
Operation-level coverage: see the Athena coverage matrix.
Quick start
# 1. Land some Parquet under an S3 prefix and register it in Glue.
$ awsemu s3 mb s3://lake
$ python3 - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq, boto3
t = pa.table({"order_id":[1,2,3],"amount":[10.0,20.0,30.5],"region":["us","eu","us"]})
pq.write_table(t, "/tmp/orders.parquet")
boto3.client("s3", endpoint_url="http://localhost:4566", region_name="us-east-1") \
.upload_file("/tmp/orders.parquet", "lake", "orders/region=us/orders.parquet")
PY
$ awsemu glue create-database --database-input '{"Name":"sales"}'
$ awsemu glue create-table --database-name sales --table-input '{
"Name":"orders",
"StorageDescriptor":{
"Location":"s3://lake/orders/",
"Columns":[
{"Name":"order_id","Type":"int"},
{"Name":"amount","Type":"double"}
],
"SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"}
},
"PartitionKeys":[{"Name":"region","Type":"string"}]
}'
# 2. Query through Athena. DuckDB does the actual read.
$ EXEC=$(awsemu athena start-query-execution \
--query-string "SELECT region, SUM(amount) AS total FROM sales.orders GROUP BY region" \
--result-configuration "OutputLocation=s3://lake/results/" \
--query QueryExecutionId --output text)
$ awsemu athena get-query-execution --query-execution-id $EXEC \
--query 'QueryExecution.Status.State' --output text
SUCCEEDED
$ awsemu athena get-query-results --query-execution-id $EXEC \
--query 'ResultSet.Rows[].Data[].VarCharValue'
[
"region","total",
"eu","20.0",
"us","40.5"
] No persistent compute cluster, no separate awsemu athena start-session: a DuckDB connection is opened per query and closed after results are written. Queries execute synchronously inside StartQueryExecution, so GetQueryExecution already reports SUCCEEDED on the first poll.
Architecture
Code lives at services/athena/. The entry point is create_athena_service() at provider.py:567, wired into services/providers.py:471 and plux.ini as athena:default. The module is split into single-responsibility files:
| File | Role |
|---|---|
provider.py | Op handlers, SQL parsing + rewriting, DML dispatch, moto fallthrough. |
engine.py | DuckDB connection management, httpfs/S3 endpoint config, dialect translation. |
glue_resolver.py | Map a Glue table (StorageDescriptor + SerdeInfo) to a DuckDB read function. |
dml_handler.py | CTAS + INSERT INTO planning, schema probing, COPY-to-S3, Glue table registration. |
result_writer.py | Write rows as CSV + JSON metadata sidecar to OutputLocation/<exec_id>.csv. |
result_reader.py | Rehydrate QueryResult from S3 after a process restart. |
registry.py | Thread-safe in-memory exec_id → QueryResult map. |
Query execution flow
Every StartQueryExecution walks this synchronous pipeline (provider.py:170-315):
- Parse the SQL with sqlglot; classify as SELECT, CTAS, or INSERT INTO.
- Resolve referenced tables against moto's
glue_backends[account][region]. - Translate each table reference into a DuckDB read function (see the format table below) using the Glue StorageDescriptor + SerdeInfo.
- Rewrite the SQL:
FROM db.tablebecomesFROM read_parquet('s3://...', hive_partitioning=1)(or the equivalent CSV/JSON/ORC reader). - Execute on a fresh in-memory DuckDB connection. The connection is configured with the httpfs extension pointed at LocalEmu's S3 endpoint (
engine.py:44readsATHENA_S3_ENDPOINT_HOSTand usesGATEWAY_LISTEN[0].port). - Persist results:
result_writerwrites a CSV plus a JSON metadata sidecar toResultConfiguration.OutputLocationon S3. - Update state: the moto execution record flips directly to
SUCCEEDED(orFAILEDwith a captured stderr) with the recordedexecution_time_ms.
Glue resolver supported formats
The mapping table at glue_resolver.py:45-114:
| Glue SerDe | DuckDB read function |
|---|---|
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe | read_csv_auto() |
org.apache.hadoop.hive.serde2.OpenCSVSerde | read_csv_auto() with quote/escape options |
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe | read_parquet() with hive_partitioning=1 |
org.openx.data.jsonserde.JsonSerDe / HCatalog variant | read_json_auto() |
org.apache.hadoop.hive.ql.io.orc.OrcSerde | read_orc() |
| Unknown SerDe | Falls back to read_csv_auto() with a warning. |
Partition pruning is automatic: when the table has PartitionKeys and the data follows the key=value/ S3 layout, DuckDB's hive_partitioning=1 flag pushes predicate filters down to the file scanner so partitions outside the WHERE clause are never read.
CREATE TABLE AS (CTAS)
dml_handler.execute_ctas at dml_handler.py:217-270 turns a CTAS statement into a real S3 write plus a Glue catalog registration:
- Probe the SELECT's output schema with
SELECT * FROM (...) WHERE FALSEto capture column names and DuckDB types without running the query. - Execute
COPY (...) TO 's3://...' (FORMAT PARQUET, PARTITION_BY (cols))against the configured output location. - If the target Glue database does not exist, create it.
- Register a new Glue table with the inferred
StorageDescriptor(S3 location, columns, SerDe matching the chosen format) andPartitionKeys.
$ EXEC=$(awsemu athena start-query-execution \
--query-string "CREATE TABLE sales.high_value WITH (
external_location = 's3://lake/high_value/',
format = 'PARQUET',
partitioned_by = ARRAY['region']
) AS
SELECT order_id, amount, region
FROM sales.orders
WHERE amount > 15" \
--result-configuration "OutputLocation=s3://lake/results/" \
--query QueryExecutionId --output text)
$ awsemu athena get-query-execution --query-execution-id $EXEC \
--query 'QueryExecution.Status.State' --output text
SUCCEEDED
# Real Parquet files were written under the partitioned prefix.
$ awsemu s3 ls s3://lake/high_value/ --recursive
2026-05-20 19:01:42 482 high_value/region=eu/data_0.parquet
2026-05-20 19:01:42 491 high_value/region=us/data_0.parquet
# And the new table is registered in the Glue catalog with partition keys.
$ awsemu glue get-table --database-name sales --name high_value \
--query 'Table.{Location:StorageDescriptor.Location,Parts:PartitionKeys[].Name}'
{"Location": "s3://lake/high_value/", "Parts": ["region"]} After CTAS, awsemu glue get-table returns the registered table and subsequent Athena queries can SELECT FROM sales.high_value like any catalog-registered dataset.
INSERT INTO
dml_handler.execute_insert at dml_handler.py:275-324 appends new data to an existing table:
- •Resolves the target table from Glue (same SerDe/format detection as SELECT).
- •Writes a new UUID-named object under the existing S3 prefix, in the same format as the target.
- •Does not rewrite existing files (append-only).
$ awsemu athena start-query-execution \
--query-string "INSERT INTO sales.orders VALUES (4, 99.0, 'eu')" \
--result-configuration "OutputLocation=s3://lake/results/" \
--query QueryExecutionId --output text >/dev/null
# A new UUID-named Parquet object lands under the existing prefix.
$ awsemu s3 ls s3://lake/orders/ --recursive
2026-05-20 19:02:11 491 orders/region=us/orders.parquet
2026-05-20 19:02:18 398 orders/region=eu/<uuid>.parquet
# A follow-up SELECT sees the new row through Glue + DuckDB.
$ EXEC=$(awsemu athena start-query-execution \
--query-string "SELECT COUNT(*) FROM sales.orders WHERE region='eu'" \
--result-configuration "OutputLocation=s3://lake/results/" \
--query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
--query 'ResultSet.Rows[1].Data[0].VarCharValue'
2 Result storage and pagination
- •Results live both in memory (
AthenaResultRegistry) and on S3 (OutputLocation/<exec_id>.csvplus a.csv.metadataJSON sidecar). - •After a LocalEmu process restart,
GetQueryResultsrehydrates the result set from S3 viaresult_reader. Old executions are still queryable as long as the result bucket survives. - •
GetQueryResultspagination: row 0 is the header row,MaxResultsis clamped to 1-1000,NextTokenis an opaque integer offset.
Configuration
| Variable | Default | Purpose |
|---|---|---|
ATHENA_BACKEND | duckdb | duckdb enables the real query engine. off / none / disabled / moto fall back to moto and return empty Rows[] on every query. Read at provider.py:49. |
ATHENA_S3_ENDPOINT_HOST | localhost | Hostname used by DuckDB's httpfs extension to reach LocalEmu's S3. The port comes from GATEWAY_LISTEN[0]. Override when running inside Docker (host.docker.internal). Read at engine.py:44. |
# If you do not need real query execution, fall back to the moto path
# which keeps the control plane intact and returns empty Rows[].
$ ATHENA_BACKEND=off localemu start
$ EXEC=$(awsemu athena start-query-execution \
--query-string "SELECT 1" \
--result-configuration "OutputLocation=s3://x/r/" \
--query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
--query 'ResultSet.Rows'
[] Integration points
| Service | How it touches Athena |
|---|---|
| S3 | Reads source data via DuckDB httpfs. Writes results via boto3 + the LocalEmu S3 endpoint. CTAS writes Parquet/JSON files to the configured external_location. |
| Glue Catalog | Reads tables from moto's glue_backends; CTAS writes a new table back through glue_backend.create_table(), auto-creating the database when missing. This is the only Glue mutation in LocalEmu that is driven by another service. |
| CloudFormation | AWS::Athena::WorkGroup, AWS::Athena::NamedQuery resource providers wire through the moto control plane. |
Known limitations
- •Execution is synchronous, not queued. Queries run inside
StartQueryExecutionand the response already showsSUCCEEDED. Code that depends on aQUEUED → RUNNING → SUCCEEDEDtransition (a poll loop with sleep) will see the terminal state on the first call. That is faster than AWS in practice but not what the AWS state machine documents. - •Workgroups are metadata-only.
CreateWorkGroup/UpdateWorkGroupstore engine version, result encryption, bytes-scanned cutoff, and result-location overrides, but the query path does not enforce any of them. All queries run on DuckDB regardless of the workgroup's pinned engine version. - •Named queries and prepared statements are metadata-only. They round-trip through CRUD but are not bound to the engine.
- •No federated data sources. Only
AwsDataCatalog(Glue) is resolved. Connector-style data sources backed by Lambda functions are not invoked. - •No User-Defined Functions. Lambda-backed UDFs are not invoked;
USING FUNCTION ...falls through to DuckDB's parser and errors as unknown function. - •No Iceberg or Hive ACID tables. The engine is positioned for read-mostly catalog-and-files workflows; ACID semantics, MERGE, and DELETE are not supported.
- •No bucketed tables.
CLUSTERED BY ... INTO N BUCKETSis parsed but ignored on CTAS. - •INSERT INTO does not register new partitions in Glue. Rows with a new partition-key value land in the right S3 prefix, but the Glue
Partitionslist is not extended. Subsequent queries still see the data (DuckDB walks the S3 prefix), but external tools that read partitions from Glue will miss them. Callglue:CreatePartitionorMSCK REPAIR TABLEequivalent yourself. - •No SparkSQL on Athena, no Notebooks, no Calculations. The session and calculation APIs round-trip metadata; no Spark runtime is started.
- •No capacity reservations. The reservation APIs return success but provision nothing.
- •No geospatial functions. DuckDB does support its own spatial extension, but it is not auto-loaded by LocalEmu's engine layer.
- •Glue Catalog itself is moto-metadata. Athena reads from it correctly, but crawlers and ETL jobs do not produce data; see the Known limitations page for the full Glue caveat.