Docs / Athena

Athena

LocalEmu Athena implements all 70 operations. The control plane (workgroups, named queries, prepared statements, data catalogs, capacity reservations, sessions, calculations) is served by the moto-ext metadata backend. The five custom operations (StartQueryExecution, GetQueryExecution, GetQueryResults, StopQueryExecution, ListTagsForResource) ride on a real DuckDB query engine. SELECT queries read Parquet, CSV, JSON, and ORC files from S3 through the Glue Catalog; CTAS and INSERT INTO write back to S3 and update the Glue Catalog. Athena is the only LocalEmu service that produces real Glue Catalog mutations as a side effect of running a query.

Operation-level coverage: see the Athena coverage matrix.

Quick start

Terminal
# 1. Land some Parquet under an S3 prefix and register it in Glue.
$ awsemu s3 mb s3://lake
$ python3 - <<'PY'
import pyarrow as pa, pyarrow.parquet as pq, boto3
t = pa.table({"order_id":[1,2,3],"amount":[10.0,20.0,30.5],"region":["us","eu","us"]})
pq.write_table(t, "/tmp/orders.parquet")
boto3.client("s3", endpoint_url="http://localhost:4566", region_name="us-east-1") \
     .upload_file("/tmp/orders.parquet", "lake", "orders/region=us/orders.parquet")
PY

$ awsemu glue create-database --database-input '{"Name":"sales"}'
$ awsemu glue create-table --database-name sales --table-input '{
      "Name":"orders",
      "StorageDescriptor":{
        "Location":"s3://lake/orders/",
        "Columns":[
          {"Name":"order_id","Type":"int"},
          {"Name":"amount","Type":"double"}
        ],
        "SerdeInfo":{"SerializationLibrary":"org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"}
      },
      "PartitionKeys":[{"Name":"region","Type":"string"}]
    }'

# 2. Query through Athena. DuckDB does the actual read.
$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT region, SUM(amount) AS total FROM sales.orders GROUP BY region" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)

$ awsemu athena get-query-execution --query-execution-id $EXEC \
    --query 'QueryExecution.Status.State' --output text
SUCCEEDED

$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows[].Data[].VarCharValue'
[
  "region","total",
  "eu","20.0",
  "us","40.5"
]

No persistent compute cluster, no separate awsemu athena start-session: a DuckDB connection is opened per query and closed after results are written. Queries execute synchronously inside StartQueryExecution, so GetQueryExecution already reports SUCCEEDED on the first poll.

Architecture

Code lives at services/athena/. The entry point is create_athena_service() at provider.py:567, wired into services/providers.py:471 and plux.ini as athena:default. The module is split into single-responsibility files:

FileRole
provider.pyOp handlers, SQL parsing + rewriting, DML dispatch, moto fallthrough.
engine.pyDuckDB connection management, httpfs/S3 endpoint config, dialect translation.
glue_resolver.pyMap a Glue table (StorageDescriptor + SerdeInfo) to a DuckDB read function.
dml_handler.pyCTAS + INSERT INTO planning, schema probing, COPY-to-S3, Glue table registration.
result_writer.pyWrite rows as CSV + JSON metadata sidecar to OutputLocation/<exec_id>.csv.
result_reader.pyRehydrate QueryResult from S3 after a process restart.
registry.pyThread-safe in-memory exec_id → QueryResult map.

Query execution flow

Every StartQueryExecution walks this synchronous pipeline (provider.py:170-315):

  1. Parse the SQL with sqlglot; classify as SELECT, CTAS, or INSERT INTO.
  2. Resolve referenced tables against moto's glue_backends[account][region].
  3. Translate each table reference into a DuckDB read function (see the format table below) using the Glue StorageDescriptor + SerdeInfo.
  4. Rewrite the SQL: FROM db.table becomes FROM read_parquet('s3://...', hive_partitioning=1) (or the equivalent CSV/JSON/ORC reader).
  5. Execute on a fresh in-memory DuckDB connection. The connection is configured with the httpfs extension pointed at LocalEmu's S3 endpoint (engine.py:44 reads ATHENA_S3_ENDPOINT_HOST and uses GATEWAY_LISTEN[0].port).
  6. Persist results: result_writer writes a CSV plus a JSON metadata sidecar to ResultConfiguration.OutputLocation on S3.
  7. Update state: the moto execution record flips directly to SUCCEEDED (or FAILED with a captured stderr) with the recorded execution_time_ms.

Glue resolver supported formats

The mapping table at glue_resolver.py:45-114:

Glue SerDeDuckDB read function
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDeread_csv_auto()
org.apache.hadoop.hive.serde2.OpenCSVSerderead_csv_auto() with quote/escape options
org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDeread_parquet() with hive_partitioning=1
org.openx.data.jsonserde.JsonSerDe / HCatalog variantread_json_auto()
org.apache.hadoop.hive.ql.io.orc.OrcSerderead_orc()
Unknown SerDeFalls back to read_csv_auto() with a warning.

Partition pruning is automatic: when the table has PartitionKeys and the data follows the key=value/ S3 layout, DuckDB's hive_partitioning=1 flag pushes predicate filters down to the file scanner so partitions outside the WHERE clause are never read.

CREATE TABLE AS (CTAS)

dml_handler.execute_ctas at dml_handler.py:217-270 turns a CTAS statement into a real S3 write plus a Glue catalog registration:

  1. Probe the SELECT's output schema with SELECT * FROM (...) WHERE FALSE to capture column names and DuckDB types without running the query.
  2. Execute COPY (...) TO 's3://...' (FORMAT PARQUET, PARTITION_BY (cols)) against the configured output location.
  3. If the target Glue database does not exist, create it.
  4. Register a new Glue table with the inferred StorageDescriptor (S3 location, columns, SerDe matching the chosen format) and PartitionKeys.
Terminal
$ EXEC=$(awsemu athena start-query-execution \
    --query-string "CREATE TABLE sales.high_value WITH (
      external_location = 's3://lake/high_value/',
      format            = 'PARQUET',
      partitioned_by    = ARRAY['region']
    ) AS
    SELECT order_id, amount, region
    FROM   sales.orders
    WHERE  amount > 15" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)

$ awsemu athena get-query-execution --query-execution-id $EXEC \
    --query 'QueryExecution.Status.State' --output text
SUCCEEDED

# Real Parquet files were written under the partitioned prefix.
$ awsemu s3 ls s3://lake/high_value/ --recursive
2026-05-20 19:01:42      482 high_value/region=eu/data_0.parquet
2026-05-20 19:01:42      491 high_value/region=us/data_0.parquet

# And the new table is registered in the Glue catalog with partition keys.
$ awsemu glue get-table --database-name sales --name high_value \
    --query 'Table.{Location:StorageDescriptor.Location,Parts:PartitionKeys[].Name}'
{"Location": "s3://lake/high_value/", "Parts": ["region"]}

After CTAS, awsemu glue get-table returns the registered table and subsequent Athena queries can SELECT FROM sales.high_value like any catalog-registered dataset.

INSERT INTO

dml_handler.execute_insert at dml_handler.py:275-324 appends new data to an existing table:

Terminal
$ awsemu athena start-query-execution \
    --query-string "INSERT INTO sales.orders VALUES (4, 99.0, 'eu')" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text >/dev/null

# A new UUID-named Parquet object lands under the existing prefix.
$ awsemu s3 ls s3://lake/orders/ --recursive
2026-05-20 19:02:11      491 orders/region=us/orders.parquet
2026-05-20 19:02:18      398 orders/region=eu/<uuid>.parquet

# A follow-up SELECT sees the new row through Glue + DuckDB.
$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT COUNT(*) FROM sales.orders WHERE region='eu'" \
    --result-configuration "OutputLocation=s3://lake/results/" \
    --query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows[1].Data[0].VarCharValue'
2

Result storage and pagination

Configuration

VariableDefaultPurpose
ATHENA_BACKENDduckdbduckdb enables the real query engine. off / none / disabled / moto fall back to moto and return empty Rows[] on every query. Read at provider.py:49.
ATHENA_S3_ENDPOINT_HOSTlocalhostHostname used by DuckDB's httpfs extension to reach LocalEmu's S3. The port comes from GATEWAY_LISTEN[0]. Override when running inside Docker (host.docker.internal). Read at engine.py:44.
Terminal
# If you do not need real query execution, fall back to the moto path
# which keeps the control plane intact and returns empty Rows[].
$ ATHENA_BACKEND=off localemu start

$ EXEC=$(awsemu athena start-query-execution \
    --query-string "SELECT 1" \
    --result-configuration "OutputLocation=s3://x/r/" \
    --query QueryExecutionId --output text)
$ awsemu athena get-query-results --query-execution-id $EXEC \
    --query 'ResultSet.Rows'
[]

Integration points

ServiceHow it touches Athena
S3Reads source data via DuckDB httpfs. Writes results via boto3 + the LocalEmu S3 endpoint. CTAS writes Parquet/JSON files to the configured external_location.
Glue CatalogReads tables from moto's glue_backends; CTAS writes a new table back through glue_backend.create_table(), auto-creating the database when missing. This is the only Glue mutation in LocalEmu that is driven by another service.
CloudFormationAWS::Athena::WorkGroup, AWS::Athena::NamedQuery resource providers wire through the moto control plane.

Known limitations