Defender — Objective 6: Use Athena¶
What is Athena?¶
Amazon Athena is a serverless, interactive query service that lets you run standard SQL directly against data sitting in S3 — with no database, no ETL, no data loading. You point it at an S3 prefix, define the schema, and run SQL. Athena reads the raw files on demand.
How it works:

1. Data already lives in S3 (JSON, CSV, Parquet, ORC, etc.)
2. You define a table in Athena specifying the S3 location and schema
3. Athena uses that schema to parse and query the raw files
4. You run SQL — Athena reads only what's needed and charges per TB scanned
Why use Athena for CloudTrail?
CloudTrail logs are gzipped JSON files scattered across S3 directories. Normally you'd download them, decompress, and parse with jq — manual, slow, and limited to what's already downloaded. Athena lets you query the logs directly in place with SQL — no download, no script, no database to manage. For large orgs with months of logs across dozens of accounts and regions, Athena is the only practical option.
Cost: Fractions of a cent for small datasets like this one. For large production use, partitioning (covered below) can reduce the cost of date-bounded queries by 99%.
What is AWS Glue?¶
Athena doesn't store table definitions itself — it uses the AWS Glue Data Catalog.
AWS Glue has two parts:

- Data Catalog — a managed metadata store that holds table definitions, schemas, and S3 location mappings. This is what Athena uses.
- ETL service — jobs that transform and move data. Not relevant here.
When you run CREATE TABLE in Athena's query editor, you're actually writing a table definition into Glue's Data Catalog. When you run a query, Athena reads the definition from Glue, then reads the matching S3 files.
Architecture:
```
Raw data files (.json.gz) in S3
        ↑
Glue Data Catalog (table definition: schema + S3 location)
        ↑
Athena SQL engine (parses S3 files using the Glue schema)
        ↑
You (writing SQL queries)
```
Glue is the "schema registry." Athena is the "query engine." S3 is the "storage layer." All three together = serverless data lake.
Step 1 — Create the Database¶
In the Athena query editor:
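The statement itself (the database name `flaws2` is the one this walkthrough switches to in the next step):

```sql
CREATE DATABASE flaws2;
```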
Creates a logical namespace in Glue's Data Catalog. Equivalent to a database in MySQL/Postgres — just a container for tables. No physical storage is created. Tables you create will live under this database.
Switch to the flaws2 database in the Athena UI before running the next query.
Step 2 — Create the Table¶
This defines the schema that maps CloudTrail's JSON structure to SQL columns:
```sql
CREATE EXTERNAL TABLE `cloudtrail`(
  `eventversion` string,
  `useridentity` struct<
    type:string,
    principalid:string,
    arn:string,
    accountid:string,
    invokedby:string,
    accesskeyid:string,
    username:string,
    sessioncontext:struct<
      attributes:struct<
        mfaauthenticated:string,
        creationdate:string
      >,
      sessionissuer:struct<
        type:string,
        principalid:string,
        arn:string,
        accountid:string,
        username:string
      >
    >
  >,
  `eventtime` string,
  `eventsource` string,
  `eventname` string,
  `awsregion` string,
  `sourceipaddress` string,
  `useragent` string,
  `errorcode` string,
  `errormessage` string,
  `requestparameters` string,
  `responseelements` string,
  `requestid` string,
  `eventid` string,
  `resources` array<struct<arn:string,accountid:string,type:string>>,
  `eventtype` string,
  `readonly` string,
  `recipientaccountid` string
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://flaws2-logs/AWSLogs/653711331788/CloudTrail';
```
Breaking every piece down:
CREATE EXTERNAL TABLE¶
EXTERNAL = the data lives in S3, outside Athena/Glue. Athena doesn't own the data. Dropping this table only removes the definition from Glue — it does not delete the S3 files. Safe to drop and recreate without touching the underlying logs.
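Because the table is external, dropping it is non-destructive:

```sql
-- Removes only the table definition from the Glue Data Catalog;
-- the .json.gz files in S3 are left untouched
DROP TABLE cloudtrail;
```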
Column types: string, struct, array¶
| Type | Used for |
|---|---|
| `string` | Simple fields — `eventtime`, `eventname`, `sourceipaddress`, etc. |
| `struct<field:type, ...>` | Nested JSON objects — `useridentity` contains sub-fields |
| `array<struct<...>>` | JSON arrays of objects — `resources` is a list of resource objects |
Nested structs: CloudTrail's userIdentity is a JSON object containing sessionContext which contains sessionIssuer which contains more fields. You represent this as:
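Abbreviated to just the nesting (the full definition appears in the CREATE TABLE above):

```sql
`useridentity` struct<
  type:string,
  sessioncontext:struct<
    sessionissuer:struct<
      username:string
    >
  >
>
```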
This is how Athena maps deeply nested JSON to queryable columns.
ROW FORMAT SERDE¶
SerDe = Serializer/Deserializer. Tells Athena how to parse each "row" from the raw file.
CloudTrail logs have two quirks the standard JSON SerDe can't handle:
1. Events are wrapped inside a Records array — not one event per line
2. Files are gzip-compressed
The CloudTrail SerDe (com.amazon.emr.hive.serde.CloudTrailSerde) handles both automatically — it unwraps the Records array so each element becomes one row, and handles the gzip transparently.
STORED AS INPUTFORMAT ... OUTPUTFORMAT¶
Hive-style file format specification. Always use these exact values for CloudTrail:
- INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat' — tells Athena how to split and read the files
- OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat' — how results are written back (needed for the table definition even though you won't write)
These class names are inherited from Apache Hive, the original Hadoop SQL layer — Athena's DDL and table formats are Hive-compatible, though its query engine is based on Presto. Just treat them as boilerplate for CloudTrail tables.
LOCATION¶
The S3 prefix where the log files live. Athena reads all files under this prefix, recursively. So all the date-partitioned subdirectories (us-east-1/2018/11/28/) are included automatically.
Step 3 — Query the Data¶
Simple query — what events happened:
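A minimal version — the exact column list isn't reproduced here, but `eventtime` and `eventname` match the two-column description that follows:

```sql
SELECT eventtime, eventname
FROM cloudtrail
ORDER BY eventtime;
```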
Standard SQL. Returns two columns from all records. Athena reads every file under the LOCATION prefix.
Aggregation — count by event type:
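One way to write it (the `event_count` alias is illustrative):

```sql
SELECT eventname, count(*) AS event_count
FROM cloudtrail
GROUP BY eventname
ORDER BY event_count DESC;
```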
The best first query in any investigation: it shows what happened most often. High GetObject counts mean lots of S3 reads; BatchGetImage means someone pulled a container image; ListBuckets suggests recon.
Filter by attacker IP:
```sql
SELECT eventtime, eventname, useridentity.arn
FROM cloudtrail
WHERE sourceipaddress = '104.102.221.250'
ORDER BY eventtime;
```
Reconstructs exactly what the attacker did and when.
Filter by identity type — find all assumed roles:
```sql
SELECT eventtime, sourceipaddress, eventname, useridentity.arn
FROM cloudtrail
WHERE useridentity.type = 'AssumedRole'
ORDER BY eventtime;
```
This focuses on role-based activity, filtering out anonymous browser traffic and AWSService events.
Find failed/denied API calls:
```sql
SELECT eventtime, eventname, useridentity.arn, errorcode, errormessage
FROM cloudtrail
WHERE errorcode IS NOT NULL
ORDER BY eventtime;
```
AccessDenied errors can indicate an attacker testing what they can do with stolen credentials.
Accessing nested fields in SQL: CloudTrail's useridentity is a struct. Access nested fields with dot notation in Athena: useridentity.arn, useridentity.sessioncontext.sessionissuer.username. Same dot-notation as jq, just inside SQL.
Partitioning (Critical for Production)¶
This exercise uses a tiny dataset — scanning everything costs almost nothing. In production, CloudTrail generates gigabytes per day across multiple accounts and regions. Athena charges per TB scanned. Without partitioning, every query reads your entire log history.
What partitioning does: Tells Athena to only read files from specific S3 paths based on WHERE clause values. You define partition columns that map to the S3 directory structure:
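A sketch of what that looks like for this table (partition column names are illustrative; in production you would typically automate partition registration or use partition projection rather than adding partitions by hand):

```sql
-- Added to the CREATE EXTERNAL TABLE statement, before ROW FORMAT:
PARTITIONED BY (region string, year string, month string, day string)

-- Each partition is then registered, mapping column values to an S3 path:
ALTER TABLE cloudtrail ADD PARTITION (region='us-east-1', year='2018', month='11', day='28')
LOCATION 's3://flaws2-logs/AWSLogs/653711331788/CloudTrail/us-east-1/2018/11/28/';
```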
With partitions defined, a query filtering WHERE year='2024' AND month='01' AND day='15' only reads that one day's files — ignoring the rest of your history.
```sql
-- Without partitions: scans all data ever
SELECT * FROM cloudtrail WHERE eventtime LIKE '2024-01-15%';

-- With partitions: Athena only reads 2024/01/15 files from S3
SELECT * FROM cloudtrail WHERE year='2024' AND month='01' AND day='15';
```
For a large account with a year of logs, partitioning can reduce query cost by 99% for date-bounded investigations.
Reference: Partitioning CloudTrail Logs in Athena — Alex Smolen
jq vs Athena — When to Use Each¶
| Scenario | Use |
|---|---|
| Quick investigation, already have logs downloaded | jq — no setup, immediate results |
| Logs still in S3, large volumes, multiple days | Athena — query in place, no download needed |
| Sharing queries with team | Athena — queries are saveable and reproducible |
| Feeding results into dashboards (QuickSight, Grafana) | Athena — direct integration |
| Learning and ad-hoc exploration | jq — faster iteration cycle |
| Multi-account investigation (logs from 20 accounts in one bucket) | Athena — SQL across the whole dataset at once |
Vulnerability Summary for the Full Attack¶
What the defender track teaches:
Every step of the attack from the Attacker track left evidence in CloudTrail. The investigator's job is to find the anomalies, trace them back, and understand the full chain.
Reconstructed attack chain:
1. Attacker browsed the site — ANONYMOUS_PRINCIPAL GetObject events
2. Attacker sent malformed input to the Lambda API endpoint — Lambda crashed and dumped env vars including level1 credentials
3. Attacker used level1 credentials to pull the level2 ECR image — ListImages, BatchGetImage, GetDownloadUrlForLayer events
4. Attacker ran docker history — found the htpasswd password in a layer
5. Attacker logged into the ECS web app with those credentials
6. Attacker used SSRF to hit the ECS metadata endpoint at 169.254.170.2 — read level3 temporary credentials
7. Attacker used level3 credentials from their own laptop — ListBuckets event with non-AWS IP and macOS user-agent
Signals that should trigger alerts:
- Any ECS/Lambda role used from a non-AWS IP address
- aws-cli user agent on a role assigned to a container (containers run Linux and use SDKs, not the CLI)
- Same IP appearing in both ANONYMOUS_PRINCIPAL events and AssumedRole events within a short window
- Broad recon calls (ListBuckets, DescribeInstances) from roles designed for specific applications
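The first signal can be turned into a hunting query. A sketch — the filter assumes AWS-internal calls appear with a `*.amazonaws.com` DNS name in `sourceipaddress`, which is how CloudTrail records service-originated activity, so anything left is a role being used from an external address:

```sql
SELECT eventtime, sourceipaddress, useragent, eventname, useridentity.arn
FROM cloudtrail
WHERE useridentity.type = 'AssumedRole'
  AND sourceipaddress NOT LIKE '%.amazonaws.com'
ORDER BY eventtime;
```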
Fixes:
1. Don't dump env vars in Lambda error responses — use structured logging to CloudWatch
2. Never set Principal: "*" on ECR repos — restrict to specific accounts/roles
3. Enable IMDSv2 and restrict container metadata endpoint access
4. Apply least privilege — an ECS web app doesn't need s3:ListBuckets
5. Enable GuardDuty — it automates credential exfiltration detection
6. Aggregate CloudTrail to a separate Security account — attackers can't tamper with logs they can't access