Background
If you use Amazon Athena to query data stored in Amazon S3 (for example, log files, or large datasets used for analysis), you may find yourself wanting to version-control table definitions using AWS CloudFormation. Since Athena uses the AWS Glue data catalog underneath, you can create Glue tables to populate your Athena databases.
(You should note that, with a bit of effort, you can likely adapt this information to e.g. Terraform's Glue table resource.)
Simple example
Using the CloudFront Logs example table, here's a resource definition suitable for a CloudFormation YAML template:
CloudfrontLogsTable:
Type: "AWS::Glue::Table"
Properties:
CatalogId: !Ref "AWS::AccountId"
DatabaseName: default
TableInput:
Name: cloudfront_logs
TableType: EXTERNAL_TABLE
StorageDescriptor:
Location: "s3://athena-examples-us-east-1/cloudfront/plaintext/"
StoredAsSubDirectories: true
InputFormat: org.apache.hadoop.mapred.TextInputFormat
OutputFormat: IgnoreKeyTextOutputFormat
SerdeInfo:
SerializationLibrary: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
Parameters:
field.delim: "\t"
serialization.format: "\t"
Columns:
- Name: Date
Type: date
- Name: Time
Type: string
- Name: Location
Type: string
- Name: Bytes
Type: int
- Name: RequestIP
Type: string
- Name: Method
Type: string
- Name: Host
Type: string
- Name: Uri
Type: string
- Name: Status
Type: int
- Name: Referrer
Type: string
- Name: ClientInfo
Type: string
Defining partition projection
Partition projection is supported here too. For example, on a CloudTrail logs table with partition projection:
CloudtrailLogsTable:
Properties:
TableInput:
Parameters:
projection.enabled: "true"
projection.timestamp.type: "date"
projection.timestamp.format: "yyyy/MM/dd"
projection.timestamp.interval: "1"
projection.timestamp.interval.unit: "DAYS"
projection.awsregion.type: "enum"
projection.awsregion.values: "us-east-1,us-east-2"
storage.location.template: !Sub "s3://my-cloudtrail-logs/AWSLogs/${AWS::AccountId}/CloudTrail/${!awsregion}/${!timestamp}"
PartitionKeys:
- Name: awsregion
Type: string
- Name: timestamp
Type: string