regretoverflow
regretoverflow

Reputation: 2153

AWS Athena mis-interpreting timestamp column

I'm processing CSV files, outputting parquet files using Pandas in an AWS Lambda function, saving the data to an S3 bucket to query with Athena. The RAW input format to the Lambda function is CSV, with a unix timestamp in UTC that looks like:

Timestamp,DeviceName,DeviceUUID,SignalName,SignalValueRaw,SignalValueScaled,SignalType,Valid
1605074410110,F2016B1E.CAP.0 - 41840982B40192,323da038-bb49-4f3a-a045-925194364e5b,X.ALM.FLG,0,0,INTEGER,true

I parse the Timestamp like:

df['Timestamp'] = pd.to_datetime(df['Timestamp'], unit='ms')
df.head()

    Timestamp               DeviceName                      DeviceUUID                          SignalName  SignalValueRaw  SignalValueScaled   SignalType  SubstationId    StationBankId   FeederId    year    month   day hour    DeviceNameClean DeviceType
0   2020-11-11 06:00:10.110 F2016B2W.MLR.0 - 41841005000073 3c4839b1-ab99-4164-b415-4653948360ef    CVR_X_ENGAGED_A 0   0   BOOLEAN Kenton  FR2016B2    F2016B2W    2020    11  11  6   MLR.0 - 41841005000073  MLR

I process the data further in the Lambda function, then output a parquet file. I then run a Glue crawler against the parquet files that this script outputs, and in S3, can query the data fine:

2020-11-14T05:00:43.609Z,02703ee8-b08a-4c49-9581-706f905aa192,FR22607.REG.0,REG,REG.0,ROSS,FR22607,,0,0,0,0,0,0,0,0,,0.0,,,,0.0,,,,1.0,,

The glue crawler correctly identifies the column as timestamp:

CREATE EXTERNAL TABLE `cvr_event_log`(
  `timestamp` timestamp, 
  `deviceuuid` string, 
  `devicename` string, 
  `devicetype` string, 
...

But when I then query the table in Athena, I get this for the date:

"timestamp","deviceuuid","devicename","devicetype",
"+52840-11-19 16:56:55.000","0ca4ed37-930d-4778-b3a8-f49d9b498364","FR22606.REG.0","REG",

What has Athena so confused about the timestamp?

Upvotes: 0

Views: 1532

Answers (1)

Theo
Theo

Reputation: 132972

For a TIMESTAMP column to work in Athena you need to use a specific format, which unfortunately is not ISO 8601. It looks like this: "2020-11-14 20:33:42".

You can use from_iso8601_timestamp(ts) to parse ISO 8601 timestamps in queries.

Glue crawlers sadly misinterprets things quite often and creates tables that don't work properly with Athena.

Upvotes: 1

Related Questions