Ujwal Kohad

Reputation: 51

Creating table over written parquet file in Athena

I am creating a Parquet file from a CSV file using the following Python script:

import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

csv_file = '~/Desktop/SWA_UK_Pickup_Forecast_HOURLY_M1_at_2017-11-28-04_20_21-UTC_from_28-Nov-2017_to_28-Nov-2017.csv'

chunksize = 10

csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False)

for i, chunk in enumerate(csv_stream):
    if i == 0:
        # Infer the schema from the first chunk and open the writer once.
        parquet_schema = pa.Table.from_pandas(df=chunk).schema
        parquet_writer = pq.ParquetWriter('/Users/kohujwal/Desktop/parquet_forecast_file.parquet', parquet_schema, compression='snappy')
    # Write every chunk with the schema inferred from the first chunk.
    table = pa.Table.from_pandas(chunk, schema=parquet_schema)
    parquet_writer.write_table(table)

parquet_writer.close()

Then I upload it manually to an S3 location and create a table over it in Athena. The query used to create the table:

CREATE EXTERNAL TABLE IF NOT EXISTS forecast_report_lom_parquet (
  `forecast_week` int,
  `for_date` string,
  `forecast_day_of_week` string,
  `merchant_id` string,
  `shipper_account_id` string,
  `shipper_name` string,
  `node_id` string,
  `routing_preference` string,
  `forecast_shipment_count` int,
  `forecast_pallet_count` int,
  `forecast_volume` double )
STORED AS PARQUET
LOCATION 's3://prevoir-athena-input-prod/spike-computation/forecast-reports/parquet/live-order-model/'
TBLPROPERTIES ("parquet.compress"="SNAPPY")

The table is created successfully. However, when I query it, it doesn't display the contents: it shows only the row indexes and the column names. Here are the CSV and the Parquet file. The query on the Parquet table:

SELECT * FROM prevoir_prod.forecast_report_lom_parquet

These are the results I am seeing on the console.

Can anyone point out what's going wrong here?

Upvotes: 1

Views: 3759

Answers (1)

Kirk Broadhurst

Reputation: 28738

Your results show row numbers. This means that Athena is finding the file(s) on S3, and is parsing them to the point of identifying rows.

It seems that Athena (or more precisely, the ParquetSerDe) isn't able to get columns from your file. This points to a mismatch between the CREATE EXTERNAL TABLE statement and the actual file. Some possibilities:

  1. The column names or datatypes are wrong
  2. The file isn't compressed as expected (e.g. try without compression)

Whenever I have a file that I can't create a table over, I use the Glue crawler to inspect it and generate the CREATE DDL. It works very well, it's straightforward, and it saves a lot of trial and error.

Upvotes: 1
