Reputation: 51
I am creating a parquet file from a CSV file using the following python script:
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq
csv_file = '~/Desktop/SWA_UK_Pickup_Forecast_HOURLY_M1_at_2017-11-28-04_20_21-UTC_from_28-Nov-2017_to_28-Nov-2017.csv'
chunksize = 10
csv_stream = pd.read_csv(csv_file, sep=',', chunksize=chunksize, low_memory=False)
for i, chunk in enumerate(csv_stream):
if i == 0:
parquet_schema = pa.Table.from_pandas(df=chunk).schema
parquet_writer = pq.ParquetWriter('/Users/kohujwal/Desktop/parquet_forecast_file.parquet', parquet_schema, compression='snappy')
table = pa.Table.from_pandas(chunk, schema=parquet_schema);
parquet_writer.write_table(table)
parquet_writer.close();
Then I am manually uploading it to some s3 location, and creating a table over it in Athena. The query used to create the table:
CREATE EXTERNAL TABLE IF NOT EXISTS forecast_report_lom_parquet (
`forecast_week` int,
`for_date` string,
`forecast_day_of_week` string,
`merchant_id` string,
`shipper_account_id` string,
`shipper_name` string,
`node_id` string,
`routing_preference` string,
`forecast_shipment_count` int,
`forecast_pallet_count` int,
`forecast_volume` double )
STORED AS PARQUET
LOCATION 's3://prevoir-athena-input-prod/spike-computation/forecast-reports/parquet/live-order-model/'
tblproperties ("parquet.compress"="SNAPPY")
The table is created successfully, however when I try to query the table, it doesn't display the contents. It just shows the indexes of the rows and the column names. Here is the CSV and the parquet file. The query on the parquet table :
SELECT * FROM prevoir_prod.forecast_report_lom_parquet
Results that I am seeing on the console.
Can anyone point out what's going wrong here?
Upvotes: 1
Views: 3759
Reputation: 28738
Your results show row numbers. This means that Athena is finding the file(s) on S3, and is parsing them to the point of identifying rows.
It seems that Athena (or more precisely, the ParquetSerDe) isn't able to get columns from your file. This points to a mismatch between the CREATE EXTERNAL TABLE
statement and the actual file. Some possibilities:
Whenever I have a file that I can't create a table across, I'll use the Glue crawler to inspect and build the create DDL. Works very well, it's straightforward, and saves a bunch of trial and error.
Upvotes: 1