alex-mon888

Reputation: 51

Columns to struct in parquet with pyarrow and/or pandas in Python

I hope some of you will find a little time to help a beginner like me. I have been working on this task for a full week and cannot find a solution. I understand, and am completely fine with the fact, that I have to learn each package I use, and their combinations, to find the right solution.

The full task is to combine 5 columns (1000 rows) into 1 struct column, then transpose that into a single row (1000 columns) and store it in parquet. But I am stuck on the first step: combining the 5 columns into 1 struct column.

Originally, I receive the following columns: columns=['date', 'bidopen', 'bidclose', 'bidhigh', 'bidlow', 'tickqty']. I do not need 'date' to be part of the struct.

What I tried:

import pyarrow as pa
import pyarrow.parquet as pq
import pandas as pd

Option 1 - dictionary with pandas

df = pd.read_csv('original.csv')
df2 = df.drop(columns=['date'])
df3 = df2.to_dict('records')

I cannot save the dict to csv or parquet with pandas - the following two commands convert it back to a pandas DataFrame and save the columns separately.

pd.DataFrame(df3).to_csv('test_parquet.csv', index=False)
pd.DataFrame(df3).to_parquet('test2.parquet')

If I could use the dictionary as a DataFrame, I would next use pandas.DataFrame.pivot to turn the rows into columns. Next, I tried to convert the dict to a pyarrow table (it seems I could potentially also save the entries as columns, in a single row).

table = pa.Table.from_pydict({'data', pa.array(df3)})

The line above raises an error for which I could not find a solution (TypeError: unhashable type: 'pyarrow.lib.StructArray'). The next step would have been to save the table to parquet with pyarrow.
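Looking at it again, the braces here form a set literal, not a dict, and a StructArray is unhashable, which would explain the TypeError. A minimal sketch of what from_pydict expects - a mapping from column name to array:

# Note the colon: {'data': ...} is a dict, {'data', ...} is a set.
# pa.array on a list of dicts infers a struct type automatically.
table = pa.Table.from_pydict({'data': pa.array(df3)})
print(table.schema)  # data: struct<bidopen: double, ..., tickqty: int64>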

Option 2 - to struct by pyarrow

Here I tried to work within parquet to change the schema (or to write into a new schema):

df = pd.read_csv('original.csv')
df = df.drop(columns=['date'])
df.to_parquet('test.parquet')
table = pq.read_table('test.parquet', columns=['bidopen', 'bidclose', 'bidhigh', 'bidlow', 'tickqty'])

Here I read the schema of the parquet file to see the DataType of each column. Below I set up the new schema:

struct = pa.struct([
    pa.field('bidopen', pa.float64()),
    pa.field('bidclose', pa.float64()),
    pa.field('bidhigh', pa.float64()),
    pa.field('bidlow', pa.float64()),
    pa.field('tickqty', pa.int64())
])
fields = [pa.field('data', pa.list_(struct))]
schema = pa.schema(fields)
writer = pq.ParquetWriter('test2.parquet', schema)
writer.write_table(table)
writer.close()

I got an error for which I also could not find a solution (ValueError: Table schema does not match schema used to create file: ...), as I thought it would save the data under the newly provided schema.
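As far as I can tell, ParquetWriter does not convert data to the declared schema; write_table requires the incoming table's schema to equal the one the writer was created with. A minimal sketch of that constraint, using hypothetical toy columns:

writer = pq.ParquetWriter('demo.parquet', pa.schema([('a', pa.int64())]))
writer.write_table(pa.table({'a': pa.array([1, 2])}))    # schemas match: OK
# writer.write_table(pa.table({'b': pa.array([1, 2])}))  # raises ValueError
writer.close()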

Option 3 - pyarrow cast

# (the code above is carried over from Option 2)
...
schema = pa.schema(fields)
table2 = table.cast(schema)
writer = pq.ParquetWriter('test2.parquet', schema)
writer.write_table(table2)
writer.close()

I got another error (ValueError: Target schema's field names are not matching the table's field names: ...). Here I said: c'mon, I am casting exactly because the schemas are not the same... That didn't help.
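Apparently cast converts column values one-to-one and matches the target fields against the table's existing columns, so it cannot pack five columns into a single struct column. A sketch of what a cast can do here, reusing the struct type from above:

# Same five field names, but tickqty widened from int64 to float64:
schema5 = pa.schema([pa.field(f.name, pa.float64()) for f in struct])
table3 = table.cast(schema5)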

Option 4 - another attempt to change the schema when loading from pandas into pyarrow, in order to save it to parquet later

arrays = [['data','data','data','data','data'],['bidopen', 'bidclose','bidhigh','bidlow','tickqty']]
tuples = list(zip(*arrays))
index = pd.MultiIndex.from_tuples(tuples)
df2 = pd.DataFrame(df.values[:, 1:], columns=index)
pa.Schema.from_pandas(df2)

Here I got an error (AttributeError: 'list' object has no attribute 'columns'), for which I also could not find a solution.

Option 5 - pyspark

This was the biggest oops for me, as I spent about 3 days "learning" it, since it should be able to do both the conversion to struct and the pivoting. BUT, I found out later that I cannot save data to parquet with pyspark on my Win10 machine without additional components: Hadoop and the Java SDK (which is not free to use). Therefore, I stopped developing it further.

Upvotes: 2

Views: 7712

Answers (1)

Pace

Reputation: 43787

For the first part of your question you can do this (note: StructArray.from_arrays expects plain arrays, so you need to flatten the chunked arrays):

fields, arrs = [], []
for column_index in range(table.num_columns):
    fields.append(table.field(column_index))
    # flatten() returns a list of chunked arrays; take the first (for a
    # non-nested column, the only) one, then its first chunk, since
    # from_arrays needs plain Arrays. This assumes one chunk per column.
    arrs.append(table.column(column_index).flatten()[0].chunks[0])
struct_array = pa.StructArray.from_arrays(arrs, fields=fields)
print(struct_array)
print(struct_array.to_pylist())

Sample output:

-- is_valid: all not null
-- child 0 type: double
  [
    1.1,
    2.2
  ]
-- child 1 type: double
  [
    3.3,
    4.4
  ]
-- child 2 type: double
  [
    5.5,
    6.6
  ]
-- child 3 type: double
  [
    7.7,
    8.8
  ]
-- child 4 type: int64
  [
    9,
    10
  ]
[{'bidopen': 1.1, 'bidclose': 3.3, 'bidhigh': 5.5, 'bidlow': 7.7, 'tickqty': 9}, {'bidopen': 2.2, 'bidclose': 4.4, 'bidhigh': 6.6, 'bidlow': 8.8, 'tickqty': 10}]
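If a column ever spans more than one chunk, the .chunks[0] line above silently drops the later chunks. A variant that concatenates chunks first (a sketch, assuming ChunkedArray.combine_chunks, which returns a single contiguous array):

fields = [table.field(i) for i in range(table.num_columns)]
arrs = [table.column(i).combine_chunks() for i in range(table.num_columns)]
struct_array = pa.StructArray.from_arrays(arrs, fields=fields)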

I don't think pyarrow can transpose, if that is what you are asking in the second part of your question. You could use pandas to do the transpose, but it would be another copy.

df = pa.Table.from_arrays([struct_array], ['data']).to_pandas()
print(df.transpose())

Sample output:

                                                      0  \
data  {'bidopen': 1.1, 'bidclose': 3.3, 'bidhigh': 5...   

                                                      1  
data  {'bidopen': 2.2, 'bidclose': 4.4, 'bidhigh': 6...  


The output in this case will always be a single-row table with N columns, where each cell is a struct.
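If the goal is simply to persist the struct column itself, the one-column table can also be written straight to parquet, since parquet handles nested structs. A minimal sketch (the file name is arbitrary):

struct_table = pa.Table.from_arrays([struct_array], names=['data'])
pq.write_table(struct_table, 'structs.parquet')
print(pq.read_table('structs.parquet').schema)  # data: struct<...>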

Upvotes: 2
