Sergey Ivanov

Reputation: 3929

Write struct columns to parquet with pyarrow

I have the following dataframe and schema:

import pandas as pd
import pyarrow as pa
df = pd.DataFrame([[1, 2, 3], [4, 5, 6], [7, 8, 9]], columns=['a', 'b', 'c'])
SCHEMA = pa.schema([("a_and_b", pa.struct([('a', pa.int64()), ('b', pa.int64())])), ('c', pa.int64())])

Then I want to create a pyarrow table from df and save it to parquet with this schema. However, I could not find a way to create a proper type in pandas that would correspond to a struct type in pyarrow. Is there a way to do this?

Upvotes: 0

Views: 405

Answers (1)

0x26res

Reputation: 13952

To convert a pandas column to pa.struct, you can store each row as a tuple whose values follow the struct's field order (e.g. [(1, 2), (4, 5), (7, 8)]):

df_with_tuples = pd.DataFrame({
    # One (a, b) tuple per row; list() materializes the zip iterator explicitly
    "a_and_b": list(zip(df["a"], df["b"])),
    "c": df["c"],
})
pa.Table.from_pandas(df_with_tuples, schema=SCHEMA)

or as a dict keyed by field name (e.g. [{'a': 1, 'b': 2}, {'a': 4, 'b': 5}, {'a': 7, 'b': 8}]):

df_with_dict = pd.DataFrame({
    "a_and_b": df.apply(lambda x: {"a": x["a"], "b": x["b"]}, axis=1),
    "c": df["c"],
})
pa.Table.from_pandas(df_with_dict, schema=SCHEMA)

When converting back from Arrow to pandas, structs are represented as dicts:

pa.Table.from_pandas(df_with_dict, schema=SCHEMA).to_pandas()['a_and_b']
| a_and_b          |
|:-----------------|
| {'a': 1, 'b': 2} |
| {'a': 4, 'b': 5} |
| {'a': 7, 'b': 8} |

Upvotes: 2
