Wayne Y

Reputation: 53

Writing PySpark DataFrame with MapType Schema to Parquet Format

I have searched on the web and on here, but unable to find any resolution to an issue I'm facing.

First, I'm using PySpark. I have data as a DataFrame that I would like to write out as parquet. The schema is dictated by something like this:

df_schema = StructType([StructField('p_id', StringType(), True),
                        StructField('c_id_map', MapType(StringType(), StringType(), True), True),
                        StructField('d_id', LongType(), True)])

My data does have these columns and the c_id_map is a Python dictionary that has a key that is either 'e_id' or 'r_id' and a value that is a string (some identifier).

I write the data using something like:

df = sqlContext.createDataFrame(hour_filtered_rdd, df_schema)
df.write.mode('overwrite').parquet(output_path)

The parquet file is written out, however when I use parquet-tools to view the contents I see that the c_id_map is always empty (i.e. nothing is printed out from the cat command), like:

c_id_map:

I verified that the data exists in the dictionary prior to writing. All other data types (strings and longs) are written out correctly. As a workaround, I'm storing the map data as a JSON string instead, but I would like to understand what is going wrong.
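The JSON-string workaround mentioned above can be done on the Python side with the standard `json` module before the rows reach Spark. A minimal sketch, where the key/value shapes mirror the question but the helper function name is hypothetical:

```python
import json

def serialize_c_id_map(c_id_map):
    # Hypothetical helper: turn the {'e_id'/'r_id': <identifier>} dict
    # into a JSON string so it can be stored in a plain StringType column.
    return json.dumps(c_id_map, sort_keys=True)

print(serialize_c_id_map({'e_id': '6710c982'}))
# → {"e_id": "6710c982"}
```

On read, `json.loads` (or `pyspark.sql.functions.from_json` with a `MapType` schema) recovers the original dictionary.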

Any ideas on this? Or, is the issue with the parquet-tools not being able to display map data?

Upvotes: 0

Views: 2810

Answers (1)

Wayne Y

Reputation: 53

Not sure how I missed this, but it turns out there was no problem at all. The data is present in the parquet file, and it is displayed correctly by the parquet-tools utility.

The output from the parquet-tools cat command looks like:

c_id_map:
.key_value:
..key = e_id
..value = 6710c982

Upvotes: 0
