Chirag

Reputation: 31

Convert PySpark data frame to JSON with each column as a key

I'm working with PySpark. I have a data frame that I need to dump as a JSON file, but the JSON file should have the following format, for example:

{"Column 1": [9202, 9202, 9202, ....], "Column 2": ["FEMALE", "No matching concept", "MALE", ....]}

So there should be one key for each column, and the corresponding value should be a list of all the values in that column.

I tried converting it to a Pandas data frame and then to a dict before dumping it as JSON, and that worked, but as the data volume is very large, I want to do it directly on the PySpark data frame.
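For reference, a minimal sketch of the Pandas route described above (assuming the data fits in driver memory; the output path is hypothetical):

import json

# Pull the whole data frame to the driver (only viable for small data)
pdf = df.toPandas()

# Series.tolist() yields plain Python values that json can serialize
data_dict = {c: pdf[c].tolist() for c in pdf.columns}

with open("output.json", "w") as f:  # hypothetical output path
    json.dump(data_dict, f)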

Upvotes: 1

Views: 1050

Answers (2)

Ghost

Reputation: 520

# Collect the rows to the driver once, instead of calling
# df.collect() and df.count() on every loop iteration
rows = df.collect()
columns = df.columns

# Build one list of values per column
L = []
for j in range(len(columns)):
    arr = [row[j] for row in rows]
    L.append(arr)

data_dict = dict(zip(columns, L))
print(data_dict)
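Note that this still collects the entire data frame to the driver, so it hits the same memory limits as the Pandas approach mentioned in the question.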

Upvotes: 0

blackbishop

Reputation: 32640

One way is to collect each column's values as an array before you write to JSON. Try this:

from pyspark.sql.functions import collect_list

column_arrays = [collect_list(c).alias(c) for c in df.columns]
df2 = df.groupBy().agg(*column_arrays)

df2.coalesce(1).write.mode("overwrite").json("/path")
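As a quick sanity check, this is roughly what the aggregation produces on a toy frame built from the sample values in the question (the order of values inside each array is not guaranteed):

from pyspark.sql import SparkSession
from pyspark.sql.functions import collect_list

spark = SparkSession.builder.getOrCreate()

# Toy frame using the sample values from the question
df = spark.createDataFrame(
    [(9202, "FEMALE"), (9202, "No matching concept"), (9202, "MALE")],
    ["Column 1", "Column 2"],
)

df2 = df.groupBy().agg(*[collect_list(c).alias(c) for c in df.columns])
print(df2.toJSON().first())
# {"Column 1":[9202,9202,9202],"Column 2":["FEMALE","No matching concept","MALE"]}

One caveat: collect_list drops nulls, so columns containing null values can end up with arrays of different lengths.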

Upvotes: 3
