Jake

Reputation: 2923

Pyspark: convert rdd with different keys to spark dataframe

I have a situation where the dictionaries in my RDD have different keys, some rows having more (and different) keys than others.

Because of this I am unable to use toDF() to convert them directly. Does anyone have a better idea?

list1 = [{'this':'bah', 'is': 'bah'}, 
         {'this': 'true', 'is': 'false'}, 
         {'this': 'true', 'is': 'false', 'testing':'bah'}]

from pyspark.sql import Row

rdd = sc.parallelize(list1)
rdd.map(lambda x: Row(**x)).toDF().show()

Upvotes: 0

Views: 235

Answers (1)

TMichel

Reputation: 4452

I guess there is not an out-of-the-box solution for that.

At first glance, what I would do is build a set() of all the keys that appear anywhere in the collection, then iterate over each row and add any missing keys, initialized to None:

list1 = [{'this':'bah', 'is': 'bah'}, 
         {'this': 'true', 'is': 'false'}, 
         {'this': 'true', 'is': 'false', 'testing':'bah'}]

# create a list of unique available keys   
keys = set().union(*(item.keys() for item in list1))

for item in list1:
    # add the keys missing from the current row, initialized to None
    for key in keys:
        item.setdefault(key, None)

And then every row in your collection has the same set of keys:

[{'this': 'bah', 'is': 'bah', 'testing': None}, {'this': 'true', 'is': 'false', 'testing': None}, {'this': 'true', 'is': 'false', 'testing': 'bah'}]
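The same normalization can also be written as a comprehension, since `dict.get` already returns None for missing keys (a sketch; it builds new dicts rather than mutating the originals):

```python
list1 = [{'this': 'bah', 'is': 'bah'},
         {'this': 'true', 'is': 'false'},
         {'this': 'true', 'is': 'false', 'testing': 'bah'}]

# collect every key that appears in any row
keys = set().union(*list1)

# rebuild each row with the full key set; missing keys become None
normalized = [{k: d.get(k) for k in keys} for d in list1]
```

Once every row has the same keys, the `rdd.map(lambda x: Row(**x)).toDF()` call from the question should work as-is.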

Upvotes: 1
