Reputation: 2923
I have a situation where the keys of the dictionaries in my RDD differ: some dictionaries have more keys, or different keys, than others.
Because of this I am unable to use toDF() to convert them directly. Does anyone have a better idea?
from pyspark.sql import Row

list1 = [{'this': 'bah', 'is': 'bah'},
         {'this': 'true', 'is': 'false'},
         {'this': 'true', 'is': 'false', 'testing': 'bah'}]
rdd = sc.parallelize(list1)
rdd.map(lambda x: Row(**x)).toDF().show()
Upvotes: 0
Views: 235
Reputation: 4452
I don't think there is an out-of-the-box solution for that.
At first glance, what I would do is build a set() of all the keys in the collection, then iterate over each row to create any non-existent keys and initialize them to None:
list1 = [{'this': 'bah', 'is': 'bah'},
         {'this': 'true', 'is': 'false'},
         {'this': 'true', 'is': 'false', 'testing': 'bah'}]
# build the set of all keys that appear in any row
keys = set().union(*(item.keys() for item in list1))
for item in list1:
    # find which keys are missing from the current row
    difference = [k for k in keys if k not in item]
    # create them, initialized to None
    for k in difference:
        item[k] = None
After that, every row in your collection has the same set of keys:
[{'this': 'bah', 'is': 'bah', 'testing': None}, {'this': 'true', 'is': 'false', 'testing': None}, {'this': 'true', 'is': 'false', 'testing': 'bah'}]
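The same idea can also be written as a small helper that returns new dictionaries instead of modifying the originals. This is only a sketch: pad_rows and its fill parameter are names made up for illustration, and it runs on a plain Python list, before anything is handed to Spark.

```python
def pad_rows(rows, fill=None):
    """Return new dicts that all share the same keys (hypothetical helper)."""
    # Union of every key seen in any row, sorted for a stable key order
    all_keys = sorted(set().union(*(row.keys() for row in rows)))
    # dict.get supplies `fill` for keys that a row does not have
    return [{key: row.get(key, fill) for key in all_keys} for row in rows]


list1 = [{'this': 'bah', 'is': 'bah'},
         {'this': 'true', 'is': 'false'},
         {'this': 'true', 'is': 'false', 'testing': 'bah'}]

# list1 itself is left untouched; padded holds the filled-in copies
padded = pad_rows(list1)
```

Building every row with the keys in the same sorted order also keeps the field order consistent across rows, which may matter when the dictionaries are later turned into Row objects.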
Upvotes: 1