Reputation: 83
I am trying to convert a DataFrame to an RDD in order to explode the map (key-value pairs) into different rows.
Info = sqlContext.read.format("csv") \
    .option("delimiter", "\t") \
    .option("header", "True") \
    .option("inferSchema", "True") \
    .load("file.tsv")
DataFrame[ID: int, Date: timestamp, Comments: string]
The sample data in the DF is as follows.
ID Date Comments
1 2015-04-30 22:42:49.0 {44:'xxxxxxxx'}
2 2015-05-06 08:53:18.0 {83:'aaaaaaaaa', 175:'bbbbbbbbb', 86:'cccccccccc'}
3 2015-05-13 19:57:13.0 {487:'yyyyyyyyyyy', 48:'zzzzzzzzzzzzzz'}
The comments are already key-value pairs, but they are read in as a single string. I want to explode each key-value pair into a separate row. For example:
Expected OUTPUT
ID Date Comments
1 2015-04-30 22:42:49.0 {44:'xxxxxxxx'}
2 2015-05-06 08:53:18.0 {83:'aaaaaaaaa'}
2 2015-05-06 08:53:18.0 {175:'bbbbbbbbb'}
2 2015-05-06 08:53:18.0 {86:'cccccccccc'}
3 2015-05-13 19:57:13.0 {487:'yyyyyyyyyyy'}
3 2015-05-13 19:57:13.0 {48:'zzzzzzzzzzzzzz'}
I have tried to convert it to an RDD and apply flatMap, but without success. I want all columns to be returned. I have tried this:
Info.rdd.flatMap(lambda x: (x['SearchParams'].split(':'), x))
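For reference, the parsing that such a flatMap would have to do can be sketched in plain Python, outside Spark. The function below is a hypothetical helper that assumes rows shaped like the sample data above (ID, Date, Comments string):

```python
# A plain-Python sketch (no Spark required) of the flatMap logic:
# parse the Comments string and emit one row per key-value pair.
def explode_comments(row):
    id_, date, comments = row
    # strip the outer braces, then split into individual key:value pairs
    for pair in comments.strip("{}").split(", "):
        key, value = pair.split(":", 1)
        yield (id_, date, "{%s:%s}" % (key, value))

sample = (2, "2015-05-06 08:53:18.0",
          "{83:'aaaaaaaaa', 175:'bbbbbbbbb', 86:'cccccccccc'}")
rows = list(explode_comments(sample))
# rows now holds three tuples, one per key-value pair
```

In Spark this would be passed as `Info.rdd.flatMap(explode_comments)` and converted back to a DataFrame afterwards.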
Upvotes: 5
Views: 7880
Reputation: 28392
Use the split and explode functions from the DataFrame API to split the data on ", ". To create the map, use create_map. This function expects two separate columns as input. Below is an example where two temporary columns are created (again using split):
from pyspark.sql.functions import col, create_map, explode, split

Info.withColumn("Comments", explode(split(col("Comments"), ", "))) \
    .withColumn("key", split(col("Comments"), ":").getItem(0)) \
    .withColumn("value", split(col("Comments"), ":").getItem(1)) \
    .withColumn("Comments", create_map(col("key"), col("value")))
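To see what each step computes, the same transformations can be traced in plain Python on one sample Comments value (one extra assumption here: the outer braces are stripped first for a clean result, which in Spark would need an additional step, e.g. regexp_replace):

```python
# Plain-Python trace of the DataFrame steps on one sample Comments value.
comment = "{83:'aaaaaaaaa', 175:'bbbbbbbbb', 86:'cccccccccc'}"

# split(..., ", ") followed by explode: one fragment per future row
# (outer braces stripped here for a clean result)
fragments = comment.strip("{}").split(", ")

# split(..., ":").getItem(0) / .getItem(1): the key and value columns
pairs = [f.split(":", 1) for f in fragments]

# create_map(col("key"), col("value")): a single-entry map per row
maps = [{key: value} for key, value in pairs]
```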
It should be possible to make this shorter, like this (not tested):

Info.withColumn("Comments", split(explode(split(col("Comments"), ", ")), ":")) \
    .withColumn("Comments", create_map(col("Comments").getItem(0), col("Comments").getItem(1)))

Note that some Spark versions may reject a generator such as explode when it is nested inside another expression, in which case the longer version above is needed.
Upvotes: 2