Sankalp Tomar
Sankalp Tomar

Reputation: 83

How to convert string colon-separated column to MapType?

I am trying to convert a Dataframe to RDD in order to explode the map (with key-value pair) into different row.

Info = sqlContext.read.format("csv"). \
option("delimiter","\t"). \
option("header", "True"). \
option("inferSchema", "True"). \
load("file.tsv")

DataFrame[ID: int, Date: timestamp, Comments: string]

The sample data in the DF is as follows.

ID  Date                    Comments    
1   2015-04-30 22:42:49.0   {44:'xxxxxxxx'}
2   2015-05-06 08:53:18.0   {83:'aaaaaaaaa', 175:'bbbbbbbbb', 86:'cccccccccc'}
3   2015-05-13 19:57:13.0   {487:'yyyyyyyyyyy', 48:'zzzzzzzzzzzzzz'}

Now, the comments are already in key-value pairs but it is read as a string, I want to explode each key-value pair into a different row. For e.g.

Expected OUTPUT
ID  Date                     Comments   
1   2015-04-30 22:42:49.0    {44:'xxxxxxxx'}
2   2015-05-06 08:53:18.0    {83:'aaaaaaaaa'}
2   2015-05-06 08:53:18.0    {175:'bbbbbbbbb'}
2   2015-05-06 08:53:18.0    {86:'cccccccccc'}
3   2015-05-13 19:57:13.0    {487:'yyyyyyyyyyy'}
3   2015-05-13 19:57:13.0    {48:'zzzzzzzzzzzzzz'}

I have tried to convert it to a RDD and apply flatMap but to no success. I want all columns to be returned. I have tried this:

Info.rdd.flatMap(lambda x: (x['SearchParams'].split(':'), x))

Upvotes: 5

Views: 7880

Answers (1)

Shaido
Shaido

Reputation: 28392

Use the provided split and explode functions in the DataFrame API to split the data on ",". To create the map, you want to use create_map. This function expects two separate columns as input. Here below is an example were two temporary columns are created (again using split):

Info.withColumn("Comments", explode(split(col("Comments"), ", ")))
  .withColumn("key", split(col("Comments"), ":").getItem(0))
  .withColumn("value", split(col("Comments"), ":").getItem(1))
  .withColumn("Comments", create_map(col("key"), col("value")))

It should be possible to make this shorter like this (not tested):

Info.withColumn("Comments", split(explode(split(col("Comments), ", ")), ":")
  .withColumn("Comments", create_map(col("Comments".getItem(0)), col("Comments").getItem(1)))

Upvotes: 2

Related Questions