Reputation: 43
I have a CSV file in which one of the fields contains a map, formatted like this: "Map(12345 -> 45678, 23465 -> 9876)"
When I load the CSV into a DataFrame, that field is read as a string, so I wrote the following UDF to convert the string to a map:
val convertToMap = udf((pMap: String) => {
val mpp = pMap
// "Map(12345 -> 45678, 23465 -> 9876)"
val stg = mpp.substr(4, mpp.length() - 1)
val stg1 = stg.split(regex=",").toList
val mp=stg1.map(_.split(regex=" ").toList)
val mp1 = mp.map(mp =>
(mp(0), mp(2))).toMap
} )
Now I need help applying the UDF to the column that is being read as a string, and returning a DataFrame with the converted column.
Upvotes: 0
Views: 1585
Reputation: 2495
You are pretty close, but it looks like your UDF has some mix of Scala and Python, and the parsing logic needs a little work. There may be a better way to parse a map literal string, but this works with the provided example:
val convertToMap = udf { (pMap: String) =>
  // Strip the leading "Map(" and the trailing ")"
  val stg = pMap.substring(4, pMap.length() - 1)
  // Split into "key -> value" entries, trimming the space after each comma
  val stg1 = stg.split(",").toList.map(_.trim)
  // Each entry becomes List(key, "->", value)
  val mp = stg1.map(_.split(" ").toList)
  mp.map(mp => (mp(0), mp(2))).toMap
}
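One sketch of such an alternative, under the same assumptions about the input format: splitting each entry on the full " -> " token avoids indexing into a whitespace split, and collect quietly skips any malformed entry.

import org.apache.spark.sql.functions.udf

val convertToMapAlt = udf { (pMap: String) =>
  pMap
    .stripPrefix("Map(")                      // drop the leading "Map("
    .stripSuffix(")")                         // drop the trailing ")"
    .split(",")                               // one element per "key -> value" entry
    .map(_.trim.split(" -> "))                // split each entry on the arrow token
    .collect { case Array(k, v) => k -> v }   // keep only well-formed pairs
    .toMap
}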
// A one-column test DataFrame holding the example string
val df = spark.createDataset(Seq("Map(12345 -> 45678, 23465 -> 9876)")).toDF("strMap")
With the corrected UDF, you simply invoke it with a .select() or a .withColumn():
df.select(convertToMap($"strMap").as("map")).show(false)
Which gives:
+----------------------------------+
|map |
+----------------------------------+
|Map(12345 -> 45678, 23465 -> 9876)|
+----------------------------------+
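The .withColumn() form keeps the original strMap column alongside the parsed one:
df.withColumn("map", convertToMap($"strMap")).show(false)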
With the schema:
root
|-- map: map (nullable = true)
| |-- key: string
| |-- value: string (valueContainsNull = true)
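If you actually want integer keys and values rather than strings (an assumption; the question doesn't say what type they should be), the UDF can call .toInt on each piece and return a Map[Int, Int], and Spark will infer a map of integers instead:

val convertToIntMap = udf { (pMap: String) =>
  pMap
    .stripPrefix("Map(")
    .stripSuffix(")")
    .split(",")
    .map(_.trim.split(" -> "))
    .collect { case Array(k, v) => k.toInt -> v.toInt }  // assumes every key and value is numeric
    .toMap
}
df.select(convertToIntMap($"strMap").as("map")).printSchema()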
Upvotes: 1