Reputation: 43
I have a dataframe as shown below. I want to convert the last column Trandata from String Type to MapType. The output should look something similar I have shown in 2nd table.
I have written udf but it takes string and converts to Maptype, I'm struggle to get similar output with sql.row as input. :(
def stringToMap(value: String): Map[String, String] = {
var valMap = collection.mutable.Map[String, String]()
val values = value.split(",")
for (i <- values) {
valMap = valMap + (i.split("=")(0) -> i.split("=")(1))
}
return valMap
}
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID |CATEGORY|TRANDATA |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010 |A |threadID=123sada,ProcType=InfraLogging,TxnID=4mjx8wfogf
|03011 |A |threadID=xmjxe2j0jz,ProcType=InfraLogging,TxnID=4mjxe2j0tf
|09941 |D |compTxnID=xmawdew0tf,to=ABCD,threadID=4mjxe2j0jz,ProcType=InfraLogging
|00994 |D |compTxnID=xmjxe2j0tf,to=XYZA,threadID=34jxasde0jz,ProcType=InfraLogging
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
Table2: Expceted output - 3rd col to be MapType
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|MESSAGEID |CATEGORY|TRANDATA |
+--------------+--------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|03010 |A |Map(threadID -> 123sada,ProcType -> InfraLogging,TxnID -> 4mjx8wfogf)
Upvotes: 1
Views: 2214
Reputation: 14905
For Spark 2.4+ you can split the string into key value pairs, then use transform to separate keys and values into two array columns and then use map_from_arrays to create the final map.
df.withColumn("entry", split('TRANDATA, ","))
.withColumn("key", expr("transform(entry, x -> split(x, '=')[0])"))
.withColumn("value", expr("transform(entry, x -> split(x, '=')[1])"))
.withColumn("map", map_from_arrays('key, 'value))
.drop("entry", "key", "value", "TRANDATA")
.show(false)
Output:
+---------+--------+----------------------------------------------------------------------------------------+
|MESSAGEID|CATEGORY|map |
+---------+--------+----------------------------------------------------------------------------------------+
|03010 |A |[threadID -> 123sada, ProcType -> InfraLogging, TxnID -> 4mjx8wfogf] |
|03011 |A |[threadID -> xmjxe2j0jz, ProcType -> InfraLogging, TxnID -> 4mjxe2j0tf] |
|09941 |D |[compTxnID -> xmawdew0tf, to -> ABCD, threadID -> 4mjxe2j0jz, ProcType -> InfraLogging] |
|00994 |D |[compTxnID -> xmjxe2j0tf, to -> XYZA, threadID -> 34jxasde0jz, ProcType -> InfraLogging]|
+---------+--------+----------------------------------------------------------------------------------------+
Upvotes: 3