Reputation: 141
Below data in csv file with delimeter |
, I want to convert string to Map for PersonalInfo
column data so that I can extract required information.
I try to convert below csv to parquet format with String
to Map
using Cast I got datatype mismatch error.
Below is data for your ref. Your help is much appreciated.
Empcode EmpName PersonalInfo
1 abc """email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"""
2 xyz """email"":""[email protected]"",""Location"":""US"""
3 pqr """email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""
Thanks
Upvotes: 1
Views: 6662
Reputation: 32640
One simple way is to use str_to_map
function after you get rid of the double quotes from PersonalInfo
column:
val df1 = df.withColumn(
"PersonalInfo",
expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)
df1.show(false)
//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo |
//+-------+-------+------------------------------------------------------------------------------+
//|1 |abc |{email -> [email protected], Location -> India, Gender -> Male} |
//|2 |xyz |{email -> [email protected], Location -> US} |
//|3 |pqr |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+
Upvotes: 2
Reputation: 5068
If you want to create a map from PersonalInfo
column, from Spark 3.0 you can proceed as follows:
"",""
using split
function"":""
using split
function""
from elements of sub-arrays using regexp_replace
functionstruct
functionmap_from_entries
to build map from your array of entriesComplete code is as follows:
import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}
val result = data.withColumn("PersonalInfo",
map_from_entries(
transform(
split(col("PersonalInfo"), "\"\",\"\""),
item => struct(
regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
)
)
)
)
With the following input_dataframe
:
+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+---------------------------------------------------------------------------------------------+
|1 |abc |""email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"" |
|2 |xyz |""email"":""[email protected]"",""Location"":""US"" |
|3 |pqr |""email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+
You get the following result
dataframe:
+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo |
+-------+-------+------------------------------------------------------------------------------+
|1 |abc |{email -> [email protected], Location -> India, Gender -> Male} |
|2 |xyz |{email -> [email protected], Location -> US} |
|3 |pqr |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+
Upvotes: 1