Brahmaiah Takkellapati
Brahmaiah Takkellapati

Reputation: 141

Convert String to Map in Spark

Below data in csv file with delimeter |, I want to convert string to Map for PersonalInfo column data so that I can extract required information.

I try to convert below csv to parquet format with String to Map using Cast I got datatype mismatch error.

Below is data for your ref. Your help is much appreciated.

Empcode EmpName PersonalInfo
1       abc     """email"":""[email protected]"",""Location"":""India"",""Gender"":""Male"""
2       xyz     """email"":""[email protected]"",""Location"":""US"""
3       pqr     """email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234"""

Thanks

Upvotes: 1

Views: 6662

Answers (2)

blackbishop
blackbishop

Reputation: 32640

One simple way is to use str_to_map function after you get rid of the double quotes from PersonalInfo column:

val df1 = df.withColumn(
  "PersonalInfo",
  expr("str_to_map(regexp_replace(PersonalInfo, '\"', ''))")
)

df1.show(false)

//+-------+-------+------------------------------------------------------------------------------+
//|Empcode|EmpName|PersonalInfo                                                                  |
//+-------+-------+------------------------------------------------------------------------------+
//|1      |abc    |{email -> [email protected], Location -> India, Gender -> Male}                   |
//|2      |xyz    |{email -> [email protected], Location -> US}                                      |
//|3      |pqr    |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
//+-------+-------+------------------------------------------------------------------------------+

Upvotes: 2

Vincent Doba
Vincent Doba

Reputation: 5068

If you want to create a map from PersonalInfo column, from Spark 3.0 you can proceed as follows:

  • Split your string according to "","" using split function
  • For each element of your obtained string array, create sub-arrays according to "":"" using split function
  • Remove all "" from elements of sub-arrays using regexp_replace function
  • Build map entries using struct function
  • Use map_from_entries to build map from your array of entries

Complete code is as follows:

import org.apache.spark.sql.functions.{col, map_from_entries, regexp_replace, split, struct, transform}

val result = data.withColumn("PersonalInfo",
  map_from_entries(
    transform(
      split(col("PersonalInfo"), "\"\",\"\""),
      item => struct(
        regexp_replace(split(item, "\"\":\"\"")(0), "\"\"", ""),
        regexp_replace(split(item, "\"\":\"\"")(1), "\"\"", "")
      )
    )
  )
)

With the following input_dataframe:

+-------+-------+---------------------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                                 |
+-------+-------+---------------------------------------------------------------------------------------------+
|1      |abc    |""email"":""[email protected]"",""Location"":""India"",""Gender"":""Male""                       |
|2      |xyz    |""email"":""[email protected]"",""Location"":""US""                                              |
|3      |pqr    |""email"":""[email protected]"",""Gender"":""Female"",""Location"":""Europe"",""Mobile"":""1234""|
+-------+-------+---------------------------------------------------------------------------------------------+

You get the following result dataframe:

+-------+-------+------------------------------------------------------------------------------+
|Empcode|EmpName|PersonalInfo                                                                  |
+-------+-------+------------------------------------------------------------------------------+
|1      |abc    |{email -> [email protected], Location -> India, Gender -> Male}                   |
|2      |xyz    |{email -> [email protected], Location -> US}                                      |
|3      |pqr    |{email -> [email protected], Gender -> Female, Location -> Europe, Mobile -> 1234}|
+-------+-------+------------------------------------------------------------------------------+

Upvotes: 1

Related Questions