Jatin
Jatin

Reputation: 113

Split Complex String in PySpark Dataframe Column

I have a PySpark dataframe column comprised of multiple addresses. The format is as below:

id       addresses
1       [{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]

I want to transform it as below:

id city state street postalCode country
1 null null 123, ABC St, ABC Square 11111 USA
1 Dallas TX 456, DEF Plaza, Test St 99999 USA

Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs) so want to do this in an efficient way.

I tried splitting the address string on comma however since there are commas within the addresses as well, the output is not as expected. I guess I need to use a regular expression pattern with the braces but not sure how. Moreover, how do I go about denormalizing the data?

Upvotes: 1

Views: 130

Answers (1)

wwnde
wwnde

Reputation: 26676

#Data

from pyspark.sql.functions import *
df =spark.createDataFrame([(1,'{"city":"New York","state":"NY","street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}')],
                         ('id','addresses'))
df.show(truncate=False)

#pass the string column to rdd to extracr schema
rdd=df.select(col("addresses").alias("jsoncol")).rdd.map(lambda x: x.jsoncol)
newschema =spark.read.json(rdd).schema

##Apply schema to string column reading using from_schema
df3=df.select("*",from_json("addresses", newschema).alias("test_col"))#Assign schema to column using select

df3.select('id','test_col.*').show()

+---+--------+-------+----------+-----+------------------------+
|id |city    |country|postalCode|state|street                  |
+---+--------+-------+----------+-----+------------------------+
|1  |New York|USA    |11111     |NY   |123, ABC St, ABC  Square|
+---+--------+-------+----------+-----+------------------------+

Upvotes: 1

Related Questions