Split Complex String in PySpark Dataframe Column

Question

I have a PySpark dataframe column comprised of multiple addresses. The format is as below:

id       addresses
1       [{"city":null,"state":null,"street":"123, ABC St, ABC  Square","postalCode":"11111","country":"USA"},{"city":"Dallas","state":"TX","street":"456, DEF Plaza, Test St","postalCode":"99999","country":"USA"}]

I want to transform it as below:

id	city	state	street	postalCode	country
1	null	null	123, ABC St, ABC Square	11111	USA
1	Dallas	TX	456, DEF Plaza, Test St	99999	USA

Any inputs on how to achieve this using PySpark? The dataset is huge (several TBs) so want to do this in an efficient way.

I tried splitting the address string on comma however since there are commas within the addresses as well, the output is not as expected. I guess I need to use a regular expression pattern with the braces but not sure how. Moreover, how do I go about denormalizing the data?

Split Complex String in PySpark Dataframe Column

Answers (1)

Related Questions