Parse XML column and store in dataframe

I m struggling to decode a parsing logic into a dataframe, where there is a XML data within a JSON object. I have read the JSON object successfully and stored in a dataframe like shown below, it contains a col Guest_data which is XML:

Country	Guest_data
Romania	xml 1
Hungary	xml 2
Ukraine	xml 3

I was also able to separately read the XML file with xpath and explode functions and store the result in a separate dataframe

XML FORMAT 1

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>

XML FORMAT 2

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>

XML format 3

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>

What I desire to achieve is below

Country	id	age	sex
Romania	AA11	68	F
Romania	BB22	34	M
Romania	CC33	23	M
Hungary	FF77	27	F
Hungary	YY99	32	M
Ukraine	DD55	68	F
Ukraine	LL99	34	M
Ukraine	SS77	47	M
Ukraine	TT00	30	M

I wish to prepare a dataframe with the data above, so I can do an average age of the country person and run some more SQL queries.

Upvotes: 1

Answers (1)

ZygD

Reputation: 24406

Assuming that all the visitors always have XML attributes (id=, age= and sex=), you can make use of Spark SQL's xpath. It can be done using this one dataframe, there's no need to create separate ones.

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Romania', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>'),
     ('Hungary', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>'),
     ('Ukraine', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>')],
    ['Country', 'Guest_data']
)
id = F.expr('xpath(Guest_data, "//@id")')
sex = F.expr('xpath(Guest_data, "//@sex")')
age = F.expr('xpath(Guest_data, "//@age")')
df = df.withColumn('exploded', F.explode(F.arrays_zip(id, sex, age)))
df = df.select(
    'Country',
    F.col('exploded.0').alias('id'),
    F.col('exploded.1').alias('sex'),
    F.col('exploded.2').alias('age'),
)
df.show()
# +-------+----+---+---+
# |Country|  id|sex|age|
# +-------+----+---+---+
# |Romania|AA11|  F| 68|
# |Romania|BB22|  M| 34|
# |Romania|CC33|  M| 23|
# |Hungary|FF77|  F| 27|
# |Hungary|YY99|  M| 32|
# |Ukraine|DD55|  F| 68|
# |Ukraine|LL99|  M| 34|
# |Ukraine|SS77|  M| 47|
# |Ukraine|TT00|  M| 30|
# +-------+----+---+---+

If any of your visitors don't have any attribute (e.g. some visitor did not tell his age), you would not get any error and the results would be incorrect (shifted), so you should take additional measures to make sure that the attributes will always be present.

Upvotes: 2

Parse XML column and store in dataframe

Answers (1)

Related Questions