AndyD
AndyD

Reputation: 43

Parse XML column and store in dataframe

I m struggling to decode a parsing logic into a dataframe, where there is a XML data within a JSON object. I have read the JSON object successfully and stored in a dataframe like shown below, it contains a col Guest_data which is XML:

Country Guest_data
Romania xml 1
Hungary xml 2
Ukraine xml 3

I was also able to separately read the XML file with xpath and explode functions and store the result in a separate dataframe

XML FORMAT 1

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>

XML FORMAT 2

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>

XML format 3

<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>

What I desire to achieve is below

Country id age sex
Romania AA11 68 F
Romania BB22 34 M
Romania CC33 23 M
Hungary FF77 27 F
Hungary YY99 32 M
Ukraine DD55 68 F
Ukraine LL99 34 M
Ukraine SS77 47 M
Ukraine TT00 30 M

I wish to prepare a dataframe with the data above, so I can do an average age of the country person and run some more SQL queries.

Upvotes: 1

Views: 1477

Answers (1)

ZygD
ZygD

Reputation: 24406

Assuming that all the visitors always have XML attributes (id=, age= and sex=), you can make use of Spark SQL's xpath. It can be done using this one dataframe, there's no need to create separate ones.

Full example:

from pyspark.sql import functions as F
df = spark.createDataFrame(
    [('Romania', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>'),
     ('Hungary', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>'),
     ('Ukraine', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>')],
    ['Country', 'Guest_data']
)
id = F.expr('xpath(Guest_data, "//@id")')
sex = F.expr('xpath(Guest_data, "//@sex")')
age = F.expr('xpath(Guest_data, "//@age")')
df = df.withColumn('exploded', F.explode(F.arrays_zip(id, sex, age)))
df = df.select(
    'Country',
    F.col('exploded.0').alias('id'),
    F.col('exploded.1').alias('sex'),
    F.col('exploded.2').alias('age'),
)
df.show()
# +-------+----+---+---+
# |Country|  id|sex|age|
# +-------+----+---+---+
# |Romania|AA11|  F| 68|
# |Romania|BB22|  M| 34|
# |Romania|CC33|  M| 23|
# |Hungary|FF77|  F| 27|
# |Hungary|YY99|  M| 32|
# |Ukraine|DD55|  F| 68|
# |Ukraine|LL99|  M| 34|
# |Ukraine|SS77|  M| 47|
# |Ukraine|TT00|  M| 30|
# +-------+----+---+---+

If any of your visitors don't have any attribute (e.g. some visitor did not tell his age), you would not get any error and the results would be incorrect (shifted), so you should take additional measures to make sure that the attributes will always be present.

Upvotes: 2

Related Questions