Reputation: 43
I m struggling to decode a parsing logic into a dataframe, where there is a XML data within a JSON object. I have read the JSON object successfully and stored in a dataframe like shown below, it contains a col Guest_data which is XML:
Country | Guest_data |
---|---|
Romania | xml 1 |
Hungary | xml 2 |
Ukraine | xml 3 |
I was also able to separately read the XML file with xpath
and explode
functions and store the result in a separate dataframe
XML FORMAT 1
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>
XML FORMAT 2
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>
XML format 3
<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>
What I desire to achieve is below
Country | id | age | sex |
---|---|---|---|
Romania | AA11 | 68 | F |
Romania | BB22 | 34 | M |
Romania | CC33 | 23 | M |
Hungary | FF77 | 27 | F |
Hungary | YY99 | 32 | M |
Ukraine | DD55 | 68 | F |
Ukraine | LL99 | 34 | M |
Ukraine | SS77 | 47 | M |
Ukraine | TT00 | 30 | M |
I wish to prepare a dataframe with the data above, so I can do an average age of the country person and run some more SQL queries.
Upvotes: 1
Views: 1477
Reputation: 24406
Assuming that all the visitors always have XML attributes (id=
, age=
and sex=
), you can make use of Spark SQL's xpath
. It can be done using this one dataframe, there's no need to create separate ones.
Full example:
from pyspark.sql import functions as F
df = spark.createDataFrame(
[('Romania', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="AA11" age="68" sex="F" /> <visitor id="BB22" age="34" sex="M" /> <visitor id="CC33" age="23" sex="M" /> </visitors>'),
('Hungary', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="FF77" age="27" sex="F" /> <visitor id="YY99" age="32" sex="M" /> </visitors>'),
('Ukraine', '<?xml version="1.0" encoding="utf-8"?> <visitors> <visitor id="DD55" age="68" sex="F" /> <visitor id="LL99" age="34" sex="M" /> <visitor id="SS77" age="47" sex="M" /> <visitor id="TT00" age="30" sex="M" /> </visitors>')],
['Country', 'Guest_data']
)
id = F.expr('xpath(Guest_data, "//@id")')
sex = F.expr('xpath(Guest_data, "//@sex")')
age = F.expr('xpath(Guest_data, "//@age")')
df = df.withColumn('exploded', F.explode(F.arrays_zip(id, sex, age)))
df = df.select(
'Country',
F.col('exploded.0').alias('id'),
F.col('exploded.1').alias('sex'),
F.col('exploded.2').alias('age'),
)
df.show()
# +-------+----+---+---+
# |Country| id|sex|age|
# +-------+----+---+---+
# |Romania|AA11| F| 68|
# |Romania|BB22| M| 34|
# |Romania|CC33| M| 23|
# |Hungary|FF77| F| 27|
# |Hungary|YY99| M| 32|
# |Ukraine|DD55| F| 68|
# |Ukraine|LL99| M| 34|
# |Ukraine|SS77| M| 47|
# |Ukraine|TT00| M| 30|
# +-------+----+---+---+
If any of your visitors don't have any attribute (e.g. some visitor did not tell his age), you would not get any error and the results would be incorrect (shifted), so you should take additional measures to make sure that the attributes will always be present.
Upvotes: 2