Reputation: 697
I am trying to parse the data contained in an XML file using R. I would like the data to be read into a dataframe. The data is located at the following URL:
http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml
I have reviewed some similar posts on stackoverflow such as How to convert xml data to data frame in R but I still cannot seem to get them to work. One issue I ran into is that some of the values are empty and then appear like this:
<Cell Textbox73="filler" />
The general key value pair is structured as follows:
<m1_attribute_group>
<s3 attribute1="Trade 2/">
<s4>
<Cell cell_value1="372.02"/>
</s4>
</s3>
</m1_attribute_group>
Is there any way to parse this file quickly with xmlToDataFrame() or xmlAttrsToDataFrame() or do I have to build custom functions to apply over the list of nodes?
Upvotes: 1
Views: 428
Reputation: 107587
First, you need to know exactly what you want extracted. So, dissection is required. This USDA XML output is quite involved across commodities, year/months, and regions with changing nodes for attributes and text/cell values.
Consider an XSLT transformation. As information, XSLT is a special-purpose declarative language that transforms XML files into any nuances for diverse purposes. While other programming languages like R can parse various elements, values, and attributes from XML files, XSLT is specially designed to re-format whole documents to specific needs. Often considered, the forgotten language in the web data age, it can be very useful to transform complex xml data.
For your purposes here is a simple XSL stylesheet to extract Output, Total Supply, Trade, Total Use, and Ending Stocks from various commodities (two value-of
declarations are made due to the changing xpaths of the USDA data):
XSLT STYLESHEET
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>
<xsl:template match="/">
<xsl:element name="data">
<xsl:for-each select="//m1_commodity_group/m1_year_group_Collection/m1_year_group
/m1_month_group_Collection/m1_month_group/m1_attribute_group_Collection">
<xsl:element name="m1_attribute_group">
<xsl:element name="Output">
<xsl:value-of select="m1_attribute_group/s3[@attribute1='Output']/s4/Cell/@cell_value1"/>
<xsl:value-of select="m1_attribute_group_Collection/m1_attribute_group[@attribute1='Output']/Textbox12/Cell/@cell_value1"/>
</xsl:element>
<xsl:element name="TotalSupply">
<xsl:value-of select="m1_attribute_group/s3[@attribute1='Total Supply']/s4/Cell/@cell_value1"/>
<xsl:value-of select="m1_attribute_group[@attribute1='Total Supply']/Textbox12/Cell/@cell_value1"/>
</xsl:element>
<xsl:element name="Trade2">
<xsl:value-of select="m1_attribute_group/s3[@attribute1='Trade 2/']/s4/Cell/@cell_value1"/>
<xsl:value-of select="m1_attribute_group[@attribute1='Trade 2/']/Textbox12/Cell/@cell_value1"/>
</xsl:element>
<xsl:element name="TotalUse3">
<xsl:value-of select="m1_attribute_group/s3[@attribute1='Total Use 3/']/s4/Cell/@cell_value1"/>
<xsl:value-of select="m1_attribute_group[@attribute1='Total Use 3/']/Textbox12/Cell/@cell_value1"/>
</xsl:element>
<xsl:element name="EndingStocks">
<xsl:value-of select="m1_attribute_group/s3[@attribute1='Ending Stocks']/s4/Cell/@cell_value1"/>
<xsl:value-of select="m1_attribute_group[@attribute1='Ending Stocks']/Textbox12/Cell/@cell_value1"/>
</xsl:element>
</xsl:element>
</xsl:for-each>
</xsl:element>
</xsl:template>
</xsl:stylesheet>
OUTPUT
<data>
<m1_attribute_group>
<Commodity>Total Grains 4/</Commodity>
<MarketYear>2013/14</MarketYear>
<Output>2474.27</Output>
<TotalSupply>2930.14</TotalSupply>
<Trade2>372.02</Trade2>
<TotalUse3>2417.87</TotalUse3>
<EndingStocks>512.27</EndingStocks>
</m1_attribute_group>
...
</data>
Nearly all programming languages (i.e., Java, C#, Python, PHP) have XSLT transformation libraries, including R with the Sxlst package. However, below is an R script interfacing with VBA calling the Windows MSXML COM object. It requires the RDCOMClient library, an easy installation with RStudio. This is only a Windows PC solution.
R SCRIPT
library(XML)
library(RDCOMClient)
# CREATE INSTANCES OF MICROSOFT XML OBJECTS
xmlfile = COMCreate("MSXML2.DOMDocument")
xslfile = COMCreate("MSXML2.DOMDocument")
newxmlfile = COMCreate("MSXML2.DOMDocument")
xmlstr = "http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml"
xslstr = "C:/Path/To/XSLFile.xsl"
newxmlstr = "C:/Path/To/TransformedXMLFile.xsl"
# LOAD XML & XSLT FILES
xmlfile.async = FALSE
xmlfile$Load(xmlstr)
xslfile.async = FALSE
xslfile$Load(xslstr)
# TRANSFORM XML FILE USING XLST INTO NEW FILE
xmlfile$transformNodeToObject(xslfile, newxmlfile)
newxmlfile$Save(newxmlstr)
# CONVERT TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:/Path/To/TransformedXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//m1_attribute_group"))
# UNINITIALIZE MSXML OBJECTS
xmlfile <- NULL
xslfile <- NULL
newxmlfile <- NULL
Above stylesheet renders the following data frame:
DATA FRAME
Commodity MarketYear Output TotalSupply Trade2 TotalUse3 EndingStocks
1 Total Grains 4/ 2013/14 2474.27 2930.14 372.02 2417.87 512.27
2 Total Grains 4/ 2014/15 (Est.) 2492.63 3004.90 375.54 2464.63 540.27
3 Total Grains 4/ 2015/16 (Proj.) 2479.30 3010.96 360.65 2489.78 521.18
4 Total Grains 4/ 2015/16 (Proj.) 2477.26 3017.53 362.61 2485.14 532.39
5 Wheat 2013/14 715.11 892.24 165.92 698.72 193.52
6 Wheat 2014/15 (Est.) 725.92 919.44 163.94 707.38 212.06
7 Wheat 2015/16 (Proj.) 721.55 921.96 158.41 719.56 202.40
8 Wheat 2015/16 (Proj.) 721.96 934.01 158.07 714.20 219.81
9 Coarse Grains 5/ 2013/14 1280.98 1449.07 164.39 1237.67 211.40
10 Coarse Grains 5/ 2014/15 (Est.) 1290.43 1501.83 168.58 1272.56 229.27
11 Coarse Grains 5/ 2015/16 (Proj.) 1276.00 1508.58 159.74 1281.23 227.35
12 Coarse Grains 5/ 2015/16 (Proj.) 1274.97 1504.24 162.30 1282.17 222.07
13 Rice, milled 2013/14 478.18 588.83 41.72 481.48 107.36
14 Rice, milled 2014/15 (Est.) 476.28 583.64 43.02 484.69 98.95
15 Rice, milled 2015/16 (Proj.) 481.74 580.43 42.51 488.99 91.44
16 Rice, milled 2015/16 (Proj.) 480.34 579.28 42.24 488.78 90.51
Upvotes: 4