klib
klib

Reputation: 697

Parsing XML in R data stored in attributes with blanks

I am trying to parse the data contained in an XML file using R. I would like the data to be read into a dataframe. The data is located at the following URL:

http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml

I have reviewed some similar posts on stackoverflow such as How to convert xml data to data frame in R but I still cannot seem to get them to work. One issue I ran into is that some of the values are empty and then appear like this:

<Cell Textbox73="filler" /> 

The general key value pair is structured as follows:

<m1_attribute_group>
   <s3 attribute1="Trade 2/">
       <s4>
           <Cell cell_value1="372.02"/>
       </s4>
   </s3>
</m1_attribute_group>

Is there any way to parse this file quickly with xmlToDataFrame() or xmlAttrsToDataFrame() or do I have to build custom functions to apply over the list of nodes?

Upvotes: 1

Views: 428

Answers (1)

Parfait
Parfait

Reputation: 107587

First, you need to know exactly what you want extracted. So, dissection is required. This USDA XML output is quite involved across commodities, year/months, and regions with changing nodes for attributes and text/cell values.

Consider an XSLT transformation. As information, XSLT is a special-purpose declarative language that transforms XML files into any nuances for diverse purposes. While other programming languages like R can parse various elements, values, and attributes from XML files, XSLT is specially designed to re-format whole documents to specific needs. Often considered, the forgotten language in the web data age, it can be very useful to transform complex xml data.

For your purposes here is a simple XSL stylesheet to extract Output, Total Supply, Trade, Total Use, and Ending Stocks from various commodities (two value-of declarations are made due to the changing xpaths of the USDA data):

XSLT STYLESHEET

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:strip-space elements="*"/>

<xsl:template match="/">
  <xsl:element name="data">
     <xsl:for-each select="//m1_commodity_group/m1_year_group_Collection/m1_year_group
                            /m1_month_group_Collection/m1_month_group/m1_attribute_group_Collection">
      <xsl:element name="m1_attribute_group">

      <xsl:element name="Output">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Output']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group_Collection/m1_attribute_group[@attribute1='Output']/Textbox12/Cell/@cell_value1"/>  
      </xsl:element>      

      <xsl:element name="TotalSupply">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Total Supply']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Total&#13;&#10;Supply']/Textbox12/Cell/@cell_value1"/>
      </xsl:element>      

      <xsl:element name="Trade2">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Trade 2/']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Trade 2/']/Textbox12/Cell/@cell_value1"/>      
      </xsl:element>      

      <xsl:element name="TotalUse3">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Total&#13;&#10;Use 3/']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Total&#13;&#10;Use 3/']/Textbox12/Cell/@cell_value1"/>
      </xsl:element>

      <xsl:element name="EndingStocks">          
        <xsl:value-of select="m1_attribute_group/s3[@attribute1='Ending&#13;&#10;Stocks']/s4/Cell/@cell_value1"/>
        <xsl:value-of select="m1_attribute_group[@attribute1='Ending&#13;&#10;Stocks']/Textbox12/Cell/@cell_value1"/> 
      </xsl:element>

      </xsl:element>
    </xsl:for-each>
  </xsl:element>
</xsl:template>

</xsl:stylesheet>

OUTPUT

<data>
  <m1_attribute_group>
    <Commodity>Total Grains 4/</Commodity>
    <MarketYear>2013/14</MarketYear>
    <Output>2474.27</Output>
    <TotalSupply>2930.14</TotalSupply>
    <Trade2>372.02</Trade2>
    <TotalUse3>2417.87</TotalUse3>
    <EndingStocks>512.27</EndingStocks>
  </m1_attribute_group>
  ...
</data>

Nearly all programming languages (i.e., Java, C#, Python, PHP) have XSLT transformation libraries, including R with the Sxlst package. However, below is an R script interfacing with VBA calling the Windows MSXML COM object. It requires the RDCOMClient library, an easy installation with RStudio. This is only a Windows PC solution.

R SCRIPT

library(XML)
library(RDCOMClient)

# CREATE INSTANCES OF MICROSOFT XML OBJECTS 
xmlfile = COMCreate("MSXML2.DOMDocument")
xslfile = COMCreate("MSXML2.DOMDocument")
newxmlfile = COMCreate("MSXML2.DOMDocument")

xmlstr = "http://www.usda.gov/oce/commodity/wasde/report_format/latest-July-2015-New-Format.xml"
xslstr = "C:/Path/To/XSLFile.xsl"
newxmlstr = "C:/Path/To/TransformedXMLFile.xsl"

# LOAD XML & XSLT FILES
xmlfile.async = FALSE
xmlfile$Load(xmlstr)

xslfile.async = FALSE
xslfile$Load(xslstr)

# TRANSFORM XML FILE USING XLST INTO NEW FILE
xmlfile$transformNodeToObject(xslfile, newxmlfile)
newxmlfile$Save(newxmlstr)

# CONVERT TRANSFORMED XML INTO R DATA FRAME
doc<-xmlParse("C:/Path/To/TransformedXMLFile.xml")
xmldf <- xmlToDataFrame(nodes = getNodeSet(doc, "//m1_attribute_group"))

# UNINITIALIZE MSXML OBJECTS 
xmlfile <- NULL 
xslfile <- NULL
newxmlfile <- NULL

Above stylesheet renders the following data frame:

DATA FRAME

    Commodity   MarketYear  Output  TotalSupply Trade2  TotalUse3   EndingStocks
1   Total Grains 4/ 2013/14 2474.27 2930.14 372.02  2417.87 512.27
2   Total Grains 4/ 2014/15 (Est.)  2492.63 3004.90 375.54  2464.63 540.27
3   Total Grains 4/ 2015/16 (Proj.)     2479.30 3010.96 360.65  2489.78 521.18
4   Total Grains 4/ 2015/16 (Proj.)     2477.26 3017.53 362.61  2485.14 532.39
5   Wheat   2013/14 715.11  892.24  165.92  698.72  193.52
6   Wheat   2014/15 (Est.)  725.92  919.44  163.94  707.38  212.06
7   Wheat   2015/16 (Proj.)     721.55  921.96  158.41  719.56  202.40
8   Wheat   2015/16 (Proj.)     721.96  934.01  158.07  714.20  219.81
9   Coarse Grains 5/    2013/14 1280.98 1449.07 164.39  1237.67 211.40
10  Coarse Grains 5/    2014/15 (Est.)  1290.43 1501.83 168.58  1272.56 229.27
11  Coarse Grains 5/    2015/16 (Proj.)     1276.00 1508.58 159.74  1281.23 227.35
12  Coarse Grains 5/    2015/16 (Proj.)     1274.97 1504.24 162.30  1282.17 222.07
13  Rice, milled    2013/14 478.18  588.83  41.72   481.48  107.36
14  Rice, milled    2014/15 (Est.)  476.28  583.64  43.02   484.69  98.95
15  Rice, milled    2015/16 (Proj.)     481.74  580.43  42.51   488.99  91.44
16  Rice, milled    2015/16 (Proj.)     480.34  579.28  42.24   488.78  90.51

Upvotes: 4

Related Questions