Parsing XML to a data frame

There is a database stored in an XML file on a Mexican government page that I'm trying to download to use for my analysis.

The page where you can find the data is this.

https://datos.gob.mx/busca/dataset/estaciones-de-servicio-gasolineras-y-precios-comerciales-de-gasolina-y-diesel

The direct download link is this; I think it points to some kind of external repository, but honestly I'm not sure.

https://publicacionexterna.azurewebsites.net/publicaciones/prices

If you click the link above, the database in XML format is downloaded automatically.

The database contains Mexican gas prices from retail sellers across the country, with their locations in decimal degrees.

I'm able to download the database, paste it into a Windows .xls file, convert that to a .csv file, and then upload it to my R environment for the analysis.

The general problem is that when I try to download directly from the page into my R environment, I can't obtain a structured data frame that allows me to perform the analysis.

I'm getting duplicate rows and cannot extract all of the attributes for each level of the data.

This is the script that I was able to write by myself, with help from searching the internet.

# CRE FILES

library(easypackages)

my_packages <- c("rlist","readr", "tidyverse", "lubridate", "stringr", 
"rebus", "stringi", "purrr", "geosphere", "XML", "RCurl", "plyr")

libraries(my_packages)

# Download link for the documents

link1 <- "https://publicacionexterna.azurewebsites.net/publicaciones/prices"

# First we load the XML file into the environment

data_prices <- getURL(link1)

xmlfile <- xmlParse(data_prices)

class(xmlfile)

xmltop <- xmlRoot(xmlfile)

base <- ldply(xmlToList(xmltop),data.frame)

The remaining problem is that I would like the date as another column, not as a row.

Upvotes: 0

Views: 2657

Answers (1)

Matt Jewett

Reputation: 3369

Something like this should get you a dataframe with all of the data in separate columns.

library(RCurl)
library(XML)

# Set link to website
link1 <- "https://publicacionexterna.azurewebsites.net/publicaciones/prices"

# Get data from webpage
data_prices <- getURL(link1)

# Parse XML data
xmlfile <- xmlParse(data_prices)

# Get place nodes
places <- getNodeSet(xmlfile, "//place")

# Get values for each place
values <- lapply(places, function(x){
  # Get current place id
  pid <- xmlAttrs(x)

  # Get values for each gas type for the current place
  newrows <- lapply(xmlChildren(x), function(y){
    # Get type and update time values
    attrs <- xmlAttrs(y)

    # Get price value
    price <- xmlValue(y)
    names(price) <- "price"

    # Return values
    return(c(pid, attrs, price))
  })

  # Combine rows into a single matrix
  newrows <- do.call(rbind, newrows)

  # Return rows
  return(newrows)
})

# Combine all values into a single dataframe
df <- as.data.frame(do.call(rbind, values), stringsAsFactors = FALSE)

# Reset row names for dataframe
row.names(df) <- c(1:nrow(df))
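To sanity-check the node-walking logic above without hitting the network, here is a self-contained sketch of the same approach on a toy XML string shaped like the feed (the `place_id` and `type` attribute names are illustrative assumptions, not the real schema; use whatever `xmlAttrs()` actually returns for your download). It also converts the extracted price from character to numeric, which the code above leaves as text:

```r
library(XML)

# Toy XML shaped like the price feed (node/attribute names are assumptions)
toy <- '<places>
  <place place_id="1001">
    <gas_price type="regular">19.99</gas_price>
    <gas_price type="premium">21.49</gas_price>
  </place>
</places>'

doc    <- xmlParse(toy, asText = TRUE)
places <- getNodeSet(doc, "//place")

rows <- lapply(places, function(x) {
  pid <- xmlAttrs(x)  # named vector, e.g. place_id = "1001"
  # unname() drops the repeated "gas_price" list names so the
  # resulting matrix carries no duplicate row names
  do.call(rbind, lapply(unname(xmlChildren(x)), function(y) {
    c(pid, xmlAttrs(y), price = xmlValue(y))
  }))
})

df <- as.data.frame(do.call(rbind, rows), stringsAsFactors = FALSE)

# Prices come out as character; convert them to numeric for analysis
df$price <- as.numeric(df$price)
```

The same `as.numeric()` conversion (and an equivalent date parse on the update-time column, whatever it is named in your data) can be applied to the real `df` produced above.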

Upvotes: 2
