Reputation: 57
There is a database stored in an XML file on a Mexican government page that I'm trying to download for my analysis.
The page where you can find the data is this.
The direct download link is this; I think it points to some external repository, but honestly I'm not sure.
https://publicacionexterna.azurewebsites.net/publicaciones/prices
If you click the link above, the database in XML format is downloaded automatically.
The database contains Mexican gas prices from retail sellers across the country, with their locations in decimal degrees.
I'm able to download the database, paste it into a Windows .xls file, convert that to a .csv file, and then upload it to my R environment for the analysis.
The general problem is that when I try to download directly from the page into my R environment, I can't obtain a structured format that allows me to perform the analysis.
I'm getting duplicate rows and cannot extract all the attributes for each level of the data.
This is the script that I was able to write by myself, with help found on the internet.
# CRE FILES
library(easypackages)
my_packages <- c("rlist","readr", "tidyverse", "lubridate", "stringr",
"rebus", "stringi", "purrr", "geosphere", "XML", "RCurl", "plyr")
libraries(my_packages)
# Download link for the documents
link1 <- ("https://publicacionexterna.azurewebsites.net/publicaciones/prices")
# First we load the xml file into the environment
data_prices <- getURL(link1)
xmlfile <- xmlParse(data_prices)
class(xmlfile)
xmltop <- xmlRoot(xmlfile)
base <- ldply(xmlToList(xmltop), data.frame)
The problem is that I would like the date as another column, not as a row.
Upvotes: 0
Views: 2657
Reputation: 3369
Something like this should get you a dataframe with all of the data in separate columns.
library(RCurl)
library(XML)
# Set link to website
link1 <-("https://publicacionexterna.azurewebsites.net/publicaciones/prices")
# Get data from webpage
data_prices <- getURL(link1)
# Parse XML data
xmlfile <- xmlParse(data_prices)
# Get place nodes
places <- getNodeSet(xmlfile, "//place")
# Get values for each place
values <- lapply(places, function(x){
  # Get current place id
  pid <- xmlAttrs(x)
  # Get values for each gas type for current place
  newrows <- lapply(xmlChildren(x), function(y){
    # Get type and update time values
    attrs <- xmlAttrs(y)
    # Get price value
    price <- xmlValue(y)
    names(price) <- "price"
    # Return values
    return(c(pid, attrs, price))
  })
  # Combine rows to single list
  newrows <- do.call(rbind, newrows)
  # Return rows
  return(newrows)
})
# Combine all values into a single dataframe
df <- as.data.frame(do.call(rbind, values), stringsAsFactors = FALSE)
# Reset row names for dataframe
row.names(df) <- c(1:nrow(df))
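One caveat: because `xmlAttrs()` and `xmlValue()` return character vectors, every column of `df` comes back as character, including the price. A minimal sketch of the type-conversion step follows; the sample frame and its column names (`place_id`, `type`, `price`) are assumptions standing in for the real feed, based on the structure the code above extracts.

```r
# Stand-in for the df built above; in practice you would skip this
# and convert the columns of the extracted df directly.
df <- data.frame(place_id = c("1001", "1002"),
                 type = c("regular", "premium"),
                 price = c("19.99", "21.50"),
                 stringsAsFactors = FALSE)

# Convert the character price column to numeric before analysis
df$price <- as.numeric(df$price)

mean(df$price)
```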
Upvotes: 2