Reputation: 65
I'm trying to convert a number of XML files into a corpus of documents for text analysis using the Quanteda package in R. How do I load the files in R in a way that allows them to be made into a corpus?
Each XML file contains metadata and text (title, link, description) from 16 articles. I have 119 of these XML files.
Here's an example of the XML I'm working with:
<item><title>Aseguran que este sábado se verá en las góndolas la rebaja del IVA en alimentos</title>
<link>https://www.losandes.com.ar/article/view?slug=aseguran-que-este-sabado-se-vera-en-las-gondolas-la-rebaja-del-iva-en-alimentos</link><guid>https://www.losandes.com.ar/article/view?slug=aseguran-que-este-sabado-se-vera-en-las-gondolas-la-rebaja-del-iva-en-alimentos</guid><description>Según el presidente de la Asociación de Supermercados Unidos, algunas cadenas podrían aplicar los cambios dispuestos por el Gobierno.</description></item>
<item><title>Vélez: Fernando Gago concentra para el duelo contra Lanús</title><link>https://www.losandes.com.ar/article/view?slug=velez-fernando-gago-concentra-para-el-duelo-contra-lanus</link><guid>https://www.losandes.com.ar/article/view?slug=velez-fernando-gago-concentra-para-el-duelo-contra-lanus</guid><description>El ex mediocampista de Boca se entrenó en Liniers y podría ser parte del equipo que el domingo visitará a Lanús, a las 17.45.</description></item>
<item><title>Scaloni prepara la lista para la gira del seleccionado argentino por los Estados Unidos</title><link>https://www.losandes.com.ar/article/view?slug=scaloni-prepara-la-lista-para-la-gira-del-seleccionado-argentino-por-los-estados-unidos</link><guid>https://www.losandes.com.ar/article/view?slug=scaloni-prepara-la-lista-para-la-gira-del-seleccionado-argentino-por-los-estados-unidos</guid><description>Argentina tiene programado dos amistosos en EEUU frente a Chile y México a principio de septiembre y el domingo confirmaría la nómina.</description></item>
I cannot figure out how to read these texts into R in a way that lets Quanteda recognize that each file contains 16 separate documents.
I have managed to create a list with a sublist for each XML file, but that doesn't really work or get me where I want.
I've also tried simply using the readtext() function that accompanies Quanteda, since it is supposed to be able to read XML, but I get an error (shown below).
rm(list = ls())
setwd("~/research projects/data expansion")
library(rvest)
library(stringr)
library(dplyr)
# Here is what I tried to do first:
all_files <- list.files(pattern = "\\.xml$")
# the first three functions on this page save the title, description, and link for each article
# as character vectors
get.desc <- function(x) {
  page <- read_html(x)
  desc <- html_nodes(page, "item")
  desc <- html_nodes(desc, "description")
  html_text(desc)
}
get.title <- function(x) {
  page <- read_html(x)
  title <- html_nodes(page, "item")
  title <- html_nodes(title, "title")
  html_text(title)
}
get.link <- function(x) {
  page <- read_html(x)
  link <- html_nodes(page, "item")
  link <- html_nodes(link, "guid")
  html_text(link)
}
# to.collect is a function that iterates the last three "get" functions
# and then stores that information in a list
to.collect <- function(file) {
  N <- length(file)
  my.store <- vector("list", N)
  for (i in 1:N) {
    my.store[[i]][[1]] <- get.title(file[i])
    my.store[[i]][[2]] <- get.desc(file[i])
    my.store[[i]][[3]] <- get.link(file[i])
  }
  my.store
}
# This loop iterates the to.collect function over every file in the folder
# and then stores each file's information in a larger list called "files_all"
N <- length(all_files)
files_all <- list()
for (i in 1:N) {
  test <- to.collect(file = all_files[i])
  title <- test[[1]][[1]]
  desc <- test[[1]][[2]]
  link <- test[[1]][[3]]
  files_all[[all_files[i]]] <- list(title = title, description = desc, link = link)
}
I don't know what to do from here, so I gave up on that approach for now.
Here is my attempt at simply using readtext():
#install.packages("quanteda")
#install.packages("readtext")
library(XML)
library(xml2)
library(readtext)
library(quanteda)
texts <- readtext("*.xml")
I expected that, when using readtext(), the result would be an object containing the now-parsed XML files, which I could then convert into a corpus. Instead I get this error:
> texts <- readtext("*.xml")
Error in xml2_to_dataframe(xml) :
The xml format does not fit for the extraction without xPath
Use xPath method instead
Upvotes: 1
Views: 608
Reputation: 2448
What you want to accomplish can be done like this. First, I prepare two functions:
library(tidyverse)
library(rvest)
## function to parse each item node
parse_item <- function(nod){
  return(data.frame(title = nod %>% html_nodes('title') %>% html_text,
                    description = nod %>% html_nodes('description') %>% html_text,
                    link = nod %>% html_nodes('guid') %>% html_text,
                    stringsAsFactors = FALSE))
}
## function to process an entire file
parse_file <- function(filename){
  read_html(filename) %>%    # read the file
    html_nodes("item") %>%   # extract the item nodes
    lapply(parse_item) %>%   # apply parse_item() to each node
    bind_rows()              # stack the items into one data.frame
}
The following two lines then build a single data.frame from all of your files:
files <- list.files(pattern = "\\.xml$")
entire_dataset <- lapply(files, parse_file) %>% bind_rows()
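From there, getting the quanteda corpus you were after is one more step. A minimal sketch, assuming the description column holds the text you want to analyse: corpus() accepts a data.frame directly, and the remaining columns become document-level variables.
library(quanteda)

# Each row of entire_dataset (one <item>) becomes one document, with title
# and link kept as docvars. text_field = "description" is an assumption;
# point it at whichever column holds the text you want to analyse.
my_corpus <- corpus(entire_dataset, text_field = "description")
summary(my_corpus, n = 5)
This gives you 16 documents per file across all 119 files, each with its title and link attached as metadata.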
Upvotes: 1