Reputation: 445
We have a data frame, df, like so:
df <- data.frame(id = c(1, 2, 3, 4, 5),
                 urls = c(NA, NA, "https://www.bing.com",
                          "https://www.bing.com https://www.google.com",
                          "https://github.com/"),
                 stringsAsFactors = FALSE)
Then we have a function that reads in real URLs and gets the title tag of each page, like so:
get_title_tag <- function(url) {
  if (is.na(ifelse(url == "", NA, url))) {
    return(NA)
  }
  else if (identical(xml2::read_html(url), character(0))) {
    return(NA)
  }
  else {
    page <- xml2::read_html(url)
    path_to_title <- "/html/head/title"
    conf_nodes <- rvest::html_nodes(page, xpath = path_to_title)
    title <- rvest::html_text(conf_nodes)
    #return(title)
    return("PAGE_TITLE")
  }
}
The problem is that the fourth element of the urls column contains two consecutive URLs, so we get errors. We have looked at several posts here in the forums; however, none deal with the problem we are facing.
Our goal is to get this output:
> df
  id                  urls
1  1                  <NA>
2  2                  <NA>
3  3            PAGE_TITLE
4  4 PAGE_TITLE PAGE_TITLE
5  5            PAGE_TITLE
I have tried this method, which separates the URLs but also expands the df, which is not what I want:
library(dplyr)
library(tidyr)
urls_only_vector <- df %>%
  mutate(urls = strsplit(as.character(urls), " ")) %>%
  unnest(urls) #%>% select("urls")
Using this method I can read the URLs one at a time, but since it expands my data frame, I was wondering whether there is something else I can do. Can I get a hint, please? I would appreciate any help.
Upvotes: 0
Views: 65
Reputation: 388982
It is better to put the URLs in separate rows, apply the get_title_tag function to get each title, and then combine the data again, grouping by id, so that the size of the data remains the same.
library(dplyr)
df %>%
  tidyr::separate_rows(urls, sep = '\\s+') %>%
  mutate(title = purrr::map_chr(urls, get_title_tag)) %>%
  group_by(id) %>%
  summarise(title = toString(title))
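Note that toString() joins values with ", " and turns a missing value into the string "NA", so the result of the pipeline above differs slightly from the output requested in the question. If the exact space-separated form with real NA values matters, a base R variant along the following lines might work. This is only a sketch: titles_per_cell and fake_title are hypothetical names introduced here, and fake_title merely stands in for get_title_tag so that the example runs without network access.

```r
# Hypothetical helper (not from the original post): apply `title_fun` to every
# whitespace-separated URL in each element of `urls`, then rejoin per element.
titles_per_cell <- function(urls, title_fun) {
  vapply(urls, function(cell) {
    # Keep missing or empty cells as NA rather than the string "NA"
    if (is.na(cell) || !nzchar(trimws(cell))) {
      return(NA_character_)
    }
    # Split the cell on runs of whitespace into individual URLs
    parts <- strsplit(trimws(cell), "\\s+")[[1]]
    # Take the first value returned for each URL (NA if nothing is returned)
    titles <- vapply(parts, function(u) as.character(title_fun(u))[1],
                     character(1), USE.NAMES = FALSE)
    # Join the titles of one cell with a single space
    paste(titles, collapse = " ")
  }, character(1), USE.NAMES = FALSE)
}

# Assumption: a stand-in for get_title_tag() that needs no network access
fake_title <- function(url) "PAGE_TITLE"

df <- data.frame(id = c(1, 2, 3, 4, 5),
                 urls = c(NA, NA, "https://www.bing.com",
                          "https://www.bing.com https://www.google.com",
                          "https://github.com/"),
                 stringsAsFactors = FALSE)

df$urls <- titles_per_cell(df$urls, fake_title)
df
```

With real pages, fake_title would be replaced by get_title_tag. The data frame keeps its five rows because the splitting and rejoining happen inside each cell instead of across rows.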
Upvotes: 1