Replace a URL (or a string containing multiple URLs) with a value returned from a function

Question

We have a data frame like so:

df <- data.frame(id= c(1,2,3,4,5),
                 urls= c(NA,NA,"https://www.bing.com",
                         "https://www.bing.com https://www.google.com",
                         "https://github.com/"),
                 stringsAsFactors = FALSE)

Then we have a function that reads a real URL and gets the title tag of the page, like so:

get_title_tag <- function(url) {

  if (is.na(ifelse(url == "", NA, url))) {
    return(NA)
  }
  else if(identical(xml2::read_html(url), character(0))){
    return(NA)
  }
  else{
    page <- xml2::read_html(url)

    path_to_title <- "/html/head/title"

    conf_nodes <- rvest::html_nodes(page, xpath = path_to_title)

    title <- rvest::html_text(conf_nodes)

    #return(title)
   return ("PAGE_TITLE")
  }
}

The problem is that the 4th element of the urls column contains two consecutive URLs separated by a space, so we get errors. We have looked at several posts on this forum, but none of them deal with the problem we are facing.

Our goal is to get this output:

> df
  id                  urls
1  1
2  2
3  3            PAGE_TITLE
4  4 PAGE_TITLE PAGE_TITLE
5  5            PAGE_TITLE

I have tried this method, which separates the URLs but also expands the df, which is not what I want:

library(dplyr)
library(tidyr)

# Split each cell on spaces, then give every URL its own row
urls_only_vector <- df %>%
  mutate(urls = strsplit(as.character(urls), " ")) %>%
  unnest(urls) #%>% select("urls")

Using this method I can read the URLs one at a time, but since it expands my data frame, I was wondering if there is something else I can do. Can I get a hint, please? I would appreciate any help.

Ronak Shah · Accepted Answer

It is better to put the URLs in separate rows, apply the get_title_tag function to get each title, and then combine the data again by grouping on id, so that the size of the data stays the same.

library(dplyr)

df %>%
  tidyr::separate_rows(urls, sep = '\\s+') %>%
  mutate(title = purrr::map_chr(urls, get_title_tag)) %>%
  group_by(id) %>%
  summarise(title = toString(title))
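
If you would rather not reshape the data at all, the same split, look up, and recombine idea can be applied cell by cell in base R. The following is only a minimal sketch under some assumptions: get_title_stub is a hypothetical stand-in for the question's get_title_tag so the example runs without network access, and replace_urls is a helper name invented here, not part of any package.

```r
# Stand-in for the question's get_title_tag(); it avoids network access
# so the sketch can run offline. Swap in the real function when scraping.
get_title_stub <- function(url) {
  if (is.na(url) || url == "") return(NA_character_)
  "PAGE_TITLE"
}

# Hypothetical helper: split one cell on whitespace, look up each URL,
# and paste the titles back into a single space-separated string.
replace_urls <- function(cell, f = get_title_stub) {
  if (is.na(cell)) return(NA_character_)
  parts <- strsplit(trimws(cell), "\\s+")[[1]]
  paste(vapply(parts, f, character(1)), collapse = " ")
}

df <- data.frame(id = 1:5,
                 urls = c(NA, NA, "https://www.bing.com",
                          "https://www.bing.com https://www.google.com",
                          "https://github.com/"),
                 stringsAsFactors = FALSE)

# One result per row, so the data frame keeps its five rows
df$urls <- vapply(df$urls, replace_urls, character(1), USE.NAMES = FALSE)
df
```

Unlike summarise(title = toString(title)), which joins titles with ", " and turns a missing value into the string "NA", this sketch joins with a single space and leaves missing cells as NA. Which behaviour is preferable depends on what you do with the column afterwards.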
