sprinklesx

Reputation: 83

Using map() in R to apply a list to function for web scraping

Here's my problem: I have a list I've generated containing a large number of links, and I want to apply a function to that list to scrape data from all those links. However, when I run the program it only takes the data from the first link of each element, reprinting that info for the correct number of iterations. Here's all my code so far:

library(tidyverse)
library(rvest)

source_link<-"http://www.ufcstats.com/statistics/fighters?char=a&page=all"
source_link_html<-read_html(source_link)

#This scrapes all the links for the pages of all the fighters
links_vector<-source_link_html%>%
  html_nodes("div ul li a")%>%
  html_attr("href")%>%
  #This seq selects the 26 needed links, i.e. from a-z
  .[1:26]

#Modifies the pulled data so the links become usable and contain all the fighters instead of just some
links_vector_modded<-str_c("http://www.ufcstats.com", links_vector,"&page=all")

fighter_links<-sapply(links_vector_modded, function(links_vector_modded){
  read_html(links_vector_modded)%>%
  html_nodes("tr td a")%>%
  html_attr("href")%>%
  .[seq(1,length(.),3)]%>%
  na.omit()
})

###Next Portion: Using the above links to further harvest

#Take all the links within an element of fighter_links and run it through the function career_data to scrape all the statistics from said pages.
fighter_profiles_a<-map(fighter_links$`http://www.ufcstats.com/statistics/fighters?char=a&page=all`, function(career_data){
  #Below is where I believe my problem lies
  read_html()%>%
  html_nodes("div ul li")%>%
  html_text() 
})

The issue I'm having is in the last section of code, at read_html(). I don't know how to apply each link in the list element to that function. Additionally, is there a way to process all of the elements of fighter_links at once instead of doing it one element at a time?

Thank you for any advice and assistance!

Upvotes: 1

Views: 301

Answers (2)

andrew_reece

Reputation: 21274

The challenge is that fighter_links is a list of vectors. Applying map to each list element leaves you with a vector of URLs, and you want to get information from each URL.

If it's important to retain the structure of fighter_links - meaning, you don't lose which URL belongs to each fighter - you can nest your call to map, like this:

fighter_profiles <- 
  fighter_links %>%
    map(function(url_list) {
      map(url_list,
           function(url) read_html(url) %>% 
             html_nodes("div ul li") %>% 
             html_text() %>%
             str_replace_all(., "\n\\s+\n\\s+", "")) # a little clean up
    })

This produces nested output, which you can use to keep track of which fighter_links entry it came from:

[[1]]
[[1]][[1]]
 [1] "Height:--\n    "         "Weight:155 lbs.\n    "   "Reach:--\n    "          "STANCE:"                
 [5] "DOB:Jul 13, 1978"        "SLpM:0.00\n\n        "   "Str. Acc.:0%\n        "  "SApM:0.00\n        "    
 [9] "Str. Def:0%\n        "   ""                        "TD Avg.:0.00\n        "  "TD Acc.:0%\n        "   
[13] "TD Def.:0%\n        "    "Sub. Avg.:0.0\n        " "Events & Fights"         "Fighters"               
[17] "Stat Leaders"           

[[1]][[2]]
 [1] "Height:--\n    "         "Weight:155 lbs.\n    "   "Reach:--\n    "          "STANCE:"                
 [5] "DOB:Jul 13, 1978"        "SLpM:0.00\n\n        "   "Str. Acc.:0%\n        "  "SApM:0.00\n        "    
 [9] "Str. Def:0%\n        "   ""                        "TD Avg.:0.00\n        "  "TD Acc.:0%\n        "   
[13] "TD Def.:0%\n        "    "Sub. Avg.:0.0\n        " "Events & Fights"         "Fighters"               
[17] "Stat Leaders"           

Note: You can use map instead of the initial sapply as well, if you like:

path <- "http://www.ufcstats.com/statistics/fighters"
query_str <- paste0("?char=", letters, "&page=all")
urls <- paste0(path, query_str)

get_fighter_link <- function(url) {
  read_html(url)%>%
    html_nodes("tr td a")%>%
    html_attr("href")%>%
    .[seq(1, length(.), 3)]%>%
    na.omit()
}

fighter_links <- map(urls, get_fighter_link)
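One difference to be aware of: sapply names its output by the input URLs, which is what makes the `fighter_links$`http://...`` indexing in the question work, whereas map does not add names on its own. A minimal sketch of restoring those names with purrr::set_names, using toupper as a stand-in for get_fighter_link so it runs without hitting the network:

```r
library(purrr)

# Placeholder vector standing in for the real fighter-listing URLs
urls <- c("http://www.ufcstats.com/statistics/fighters?char=a&page=all",
          "http://www.ufcstats.com/statistics/fighters?char=b&page=all")

# set_names(urls) names each element after its own value, and map()
# carries those names through to the result list
fighter_links <- map(set_names(urls), function(url) {
  # get_fighter_link(url) would go here; toupper() is a harmless stand-in
  toupper(url)
})

names(fighter_links)  # the URLs, so $-indexing by URL works as before
```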

Upvotes: 1

Ronak Shah

Reputation: 389135

You can unlist to get all the fighter_links together and pass them to the map function to extract the relevant text.

library(rvest)
library(purrr)

fighter_profiles_a<-map(unlist(fighter_links), function(career_data){
  read_html(career_data)%>%
    html_nodes("div ul li")%>%
    html_text() 
})

The text captured at fighter_profiles_a might require some additional cleaning.
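For instance, the stray newlines and runs of whitespace visible in the scraped strings can be collapsed with stringr::str_squish. A small sketch on sample strings shaped like the output shown above:

```r
library(stringr)

# Sample strings resembling the scraped profile fields
raw <- c("Height:--\n    ", "Weight:155 lbs.\n    ", "SLpM:0.00\n\n        ")

# str_squish() trims leading/trailing whitespace and collapses internal
# whitespace runs (including newlines) to single spaces
str_squish(raw)
# "Height:--"  "Weight:155 lbs."  "SLpM:0.00"
```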

Upvotes: 1
