Reputation: 83
Here's my problem: I've generated a list containing a large number of links, and I want to apply a function to that list to scrape some data from all of those links. However, when I run the program it only takes the data from the first link of each element, reprinting that info for the correct number of iterations. Here's all my code so far:
library(tidyverse)
library(rvest)
source_link<-"http://www.ufcstats.com/statistics/fighters?char=a&page=all"
source_link_html<-read_html(source_link)
#This scrapes all the links for the pages of all the fighters
links_vector<-source_link_html%>%
html_nodes("div ul li a")%>%
html_attr("href")%>%
#This seq selects the 26 needed links, i.e. from a-z
.[1:26]
#Modifies the pulled data so the links become usable and contain all the fighters instead of just some
links_vector_modded<-str_c("http://www.ufcstats.com", links_vector,"&page=all")
fighter_links<-sapply(links_vector_modded, function(links_vector_modded){
read_html(links_vector_modded[])%>%
html_nodes("tr td a")%>%
html_attr("href")%>%
.[seq(1,length(.),3)]%>%
na.omit(fighter_links)
})
###Next Portion: Using the above links to further harvest
#Take all the links within an element of fighter_links and run it through the function career_data to scrape all the statistics from said pages.
fighter_profiles_a<-map(fighter_links$`http://www.ufcstats.com/statistics/fighters?char=a&page=all`, function(career_data){
#Below is where I believe my problem lies
read_html()%>%
html_nodes("div ul li")%>%
html_text()
})
The issue I'm having is in the last section of code, read_html(). I do not know how to apply each link in the element within the list to that function. Additionally, is there a way to call all of the elements of fighter_links at once instead of doing it one element at a time?
Thank you for any advice and assistance!
Upvotes: 1
Views: 301
Reputation: 21274
The challenge is that fighter_links is a list of vectors. Applying map to each list element leaves you with a vector of URLs, and you want to get information from each URL. If it's important to retain the structure of fighter_links - meaning, you don't lose which URL belongs to each fighter - you can nest your call to map, like this:
fighter_profiles <-
  fighter_links %>%
  map(function(url_list) {
    map(url_list,
        function(url) read_html(url) %>%
          html_nodes("div ul li") %>%
          html_text() %>%
          str_replace_all(., "\n\\s+\n\\s+", "")) # a little clean up
  })
This produces nested output, which you can use to keep track of which fighter_links entry it came from:
[[1]]
[[1]][[1]]
[1] "Height:--\n " "Weight:155 lbs.\n " "Reach:--\n " "STANCE:"
[5] "DOB:Jul 13, 1978" "SLpM:0.00\n\n " "Str. Acc.:0%\n " "SApM:0.00\n "
[9] "Str. Def:0%\n " "" "TD Avg.:0.00\n " "TD Acc.:0%\n "
[13] "TD Def.:0%\n " "Sub. Avg.:0.0\n " "Events & Fights" "Fighters"
[17] "Stat Leaders"
[[1]][[2]]
[1] "Height:--\n " "Weight:155 lbs.\n " "Reach:--\n " "STANCE:"
[5] "DOB:Jul 13, 1978" "SLpM:0.00\n\n " "Str. Acc.:0%\n " "SApM:0.00\n "
[9] "Str. Def:0%\n " "" "TD Avg.:0.00\n " "TD Acc.:0%\n "
[13] "TD Def.:0%\n " "Sub. Avg.:0.0\n " "Events & Fights" "Fighters"
[17] "Stat Leaders"
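If you also want each profile keyed by the fighter URL it was scraped from, rather than by position, one option is to name the inner results as you go. This is a minimal sketch, not part of the original code; it assumes fighter_links is a named list (which the sapply() call above produces) and that stringr is loaded via the tidyverse:
library(purrr)
library(rvest)
library(stringr)
# Sketch: same nested map, but name each profile by its fighter URL.
# The outer names (the page URLs) carry over automatically, since map()
# preserves the names of a named input list.
fighter_profiles_named <-
  fighter_links %>%
  map(function(url_list) {
    url_list %>%
      map(~ read_html(.x) %>%
            html_nodes("div ul li") %>%
            html_text() %>%
            str_replace_all("\n\\s+\n\\s+", "")) %>%
      set_names(url_list) # inner names = fighter URLs
  })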
Note: You can use map instead of the initial sapply as well, if you like:
path <- "http://www.ufcstats.com/statistics/fighters"
query_str <- paste0("?char=", letters, "&page=all")
urls <- paste0(path, query_str)

get_fighter_link <- function(url) {
  read_html(url) %>%
    html_nodes("tr td a") %>%
    html_attr("href") %>%
    .[seq(1, length(.), 3)] %>%
    na.omit()
}
fighter_links <- map(urls, get_fighter_link)
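One small difference from sapply: map() does not name its output after a character input, so if you want to keep track of which letter each element corresponds to, you could name the result yourself. A small sketch, not part of the original code:
# Hypothetical follow-up: index fighter_links by letter instead of position
fighter_links <- set_names(map(urls, get_fighter_link), letters)
fighter_links[["a"]] # fighter page URLs scraped from the "a" listing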
Upvotes: 1
Reputation: 389135
You can unlist to get all the fighter_links together and pass the result to the map function to extract the relevant text.
library(rvest)
library(purrr)
fighter_profiles_a <- map(unlist(fighter_links), function(career_data) {
  read_html(career_data) %>%
    html_nodes("div ul li") %>%
    html_text()
})
The text captured in fighter_profiles_a might require some additional cleaning.
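For example, here is a minimal sketch of one possible cleanup pass (an assumption about the kind of cleaning you want, not part of the answer): str_squish() from stringr trims each string and collapses the embedded newlines and runs of spaces.
library(stringr)
# Collapse the "\n" runs and extra whitespace in every scraped field
fighter_profiles_a_clean <- map(fighter_profiles_a, str_squish)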
Upvotes: 1