sprinklesx

Reputation: 83

How do I clean and organize this list of scraped data?

My issue is that, with great help from this community, I've managed to scrape much of the data I want, but I haven't managed to organize it in any meaningful way. The links in source below are a representative sample of the many links I have for this project.

library(rvest)
library(tidyverse)

#source links
source<-c("http://www.ufcstats.com/fighter-details/f2688492b9a525a3","http://www.ufcstats.com/fighter-details/f1fac969a1d70b08")

fp_e<-map(source, function(career_data){
  read_html(career_data)%>%
    html_nodes("div ul li")%>%
    html_text()%>%
    #cleans up the data a bit
    str_replace_all(.,"\n\\s+\n\\s+","")%>%
    as.data.frame(.)
})

What I want to do with this list is turn it into a usable data frame. My original idea was to transpose() it after as.data.frame(); however, all that did was put everything in a single row. Additionally, I was unable to index the data frame. This led me to believe the data frame was not set up how I thought it was. I want to be more specific here, but I'm honestly quite confused at this point.

Searching around, I found this question and the answer by neilfws gave me an idea of building the dataframe and inserting the data into it; however, I don't even know where to start. I'm also unsure if it's necessary to do that when it's already set up in a format I like.

This is the first real-world R application I've tried and I'm really stumbling on how to organize this data. Thank you for any help and suggestions!

Upvotes: 2

Views: 113

Answers (1)

Ronak Shah

Reputation: 389135

You can do some data cleaning with the tidyverse:

library(tidyverse)
library(rvest)

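# `source` is the vector of fighter URLs defined in the question
map(source, function(career_data){
  read_html(career_data) %>%
    html_nodes("div ul li") %>%
    html_text() %>%
    # strip leading/trailing spaces and newlines from each list item
    trimws(whitespace = '[\\s\n]') %>%
    tibble(data = .) %>%
    # split "Label: value" at the colon; items without a colon get NA in Value
    separate(data, c('Property', 'Value'), sep = ':') %>%
    # drop the rows where the split produced an NA
    na.omit() %>%
    # remove the whitespace left around each value
    mutate(Value = trimws(Value, whitespace = '[\n\\s]'))
})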

This returns:

#[[1]]
# A tibble: 13 x 2
#   Property  Value         
#   <chr>     <chr>         
# 1 Height    "5' 6\""      
# 2 Weight    "135 lbs."    
# 3 Reach     "68\""        
# 4 STANCE    "Orthodox"    
# 5 DOB       "Oct 16, 1981"
# 6 SLpM      "3.70"        
# 7 Str. Acc. "39%"         
# 8 SApM      "2.70"        
# 9 Str. Def  "66%"         
#10 TD Avg.   "2.28"        
#11 TD Acc.   "31%"         
#12 TD Def.   "65%"         
#13 Sub. Avg. "0.3"         

#[[2]]
# A tibble: 13 x 2
#   Property  Value         
#   <chr>     <chr>         
# 1 Height    "6' 2\""      
# 2 Weight    "170 lbs."    
# 3 Reach     "74\""        
# 4 STANCE    "Southpaw"    
# 5 DOB       "Aug 25, 1991"
# 6 SLpM      "2.53"        
# 7 Str. Acc. "47%"         
# 8 SApM      "2.05"        
# 9 Str. Def  "55%"         
#10 TD Avg.   "1.39"        
#11 TD Acc.   "31%"         
#12 TD Def.   "70%"         
#13 Sub. Avg. "0.4"         
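
If you want a single data frame with one row per fighter rather than a list of tibbles, one possible next step is to stack the list and pivot it wider. This is only a sketch: `fighters` and `fighter_df` are illustrative names, and the small list below is typed in by hand from a few rows of the output above so the example runs without scraping.

```r
library(tidyverse)

# Stand-in for the list returned by map() above: two small tibbles built
# from a few of the rows shown in the output (hypothetical object name).
fighters <- list(
  tibble(Property = c("Height", "Weight", "STANCE", "SLpM"),
         Value    = c("5' 6\"", "135 lbs.", "Orthodox", "3.70")),
  tibble(Property = c("Height", "Weight", "STANCE", "SLpM"),
         Value    = c("6' 2\"", "170 lbs.", "Southpaw", "2.53"))
)

fighter_df <- fighters %>%
  # stack the tibbles; fighter_id records the list position ("1", "2", ...)
  bind_rows(.id = "fighter_id") %>%
  # turn each Property into its own column, giving one row per fighter
  pivot_wider(names_from = Property, values_from = Value) %>%
  # optional: convert columns that hold numbers from character to numeric
  mutate(Weight = parse_number(Weight),
         SLpM   = as.numeric(SLpM))

fighter_df
```

With the real scraped list, naming its elements first (for example with set_names(source) before map()) should make fighter_id hold the URL instead of the list position, which may be easier to join against later.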

Upvotes: 2
