Canovice

Reputation: 10441

In R, get messy data scraped and organized into data frame

We are trying to scrape general information on college basketball coaches. The two example pages we are trying to scrape are the ones loaded in the code further down (one on gobonnies.sbu.edu, one on uofoathletics.com).

Our ideal output is:

data.frame(
  name = c('Mark Schmidt', 'Sean Neal', 'Matt Pappano', 'Steve Curran', 'Tray Woodall', NA, 'Dominique Broadus'),
  title = c("Head Men's Basketball Coach", "Assistant Men's Basketball Coach", "Director Of Basketball Operations", "Associate Head Coach, Men's Basketball", "Assistant Men's Basketball Coach", "Head Women's Basketball Coach", "Assistant Women's Basketball Coach"),
  email = c(NA, '[email protected]', '[email protected]', '[email protected]', '[email protected]', NA, '[email protected]'),
  phone = c('716-375-2207', '716-375-2257', '716-375-2218', '716-375-2258', '716-375-2259', '479-979-1325', '479-979-1325'),
  stringsAsFactors = FALSE
)

               name                                  title               email        phone
1      Mark Schmidt            Head Men's Basketball Coach                <NA> 716-375-2207
2         Sean Neal       Assistant Men's Basketball Coach       [email protected] 716-375-2257
3      Matt Pappano      Director Of Basketball Operations    [email protected] 716-375-2218
4      Steve Curran Associate Head Coach, Men's Basketball     [email protected] 716-375-2258
5      Tray Woodall       Assistant Men's Basketball Coach    [email protected] 716-375-2259
6              <NA>          Head Women's Basketball Coach                <NA> 479-979-1325
7 Dominique Broadus     Assistant Women's Basketball Coach [email protected] 479-979-1325

This is causing us issues for a few reasons:

Here's what we have so far:

library(rvest)  # provides read_html(), html_nodes() and re-exports %>%

# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')

page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')


# turn bios into 1-column dataframes (not really what we need)
page1_list <- lapply(page1_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page1_bios_df <- unlist(page1_list) %>% as.data.frame()

page2_list <- lapply(page2_bios, function(x) paste(x %>% html_children() %>% html_text(), collapse = " "))
page2_bios_df <- unlist(page2_list) %>% as.data.frame()

We are not all that close, and in fact we're not certain this is even possible. Our thinking is that we first need to get the data into a data frame, even if the column names are wrong, and then examine the contents of each column (e.g. look for @ symbols for emails, digits for phone numbers, the word "coach" for titles, etc.) to name the columns correctly.
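The content-based idea described above can be sketched in base R. This is only an illustration under stated assumptions: `classify_bio` is a hypothetical helper, the sample fragments and addresses are made up, and the patterns (an @ for emails, a ddd-ddd-dddd shape for phones, "coach" or "director" for titles) are guesses that would need tuning against the real pages.

```r
# Hypothetical helper: sort the text fragments of one bio into named fields
# by looking at their contents rather than their position.
classify_bio <- function(fragments) {
  fragments <- trimws(fragments)
  fragments <- fragments[nzchar(fragments)]

  # Pull out the first fragment matching `pattern` (or NA) and remove it
  # from the pool so it cannot be matched again by a later pattern.
  take <- function(pattern) {
    hit <- grep(pattern, fragments, ignore.case = TRUE)
    if (length(hit) == 0) return(NA_character_)
    value <- fragments[hit[1]]
    fragments <<- fragments[-hit[1]]
    value
  }

  email <- take("@")
  phone <- take("[0-9]{3}-[0-9]{3}-[0-9]{4}")
  title <- take("coach|director")

  # Assumption: whatever is left over is the person's name
  name <- if (length(fragments) > 0) fragments[1] else NA_character_

  data.frame(
    name  = name,
    title = title,
    email = email,
    phone = sub("^Phone:\\s*", "", phone),
    stringsAsFactors = FALSE
  )
}

# Made-up fragments standing in for the text of two scraped bios
bios <- list(
  c("Jane Doe", "Head Men's Basketball Coach", "jdoe@example.edu", "Phone: 555-010-0001"),
  c("Assistant Women's Basketball Coach", "Phone: 555-010-0002")
)
result <- do.call(rbind, lapply(bios, classify_bio))
```

Because each field is matched by content, a bio with a missing name or email still yields one row with NA in the right column, instead of shifting the remaining values into the wrong columns.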

Upvotes: 1

Views: 46

Answers (2)

Peace Wang

Reputation: 2419

I can only open the first URL, so my solution covers only that page:

library(rvest)
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()

name <- page1 %>% 
    html_nodes(css = "div.coach-bios-wrapper.clearfix span.name") %>%
    html_text()

title <- page1 %>% html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div > p:nth-child(2)") %>%
    html_text()

email <- page1 %>% 
    html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
    html_text() %>%
    gsub(".*\n(.*@.*)\nPhone.*","\\1",.)

# entries where the pattern did not match still hold the full text; set them to NA
email[grep("@", email, invert = TRUE)] <- NA

phone <- page1 %>% 
    html_nodes(css = "div.coach-bios-wrapper.clearfix > div > div > div > div > div") %>%
    html_text() %>%
    gsub(".*\nPhone: (.*)\n.*","\\1",.)

df <- data.frame(name,title,email,phone)
# df$email[which(!grepl("@",df$email))] <- NA
df
#>           name                                  title            email
#> 1 Mark Schmidt            Head Men's Basketball Coach             <NA>
#> 2 Steve Curran Associate Head Coach, Men's Basketball  [email protected]
#> 3    Sean Neal       Assistant Men's Basketball Coach    [email protected]
#> 4 Tray Woodall       Assistant Men's Basketball Coach [email protected]
#> 5 Matt Pappano      Director Of Basketball Operations [email protected]
#>          phone
#> 1 716-375-2207
#> 2 716-375-2258
#> 3 716-375-2257
#> 4 716-375-2259
#> 5 716-375-2218

Created on 2021-07-17 by the reprex package (v2.0.0)

Upvotes: 1

stefan

Reputation: 125038

One option to achieve your desired result is shown below. The approach extracts each piece of information separately from every bio, using a specific CSS selector for each field:

library(rvest)
library(magrittr)

# go to pages, grab person bios
page1 <- 'https://gobonnies.sbu.edu/sports/m-baskbl/coaches/index' %>% read_html()
page1_bios <- page1 %>% html_nodes('div.coach-bios .coach-bio .info')

page2 <- 'https://uofoathletics.com/sports/wbkb/coaches/index' %>% read_html()
page2_bios <- page2 %>% html_nodes('div.coach-bios .coach-bio .info')

get_bios <- function(x) {
  data.frame(
    name = x %>% html_node("span.name") %>% html_text(),
    title = x %>% html_node("p:nth-of-type(2)") %>% html_text(),
    email = x %>% html_node("p.email a") %>% html_attr("href"),
    phone = x %>% html_node("p:last-of-type") %>% html_text()
  )
}


# turn each bio into a one-row data frame, then bind all rows together
page1_list <- lapply(page1_bios, get_bios)
page2_list <- lapply(page2_bios, get_bios)

bios_df <- do.call("rbind", c(page1_list, page2_list))

bios_df$email <- gsub("^mailto:(.*)$", "\\1", bios_df$email)
bios_df$phone <- gsub("^Phone:\\s(.*)$", "\\1", bios_df$phone)

bios_df
#>                name                                  title               email
#> 1      Mark Schmidt            Head Men's Basketball Coach                <NA>
#> 2      Steve Curran Associate Head Coach, Men's Basketball     [email protected]
#> 3         Sean Neal       Assistant Men's Basketball Coach       [email protected]
#> 4      Tray Woodall       Assistant Men's Basketball Coach    [email protected]
#> 5      Matt Pappano      Director Of Basketball Operations    [email protected]
#> 6                            Head Women's Basketball Coach                <NA>
#> 7 Dominique Broadus     Assistant Women's Basketball Coach [email protected]
#>          phone
#> 1 716-375-2207
#> 2 716-375-2258
#> 3 716-375-2257
#> 4 716-375-2259
#> 5 716-375-2218
#> 6 479-979-1325
#> 7 479-979-1325
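Note that row 6 of the output above holds an empty string for the name, whereas the desired output in the question has NA there. A small post-processing step can close that gap. The snippet below is a minimal sketch that uses a made-up stand-in for `bios_df` (the values are placeholders, not scraped data) and assumes that empty or whitespace-only strings should always be treated as missing.

```r
# Toy stand-in for bios_df; values are placeholders for illustration only
bios_df <- data.frame(
  name  = c("Mark Schmidt", "", "Dominique Broadus"),
  title = c("Head Men's Basketball Coach",
            "Head Women's Basketball Coach",
            "Assistant Women's Basketball Coach"),
  email = c(NA, NA, "someone@example.edu"),
  phone = c("716-375-2207", "479-979-1325", "479-979-1325"),
  stringsAsFactors = FALSE
)

# Trim whitespace in every character column and turn empty strings into NA;
# values that are already NA are left untouched.
bios_df[] <- lapply(bios_df, function(col) {
  if (is.character(col)) {
    col <- trimws(col)
    col[!is.na(col) & col == ""] <- NA_character_
  }
  col
})
```

Assigning into `bios_df[]` keeps the object a data frame with its original column names, so the step can be dropped in right after the two `gsub()` calls.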

Upvotes: 1
