tangerine7199

Reputation: 489

Web Scraping Notable Names

I'm trying to get the profile details (Gender, for example) from each site listed here: https://www.nndb.com/lists/494/000063305/

Here's an individual site so viewers can see what a single page looks like.

I'm trying to model my R code after this site, but it's difficult because on the individual pages there aren't headings for Gender, for example. Can someone assist?

library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
# Attempt modelled on that site: read each page and collect the .name nodes
b_dataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(ICOname = html_text(html_nodes(page, ".name")))
})

Upvotes: 0

Views: 638

Answers (1)

Kim

Reputation: 4298

I'll take you halfway there; it shouldn't be too difficult to figure out the rest from here.

library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"

First, the following generates the list of A-Z surname-index URLs, and then, from those, each person's profile URL.

## Get the A-Z surname-index links from the main list page
all_surname_urls <- read_html(url_base) %>%
  html_nodes(".newslink") %>%
  html_attrs() %>%
  map(pluck(1, 1))

## From each surname page, collect the href of every link
all_ppl_urls <- map(
  all_surname_urls, 
  function(x) read_html(x) %>%
    html_nodes("a") %>%
    html_attrs() %>%
    map(pluck(1, 1))
) %>% 
  unlist()

## Drop duplicates, the surname-index pages themselves, and the homepage,
## leaving only individual profile URLs
all_ppl_urls <- setdiff(
  all_ppl_urls[!duplicated(all_ppl_urls)], 
  c(all_surname_urls, "http://www.nndb.com/")
)
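As a quick sanity check (not part of the original answer), it can help to peek at what was collected before moving on:

# Roughly how many profile URLs were collected, and what do the first few look like?
length(all_ppl_urls)
head(all_ppl_urls)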

You are correct: there are no separate headings for gender, or any other field, really. You'll just have to use a tool such as SelectorGadget to see which elements contain what you need. In this case it's simply p.

all_ppl_urls[1] %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text()

The output will be

[1] "AKA Lee William Aaker"
[2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
[3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
[4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
...

Although the output is not clean, things rarely are when web scraping; this is actually a relatively easy case. You can use a series of grepl and map calls to subset the contents you need and make a data frame out of them, along the lines of the sketch below.
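For illustration, here is a minimal sketch of that grepl-and-map idea. The parse_profile helper, its field names, and the regular expressions are my own assumptions based on the sample output above, not part of the original answer; fields such as Born or Birthplace would need different patterns.

## Parse one profile page's p-node text into a one-row data frame
library(purrr)
library(rvest)

parse_profile <- function(url) {
  lines <- url %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()

  # Find the line containing a label and extract the value after it; because
  # the labels run together with no space, "[A-Z][a-z]+" stops the capture at
  # the next capitalised label.
  grab <- function(label) {
    hit <- lines[grepl(paste0(label, ":"), lines)]
    if (length(hit) == 0) return(NA_character_)
    m <- regmatches(
      hit[1],
      regexpr(paste0(label, ":\\s*[A-Z][a-z]+( [A-Z][a-z]+)*"), hit[1])
    )
    if (length(m) == 0) return(NA_character_)
    sub(paste0(label, ":\\s*"), "", m)
  }

  data.frame(
    gender      = grab("Gender"),
    race        = grab("Race or Ethnicity"),
    occupation  = grab("Occupation"),
    nationality = grab("Nationality"),
    stringsAsFactors = FALSE
  )
}

# For example, parse the first few profiles and bind them into one data frame
b_dataset <- map_df(all_ppl_urls[1:5], parse_profile)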

Upvotes: 1
