Reputation: 489
I'm trying to get the profile information (Gender, for example) from each site listed here: https://www.nndb.com/lists/494/000063305/
Here's an individual site so viewers can see the single page.
I'm trying to model my R code after this site, but it's difficult because the individual pages don't have separate headings for Gender, for example. Can someone assist?
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
b_dataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(ICOname = html_text(html_nodes(page, ".name")))
})
Upvotes: 0
Views: 638
Reputation: 4298
I'll take you halfway there: it's not too difficult to figure out the rest from here.
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
First, the following generates the list of A-Z surname-list URLs and then, from those, each person's profile URL.
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
  html_nodes(".newslink") %>%
  html_attrs() %>%
  map(pluck(1, 1))
## Gets each person's profile URL from every surname page
all_ppl_urls <- map(
  all_surname_urls,
  function(x) read_html(x) %>%
    html_nodes("a") %>%
    html_attrs() %>%
    map(pluck(1, 1))
) %>%
  unlist()
## Drop duplicates, the surname pages themselves, and the homepage link
all_ppl_urls <- setdiff(
  all_ppl_urls[!duplicated(all_ppl_urls)],
  c(all_surname_urls, "http://www.nndb.com/")
)
You are correct: there are no separate headings for gender or anything else, really. You'll just have to use a tool such as SelectorGadget to see which elements contain what you need. In this case it's simply the p tag.
all_ppl_urls[1] %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text()
The output will be
[1] "AKA Lee William Aaker"
[2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
[3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
[4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
...
Although the output is not clean, things rarely are when web scraping, and this is actually a relatively easy case. You can use a series of grepl and map calls to subset the contents you need and build a data frame out of them.
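For example, here's a minimal sketch of that last step, not a full solution. It assumes all_ppl_urls from the code above; get_gender is just an illustrative helper name I made up, and the regex relies on the "Gender: ...Race or Ethnicity: ..." layout shown in the output, so you'd adapt it for the other fields.

## Sketch: pull the gender for a handful of profiles into a data frame
get_gender <- function(url) {
  txt <- url %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()
  ## keep the paragraph that contains "Gender:"
  gender_line <- txt[grepl("Gender:", txt)][1]
  ## text between "Gender:" and the next field, "Race or Ethnicity:"
  sub(".*Gender: *(.*)Race or Ethnicity:.*", "\\1", gender_line)
}

## small sample only, to be polite to the server
map_df(all_ppl_urls[1:5], function(x) {
  data.frame(url = x, gender = get_gender(x), stringsAsFactors = FALSE)
})

The same pattern (grepl to find the right paragraph, then sub to pull out the value) works for birthplace, occupation, and the rest.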
Upvotes: 1