Reputation: 489
I'm trying to get the profile information (Gender, for example) from each site listed here: https://www.nndb.com/lists/494/000063305/
Here's an individual site so viewers can see the single page.
I'm trying to model my R code after this site, but it's difficult because the individual pages don't have separate headings for Gender, for example. Can someone assist?
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
b_dataset <- map_df(1:91, function(i) {
  page <- read_html(sprintf(url_base, i))
  data.frame(ICOname = html_text(html_nodes(page, ".name")))
})
Upvotes: 0
Views: 638
Reputation: 4298
I'll take you halfway there: it's not too difficult to figure out the rest from here.
library(purrr)
library(rvest)
url_base <- "https://www.nndb.com/lists/494/000063305/"
First, the following generates the list of A-Z surname-list URLs and then, from those, each person's profile URL.
## Gets A-Z links
all_surname_urls <- read_html(url_base) %>%
  html_nodes(".newslink") %>%
  html_attrs() %>%
  map(pluck(1, 1))
## Gets each person's profile URL from every surname page
all_ppl_urls <- map(
  all_surname_urls,
  function(x) read_html(x) %>%
    html_nodes("a") %>%
    html_attrs() %>%
    map(pluck(1, 1))
) %>%
  unlist()
## Drop duplicates, the surname pages themselves, and the homepage link
all_ppl_urls <- setdiff(
  all_ppl_urls[!duplicated(all_ppl_urls)],
  c(all_surname_urls, "http://www.nndb.com/")
)
You are correct: there are no separate headings for gender or anything else, really. You'll just have to use a tool such as SelectorGadget to see which elements contain what you need. In this case it's simply the p tag.
all_ppl_urls[1] %>%
  read_html() %>%
  html_nodes("p") %>%
  html_text()
The output will be
[1] "AKA Lee William Aaker"
[2] "Born: 25-Sep-1943Birthplace: Los Angeles, CA"
[3] "Gender: MaleRace or Ethnicity: WhiteOccupation: Actor"
[4] "Nationality: United StatesExecutive summary: The Adventures of Rin Tin Tin"
...
Although the output is not clean, things rarely are when web scraping, and this is actually a relatively easy case. You can use a series of grepl and map calls to subset the contents you need and build a data frame out of them.
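For example, here's a minimal sketch of that last step, not a full solution. It assumes all_ppl_urls from the code above; get_gender is just an illustrative helper name I made up, and the regex relies on the "Gender: ...Race or Ethnicity: ..." layout shown in the output, so you'd adapt it for the other fields.

## Sketch: pull the gender for a handful of profiles into a data frame
get_gender <- function(url) {
  txt <- url %>%
    read_html() %>%
    html_nodes("p") %>%
    html_text()
  ## keep the paragraph that contains "Gender:"
  gender_line <- txt[grepl("Gender:", txt)][1]
  ## text between "Gender:" and the next field, "Race or Ethnicity:"
  sub(".*Gender: *(.*)Race or Ethnicity:.*", "\\1", gender_line)
}

## small sample only, to be polite to the server
map_df(all_ppl_urls[1:5], function(x) {
  data.frame(url = x, gender = get_gender(x), stringsAsFactors = FALSE)
})

The same pattern (grepl to find the right paragraph, then sub to pull out the value) works for birthplace, occupation, and the rest.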
Upvotes: 1