Connor Watson

Reputation: 115

R: rvest library to extract nested node content

This is a link to a journal page:
https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9
I'm trying to get the following: Author Affiliations (for all authors), Corresponding Author, and the Corresponding Author's Email. Note: I'm assuming the corresponding author is the last author listed in the authors section at the top of the article. I've used SelectorGadget to identify tags for other elements such as Abstract and Publication Date, but I can't figure out how to get these three. The following is my code to get the authors as a character vector:

# url is the URL for the list of articles on a particular page
s <- html_session(url)
page <- s %>% follow_link(art) %>% read_html()
str_replace_all(str_squish(page %>% html_nodes(".AuthorName") %>% html_text()), "[0-9]|Email author", "")

This returns a character vector of all the authors, in this case of length 8 (one element per author). But now I need to follow the links on their names to get the affiliations and their emails. I'm sure everything I need is in front of me, but I'm a little lost as I'm new to R and web scraping (I had to learn this quickly for my current project).

Update

The answer below is perfect.

Upvotes: 0

Views: 1322

Answers (1)

Jiaxiang

Reputation: 883

I am not sure the email address always belongs to the author in the last position. When I open the page source in Chrome, the email address appears in a separate list rather than next to a particular author.

library(rvest)
#> Loading required package: xml2
library(data.table)
library(tidyverse)
xml <- read_html('https://genomebiology.biomedcentral.com/articles/10.1186/s13059-018-1535-9')


xml %>% 
    html_nodes('.EmailAuthor') %>% 
    html_attr('href')
#> [1] "mailto:[email protected]"
    # get email address

xml %>% 
    html_nodes('.AuthorName') %>% 
    html_text
#> [1] "Ye<U+00A0>Yu"  "Jinpeng<U+00A0>Liu" "Xinan<U+00A0>Liu" "Yi<U+00A0>Zhang"
#> [5] "Eamonn<U+00A0>Magner" "Erik<U+00A0>Lehnert" "Chen<U+00A0>Qian" "Jinze<U+00A0>Liu"
    # get name

data.table(
    name = xml %>% 
        html_nodes('meta') %>% 
        html_attr('name')
    ,content = xml %>% 
        html_nodes('meta') %>% 
        html_attr('content')
) %>% 
    # extract both name and content together so the two columns stay matched
    filter(name %in% c('citation_author_institution')) %>% 
    select(content)
#>                                                                                    content
#> 1                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 2                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 3                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 4                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 5                   Department of Computer Science, University of Kentucky, Lexington, USA
#> 6                                               Seven Bridges Genomics Inc, Cambridge, USA
#> 7 Department of Computer Engineering, University of California Santa Cruz, Santa Cruz, USA
#> 8                   Department of Computer Science, University of Kentucky, Lexington, USA
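
The output above lists affiliations only, so it does not say which author each row belongs to. A minimal sketch of one way to pair them, assuming the page also carries `citation_author` meta tags and that each `citation_author_institution` tag follows the author it describes (I have not verified this for every BioMed Central article). The inline HTML and the names `html`, `tags` and `affiliations` are made up for illustration; replace `read_html(html)` with the real URL to try it on the article. It also swaps the non-breaking space (U+00A0) seen in the names above for an ordinary space.

```r
library(rvest)

# Hypothetical stand-in for the article's <head>; the real page may differ.
html <- '<html><head>
<meta name="citation_author" content="Ye&#160;Yu"/>
<meta name="citation_author_institution" content="Dept A"/>
<meta name="citation_author" content="Erik&#160;Lehnert"/>
<meta name="citation_author_institution" content="Company B"/>
<meta name="citation_author_institution" content="Dept C"/>
</head><body></body></html>'

page <- read_html(html)
meta <- html_nodes(page, "meta")

# Keep name and content side by side so they stay matched, in document order.
tags <- data.frame(
  name = html_attr(meta, "name"),
  content = html_attr(meta, "content"),
  stringsAsFactors = FALSE
)
tags <- tags[tags$name %in% c("citation_author", "citation_author_institution"), ]

# Replace non-breaking spaces with ordinary spaces.
tags$content <- gsub("\u00a0", " ", tags$content, fixed = TRUE)

# Each citation_author tag starts a new group; the institution tags
# that follow it (assumption) belong to that author.
tags$author_id <- cumsum(tags$name == "citation_author")
authors <- tags$content[tags$name == "citation_author"]

inst <- tags[tags$name == "citation_author_institution", ]
inst <- inst[inst$author_id > 0, ]  # drop institutions listed before any author

affiliations <- data.frame(
  author = authors[inst$author_id],
  affiliation = inst$content,
  stringsAsFactors = FALSE
)
print(affiliations)
```

An author with two institutions appears on two rows, which keeps the table tidy. If the ordering assumption does not hold on a given page, the pairing will be wrong, so check a few rows against the article by eye.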

Created on 2018-11-02 by the reprex package (v0.2.1)

Upvotes: 1
