Suman

Reputation: 3545

Webscraping with R

I want to scrape this web page with R and rvest. I want to extract the fifty words in this format:

[image: desired output format]

So far I have been able to do this:

library(rvest)
library(dplyr)
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words <- words %>% html_nodes("ol") %>% html_text()
strsplit(just_words, "\r\n\t")

This gets me only as far as the following output, which contains all 50 words:

[image: current output]

Any help?

EDIT

I am well aware of the Terms of Use of the website. This question is solely for practice purposes to improve scraping skills.

Upvotes: 0

Views: 274

Answers (2)

dmi3kno

Reputation: 3055

If you inspect the page (right-click and choose Inspect in Chrome), you will see that each word is in bold, wrapped in a strong child tag under the same li node as its meaning. It is therefore possible to fetch the two items separately.

library(rvest)

# Select every list item inside the article body
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/") %>%
  html_nodes("#article-detail3 li")

# The word is the text of the strong child; the meaning is the li's own text node
data.frame(words   = words %>% html_node("strong") %>% html_text(),
           meaning = words %>% html_node(xpath = "./text()") %>% html_text())

The xpath = "./text()" argument selects only the text that sits directly under each li node, not the text inside its child nodes (such as strong).
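To see the difference between the two selectors without hitting the live site, here is a minimal sketch on a small inline HTML snippet. The snippet, the #demo id, and the variable names are made up for illustration; the real page's markup may differ in whitespace and nesting.

```r
library(rvest)

# Hypothetical two-item list standing in for the real page
html <- read_html('<ol id="demo">
  <li><strong>abstract</strong> not concrete</li>
  <li><strong>apathetic</strong> feeling or showing little emotion</li>
</ol>')

items <- html %>% html_nodes("#demo li")

# Text of the strong child only
word <- items %>% html_node("strong") %>% html_text()

# First text node directly under each li; trimws() drops the leading space
meaning <- items %>% html_node(xpath = "./text()") %>% html_text() %>% trimws()

data.frame(word, meaning)
```

Calling html_text() on the li itself would return the word and the meaning glued together, which is why the two pieces are selected separately here.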

Upvotes: 1

Rohan

Reputation: 5431

I split just_words as in the question, converted the result into a data frame, and then used separate() from the tidyr package to split the single column into a word and its meaning.

library(rvest)
library(dplyr)
library(tidyr)

words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words <- words %>% html_nodes("ol") %>% html_text()

# One row per list item, in a single column named V1
x <- as.data.frame(strsplit(just_words, "\r\n\t"), col.names = "V1")
head(x)

# Split at the first non-alphanumeric run only; everything after it stays in Meaning
t <- x %>% separate(V1, c("Word", "Meaning"), extra = "merge", fill = "left")
head(t)

Output:

> head(t)
        Word                                             Meaning
1   abstract                                        not concrete
2  aesthetic        having to do with the appreciation of beauty
3  alleviate                          to ease a pain or a burden
4 ambivalent simultaneously feeling opposing feelings; uncertain
5  apathetic                   feeling or showing little emotion
6 auspicious                                favorable; promising
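For anyone unsure what extra = "merge" does in the separate() call, here is a small offline sketch. The two input strings are made up to resemble the scraped rows; they are not taken from the page.

```r
library(tidyr)

# Hypothetical rows mimicking the "word meaning" strings from the page
x <- data.frame(V1 = c("abstract not concrete",
                       "alleviate to ease a pain or a burden"),
                stringsAsFactors = FALSE)

# sep defaults to any run of non-alphanumeric characters.
# extra = "merge" splits only as many times as there are target columns,
# so everything after the first separator is kept together in Meaning.
res <- separate(x, V1, c("Word", "Meaning"), extra = "merge", fill = "left")
res
```

Without extra = "merge", the words after the second separator would be dropped with a warning, so the meanings would be cut down to a single word.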

If you are looking for more nicely formatted output, use the pander package. The output is displayed as below:

> library(pander)
> pander(head(t))

---------------------------------------
   Word              Meaning           
---------- ----------------------------
 abstract          not concrete        

aesthetic     having to do with the    
              appreciation of beauty   

alleviate   to ease a pain or a burden 

ambivalent    simultaneously feeling   
           opposing feelings; uncertain

apathetic   feeling or showing little  
                     emotion           

auspicious     favorable; promising    
---------------------------------------

To remove line breaks and surrounding spaces from the Meaning column:

t <- t %>% mutate(Meaning = trimws(gsub("[\r\n]", "", Meaning)))
tail(t)

Upvotes: 2
