Reputation: 3545
I want to scrape this web page with R and rvest. I want to extract the fifty words in this format:
So far I have been able to do this:
library(rvest)
library(dplyr)
words<-read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words<-words %>% html_nodes("ol") %>% html_text()
strsplit(just_words,"\r\n\t")
I am able to reach up to this output only, with all 50 words:
Any help?
EDIT
I am well aware of the Terms of Use of the website. This question is solely for practice purposes to improve scraping skills.
Upvotes: 0
Views: 274
Reputation: 3055
If you look at the details of the webpage (right-click and Inspect in Chrome), you will see that each word is in bold (wrapped in a strong child tag under the same li node), while the meaning is plain text inside that li. It should therefore be possible to fetch the two items separately.
library(rvest)
words <- read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/") %>%
  html_nodes("#article-detail3 li")
data.frame(words = words %>% html_node("strong") %>% html_text(),
           meaning = words %>% html_node(xpath = "./text()") %>% html_text())
The xpath="./text()" argument specifies that you want only the text of the parent node itself, not the text of its child nodes.
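To see the two selectors in isolation, here is a minimal offline sketch. The inline HTML and the #demo id are made up for illustration and only assume that the real page uses the same li/strong layout described above; trimws() is added to drop the leading space left in front of each meaning.

```r
library(rvest)

# Hypothetical markup standing in for the article's list (not the real page)
html <- read_html('<ol id="demo"><li><strong>abstract</strong> not concrete</li><li><strong>alleviate</strong> to ease a pain or a burden</li></ol>')

# One node per list item
items <- html %>% html_nodes("#demo li")

result <- data.frame(
  # text of the <strong> child: the word itself
  word = items %>% html_node("strong") %>% html_text(),
  # first text node directly under <li>: the meaning, without the word
  meaning = items %>% html_node(xpath = "./text()") %>% html_text() %>% trimws(),
  stringsAsFactors = FALSE
)

print(result)
```

Because html_node() returns exactly one match per li, the two columns stay aligned row by row.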
Upvotes: 1
Reputation: 5431
I split just_words, converted the result into a data frame, and then used separate from the tidyr package to split the single column into a word and its meaning.
library(rvest)
library(dplyr)
library(stringr)
library(tidyr)
words<-read_html("https://www.education.com/magazine/article/Ed_50_Words_SAT_Loves_2/")
just_words<-words %>% html_nodes("ol") %>% html_text()
x <- as.data.frame(strsplit(just_words,"\r\n\t"), col.names = "V1")
head(x)
t <- x %>% separate(V1, c("Word", "Meaning"), extra = "merge", fill = "left")
head(t)
Output:
> head(t)
  Word        Meaning
1 abstract    not concrete
2 aesthetic   having to do with the appreciation of beauty
3 alleviate   to ease a pain or a burden
4 ambivalent  simultaneously feeling opposing feelings; uncertain
5 apathetic   feeling or showing little emotion
6 auspicious  favorable; promising
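The split relies on extra = "merge": separate breaks each string at the first separator only and keeps everything after it in the last column. A small sketch on made-up input (the two strings are copied from the output above rather than scraped) shows the effect:

```r
library(tidyr)

# Illustrative input: one "word meaning" string per row
x <- data.frame(
  V1 = c("abstract not concrete", "auspicious favorable; promising"),
  stringsAsFactors = FALSE
)

# With two target columns and extra = "merge", only the first separator is
# used; the remainder of the string is kept intact in Meaning.
out <- separate(x, V1, c("Word", "Meaning"), extra = "merge", fill = "left")

print(out)
```

Without extra = "merge", the pieces after the second word would be dropped with a warning, so multi-word definitions would be truncated.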
If you are looking for more nicely formatted output, use the pander package. The output is displayed as below:
> library(pander)
> pander(head(t))
---------------------------------------
Word       Meaning
---------- ----------------------------
abstract   not concrete
aesthetic  having to do with the
           appreciation of beauty
alleviate  to ease a pain or a burden
ambivalent simultaneously feeling
           opposing feelings; uncertain
apathetic  feeling or showing little
           emotion
auspicious favorable; promising
---------------------------------------
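To remove line breaks and surrounding spaces from the Meaning column (the cleaned data frame is assigned back to t, and tail is called separately so that t keeps all of its rows):
t <- t %>% mutate(Meaning = trimws(gsub("[\r\n]", "", Meaning)))
tail(t)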
Upvotes: 2