Andrew Brēza

Reputation: 8317

Scraping the content of all div tags with a specific class

I'm scraping a website and want all the text that occurs inside divs of a specific class. In the following example, I want to extract everything that's in a div of class "a".

site <- "<div class='a'>Hello, world</div>
  <div class='b'>Good morning, world</div>
  <div class='a'>Good afternoon, world</div>"

My desired output is...

"Hello, world"
"Good afternoon, world"

The code below extracts the text from every div, but I can't figure out how to include only class="a".

library(tidyverse)
library(rvest)

site %>% 
  read_html() %>% 
  html_nodes("div") %>% 
  html_text()

# [1] "Hello, world"          "Good morning, world"   "Good afternoon, world"

With Python's BeautifulSoup, it would look something like site.find_all("div", class_="a").

Upvotes: 15

Views: 15953

Answers (2)

DJack

Reputation: 4940

# XPath: select any element (not only div) whose class attribute is exactly "a"
site %>% 
  read_html() %>% 
  html_nodes(xpath = '//*[@class="a"]') %>% 
  html_text()

Upvotes: 6

neilfws

Reputation: 33782

The CSS selector for a div with class "a" is div.a:

site %>% 
  read_html() %>% 
  html_nodes("div.a") %>% 
  html_text()

Or you can use XPath:

html_nodes(xpath = "//div[@class='a']")
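One difference between the two approaches is worth knowing: the CSS selector div.a matches any div whose class list includes "a", while the XPath test @class='a' compares the whole attribute value. The sketch below illustrates this on a made-up input (site2 and the extra "highlight" class are assumptions for illustration, not part of the question), along with a common XPath idiom that matches the class as a single token.

```r
library(rvest)

# Hypothetical input: the second div carries two classes.
site2 <- "<div class='a'>Hello, world</div>
  <div class='a highlight'>Good afternoon, world</div>
  <div class='b'>Good morning, world</div>"

page <- read_html(site2)

# CSS class selector: matches every div whose class list includes "a".
css_text <- page %>%
  html_nodes("div.a") %>%
  html_text()

# Exact XPath comparison: matches only when the whole attribute value is "a".
exact_text <- page %>%
  html_nodes(xpath = "//div[@class='a']") %>%
  html_text()

# Token-aware XPath: pad the attribute with spaces, then look for " a ".
token_text <- page %>%
  html_nodes(xpath = "//div[contains(concat(' ', normalize-space(@class), ' '), ' a ')]") %>%
  html_text()

css_text    # both class-"a" divs
exact_text  # only the div with class='a' and nothing else
token_text  # same divs as css_text
```

If the divs you want may carry more than one class, the CSS selector is usually the simpler choice. In rvest 1.0.0 and later, html_elements() is the preferred name for html_nodes(); the older name still works.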

Upvotes: 24
