Reputation: 392
This is my first attempt to deal with HTML and CSS selectors. I am using the R package rvest to scrape the Billboard Top 100 website. The data I am interested in include this week's rank, the song, whether or not the song is new, and whether or not the song has any awards.
I am able to get the song name and rank with the following:
library(rvest)
URL <- "http://www.billboard.com/charts/hot-100/2017-09-30"
webpage <- read_html(URL)
current_week_rank <- html_nodes(webpage, '.chart-row__current-week')
current_week_rank <- as.numeric(html_text(current_week_rank))
My problem comes with the new and award indicators. The songs are listed in rows with each of the 100 contained in:
<article class="chart-row chart-row--1 js-chart-row" ....>
</article>
If a song is new, the row will contain an element with this class:
<div class="chart-row__new-indicator">
If a song has an award, there will be this class within it:
<div class="chart-row__award-indicator">
Is there a way that I can look at all 100 instances of class="chart-row chart-row--1 js-chart-row" ... and check whether either of these exists within each one? The output that I get for current_week_rank is a single column of 100 values. I am hoping there is a way to get the indicators in the same shape, so that I have one observation for each song.
Thank you for any help or advice.
Upvotes: 1
Views: 1853
Reputation: 34733
Basically amounts to a tailored version of the Q&A I indicated above. I can't tell for 100% certain whether the "or" in the XPath is working as intended, since there's only one row in your example page with a <div class="chart-row__new-indicator">, and that row also happens to have a <div class="chart-row__award-indicator"> tag as well.
# XPath to focus on the 100 rows of interest
primary_xp = '//div[@class="chart-row__primary"]'
# XPath, relative to each row, which subselects the rows you're after
check_xp = paste('div[@class="chart-row__award-indicator" or',
                 '@class="chart-row__new-indicator"]')
webpage %>% html_nodes(xpath = primary_xp) %>%
  # a row__primary with no such child node comes back missing,
  # and hence html_attr('class') returns NA for it
  html_node(xpath = check_xp) %>%
  # ! is a bit extraneous, as it only flips FALSE to TRUE
  # for the rows you're after (necessity depends on the
  # particulars of your application)
  html_attr('class') %>% is.na %>% `!`
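If you want one observation per song with separate new/award columns, and a way to check the "or" without relying on the live page, something like the following sketch may help. The markup here is a made-up, simplified stand-in for the Billboard rows (four rows covering each combination of indicators), and has_child is a hypothetical helper of mine, not part of rvest.

```r
library(rvest)

# Hypothetical, simplified rows: both indicators, new only, award only, neither
make_row <- function(rank, inner) {
  paste0('<article class="chart-row"><div class="chart-row__primary">', inner,
         '<span class="chart-row__current-week">', rank, '</span>',
         '</div></article>')
}
new_div   <- '<div class="chart-row__new-indicator"></div>'
award_div <- '<div class="chart-row__award-indicator"></div>'
html <- paste0(
  make_row(1, paste0(new_div, award_div)),
  make_row(2, new_div),
  make_row(3, award_div),
  make_row(4, "")
)

page <- read_html(html)
rows <- html_nodes(page, xpath = '//div[@class="chart-row__primary"]')

# Hypothetical helper: TRUE where a row has a child <div> with exactly this class
has_child <- function(nodes, class_name) {
  xp <- sprintf('div[@class="%s"]', class_name)
  !is.na(html_attr(html_node(nodes, xpath = xp), "class"))
}

# Same "or" expression as check_xp above
either_xp <- paste('div[@class="chart-row__award-indicator" or',
                   '@class="chart-row__new-indicator"]')

# One row per song, with each indicator in its own column
songs <- data.frame(
  rank      = as.numeric(html_text(html_node(rows, ".chart-row__current-week"))),
  is_new    = has_child(rows, "chart-row__new-indicator"),
  has_award = has_child(rows, "chart-row__award-indicator"),
  either    = !is.na(html_attr(html_node(rows, xpath = either_xp), "class"))
)
print(songs)
```

Because html_node returns exactly one result (possibly missing) per input row, every column has the same length as rows, which is what keeps the observations lined up by song.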
FWIW, you may be able to shorten check_xp to the following:
check_xp = 'div[contains(@class, "indicator")]'
This certainly covers both "chart-row__award-indicator" and "chart-row__new-indicator", but it would also pick up other nodes with a class containing "indicator", if such an alternative tag exists (you'll have to determine this for yourself).
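To see the difference between the two expressions, here is a small sketch on invented markup. The class "chart-row__trend-indicator" is made up purely to illustrate the over-matching; I don't know whether anything like it exists on the real page.

```r
library(rvest)

# Hypothetical row: one real indicator, one invented class that also
# contains "indicator", and one unrelated div
html <- paste0(
  '<div class="chart-row__primary">',
  '<div class="chart-row__new-indicator"></div>',
  '<div class="chart-row__trend-indicator"></div>',  # made-up class
  '<div class="chart-row__title"></div>',
  '</div>'
)
row <- html_node(read_html(html), xpath = '//div[@class="chart-row__primary"]')

# Exact-match version versus the shortened contains() version
exact_xp <- paste('div[@class="chart-row__award-indicator" or',
                  '@class="chart-row__new-indicator"]')
broad_xp <- 'div[contains(@class, "indicator")]'

exact_classes <- html_attr(html_nodes(row, xpath = exact_xp), "class")
broad_classes <- html_attr(html_nodes(row, xpath = broad_xp), "class")

print(exact_classes)
print(broad_classes)
```

If the extra matches turn out to be a problem, a middle ground would be contains(@class, "new-indicator") or contains(@class, "award-indicator"), which still tolerates additional classes on the same div.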
Upvotes: 3