user192085

Reputation: 117

Python web scraping html with xpath syntax issue

I'm new to Python and trying to pull the Billboard Hot 100 list. I know there is a library for this already, but I'm practicing (and it's done differently there). My issue is that my list of songs doesn't line up with my list of artists, because the element holding the artist switches between an "a" element and a "span" element. How do I select both types of element, given that both carry [@class="chart-row__artist"]?

Currently I have:

artists = [x.strip() for x in tree.xpath('//a[@class="chart-row__artist"]/text()')]

but some entries put the artist in a span instead, which needs:

artists = [x.strip() for x in tree.xpath('//span[@class="chart-row__artist"]/text()')]

It alternates on the page. Any suggestions?

Upvotes: 2

Views: 59

Answers (2)

Luke

Reputation: 774

Is using XPath necessary? I got a list of all the artists with bs4 pretty easily, since a CSS class selector matches on the class regardless of tag name.

import requests
from bs4 import BeautifulSoup

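# Fetch the chart page and parse it with the lxml parser
response = requests.get('https://www.billboard.com/charts/hot-100')
soup = BeautifulSoup(response.content, 'lxml')
# '.chart-row__artist' matches any tag with that class, so both <a> and <span> are included
artists = [row.text.strip() for row in soup.select('.chart-row__artist')]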
print(artists)

Upvotes: 0

user192085

Reputation: 117

I think I got the syntax for XPath right. It seems like the songs are matching appropriately with artists despite the alternating element nodes for artists. I did this:

artists = [x.strip() for x in tree.xpath('//*[@class="chart-row__artist"]/text()')]

The //* prefix matches every element in the document regardless of tag name, and the predicate then keeps only those whose class attribute is exactly chart-row__artist, so it covers both the 'a' elements and the 'span' elements.
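
To check this without hitting the site, here is a minimal sketch run against a made-up snippet of HTML. The markup in SAMPLE is an assumption for illustration only, not the real Billboard page; only the class name chart-row__artist comes from the question. It also shows the union operator (|) as an alternative way to combine the two original expressions.

```python
from lxml import html

# Hypothetical markup for illustration; the real page structure may differ.
SAMPLE = """
<div>
  <div class="chart-row">
    <a class="chart-row__artist" href="#"> Artist A </a>
  </div>
  <div class="chart-row">
    <span class="chart-row__artist"> Artist B </span>
  </div>
</div>
"""

tree = html.fromstring(SAMPLE)

# Wildcard: any element whose class attribute is exactly "chart-row__artist"
artists = [x.strip() for x in tree.xpath('//*[@class="chart-row__artist"]/text()')]

# Alternative: union of the two tag-specific expressions, returned in document order
union_artists = [
    x.strip()
    for x in tree.xpath(
        '//a[@class="chart-row__artist"]/text()'
        ' | //span[@class="chart-row__artist"]/text()'
    )
]

print(artists)
print(union_artists)
```

Note that @class="chart-row__artist" is an exact string comparison, so it would miss an element that carries additional classes; in that case something like contains(@class, "chart-row__artist") may be needed.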

Upvotes: 1
