Reputation: 374
I am trying to obtain the subgenres and bands from this Wikipedia article: https://es.wikipedia.org/wiki/Indie_rock (section "Subgéneros y características").
What I want to do is store the first 2 bands of each subgenre.
Say:
Chamber pop: The Decemberists, Andrew Bird, Belle & Sebastian, Vampire Weekend, Arcade Fire, Fiona Apple, Tori Amos.
Dance-punk: The Rapture, Death from Above 1979, Liars, LCD Soundsystem, !!!, Shitdisco, Datarock, Hot Chip, Wave Machines.
I'd like to know how to get "The Decemberists, Andrew Bird" and "The Rapture, Death from Above 1979" from Chamber pop and Dance-punk into lists that look like:
Chamber_Pop = ["The Decemberists", "Andrew Bird"]
Dance_punk = ["The Rapture", "Death from Above 1979"]
What I've tried so far is:
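import requests
from bs4 import BeautifulSoup

url = "https://es.wikipedia.org/wiki/Indie_rock"
answer = requests.get(url)
html = answer.content
soup = BeautifulSoup(html, "lxml")
heading = soup.find(id='Subgéneros_y_características')
ul = heading.find_next('ul')
b_tag = ul.find_all('b')  # not used below
titles = []
for x in ul.find_all('li'):
    titles.append(x.a['title'])  # subgenre name, taken from the first link in the item
    for y in x.find_all('a'):
        titles.append(y)  # every link in the item, citations included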
This collects every link in each list item and puts the subgenre name in front of them, so I know which subgenre each group of links belongs to. Output:
['Chamber pop',
<a href="/wiki/Chamber_pop" title="Chamber pop">Chamber pop</a>,
<a href="#cite_note-5"><span class="corchete-llamada">[</span>5<span class="corchete-llamada">]</span></a>,
<a href="/wiki/The_Decemberists" title="The Decemberists">The Decemberists</a>,
<a href="/wiki/Andrew_Bird" title="Andrew Bird">Andrew Bird</a>,
<a class="mw-redirect" href="/wiki/Belle_%26_Sebastian" title="Belle & Sebastian">Belle & Sebastian</a>,
<a href="/wiki/Vampire_Weekend" title="Vampire Weekend">Vampire Weekend</a>,
<a href="/wiki/Arcade_Fire" title="Arcade Fire">Arcade Fire</a>,
<a href="/wiki/Fiona_Apple" title="Fiona Apple">Fiona Apple</a>,
<a href="/wiki/Tori_Amos" title="Tori Amos">Tori Amos</a>,
'Dance-punk',
<a class="mw-redirect" href="/wiki/Dance-punk" title="Dance-punk">Dance-punk</a>,
<a href="#cite_note-6"><span class="corchete-llamada">[</span>6<span class="corchete-llamada">]</span></a>,
<a href="/wiki/The_Rapture" title="The Rapture">The Rapture</a>,
<a class="mw-redirect" href="/wiki/Death_from_Above_1979" title="Death from Above 1979">Death from Above 1979</a>,
<a href="/wiki/Liars_(banda)" title="Liars (banda)">Liars</a>,
<a href="/wiki/LCD_Soundsystem" title="LCD Soundsystem">LCD Soundsystem</a>,
<a href="/wiki/!!!" title="!!!">!!!</a>,
<a href="/wiki/Shitdisco" title="Shitdisco">Shitdisco</a>,
<a href="/wiki/Datarock" title="Datarock">Datarock</a>,
<a href="/wiki/Hot_Chip" title="Hot Chip">Hot Chip</a>]...
Is there a better way to scrape only the first 2 bands of every subgenre?
Thank you
Upvotes: 1
Views: 66
Reputation: 195573
To print every genre together with its first two bands, you can use this example:
import requests
from bs4 import BeautifulSoup

url = 'https://es.wikipedia.org/wiki/Indie_rock'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print('{:<30} {:<30} {}'.format('Genre', 'Band 1', 'Band 2'))
# Each <li> in the first list after the section heading is one subgenre.
for item in soup.select('h2:has(#Subgéneros_y_características) ~ ul:nth-of-type(1) li'):
    # Keep only direct children of the <li>; nested links (e.g. the genre link inside <b>) are skipped.
    bands = [band.get_text(strip=True) for band in item.select('a, i') if band.parent == item]
    print('{:<30} {:<30} {}'.format(item.b.get_text(strip=True), bands[0], bands[1]))
Prints:
Genre                          Band 1                         Band 2
Chamber pop                    The Decemberists               Andrew Bird
Dance-punk                     The Rapture                    Death from Above 1979
Dream pop                      Cocteau Twins                  Spiritualized
Dunedin Sound                  Superette                      Garageland
Indie pop                      Enon                           The Hush Sound
Garage rock                    Los Growlers                   Count Five
Madchester                     The Stone Roses                Happy Mondays
Math rock                      Foals                          Don Caballero
Indie folk                     Beirut                         CocoRosie
Neo-psicodelia                 Tame Impala                    The Flaming Lips
No wave                        Glenn Branca                   Lydia Lunch
Noise rock                     Sonic Youth                    Butthole Surfers
Post-punk                      The Cure                       Siouxsie And The Banshees
Post-punk revival              Interpol                       The Walkmen
Post-rock                      Bowery Electric                Sigur Rós
Sadcoreo (Slowcore)            Copeland                       Pedro the Lion
Shoegazing                     My Bloody Valentine            Slowdive
Twee pop                       Camera Obscura                 The Flaming Lips
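If the goal is the lists from the question rather than printed output, the same "direct children only" idea can fill a dictionary keyed by subgenre. The following is a minimal sketch, not the code above verbatim: `first_bands`, `SAMPLE_HTML`, and the "Lonely genre" entry are hypothetical names and test data, and the sample markup only approximates the structure of the real page, so the selectors may need adjusting against the live article.

```python
from bs4 import BeautifulSoup

# Hypothetical, trimmed-down stand-in for the relevant part of the page.
SAMPLE_HTML = """
<h2><span id="Subgéneros_y_características">Subgéneros y características</span></h2>
<ul>
<li><b><a href="/wiki/Chamber_pop">Chamber pop</a></b><sup><a href="#cite_note-5">[5]</a></sup>:
<a href="/wiki/The_Decemberists">The Decemberists</a>, <a href="/wiki/Andrew_Bird">Andrew Bird</a>,
<a href="/wiki/Belle_%26_Sebastian">Belle &amp; Sebastian</a>.</li>
<li><b><a href="/wiki/Dance-punk">Dance-punk</a></b><sup><a href="#cite_note-6">[6]</a></sup>:
<a href="/wiki/The_Rapture">The Rapture</a>, <a href="/wiki/Death_from_Above_1979">Death from Above 1979</a>,
<a href="/wiki/Liars_(banda)">Liars</a>.</li>
<li><b><a href="/wiki/Lonely_genre">Lonely genre</a></b>: <a href="/wiki/Only_Band">Only Band</a>.</li>
</ul>
"""


def first_bands(html, heading_id="Subgéneros_y_características", limit=2):
    """Map each subgenre name to a list of its first `limit` bands."""
    soup = BeautifulSoup(html, "html.parser")
    heading = soup.find(id=heading_id)
    if heading is None:
        return {}
    ul = heading.find_next("ul")
    if ul is None:
        return {}

    result = {}
    for li in ul.find_all("li", recursive=False):
        if li.b is None:
            continue  # no bold genre name in this item, so skip it
        genre = li.b.get_text(strip=True)
        # recursive=False keeps only direct children of the <li>, which leaves out
        # the genre link (inside <b>) and the citation link (inside <sup>).
        children = li.find_all(["a", "i"], recursive=False)
        # Slicing never raises, even when an item lists fewer than `limit` bands.
        result[genre] = [tag.get_text(strip=True) for tag in children[:limit]]
    return result


if __name__ == "__main__":
    for genre, bands in first_bands(SAMPLE_HTML).items():
        print(genre, bands)
```

To run it against the real article, pass `requests.get(url).content` instead of `SAMPLE_HTML`. A dictionary is used instead of one variable per subgenre (`Chamber_Pop`, `Dance_punk`, ...) because the subgenre names are only known at run time; `result["Chamber pop"]` then plays the role of `Chamber_Pop`.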
Upvotes: 1