js352

Reputation: 374

Getting elements with BeautifulSoup 4

I am trying to obtain the subgenres and bands from this Wikipedia article: https://es.wikipedia.org/wiki/Indie_rock (section "Subgéneros y características").

What I want to do is store the first two bands of each subgenre.

For example, given:

Chamber pop: The Decemberists, Andrew Bird, Belle & Sebastian, Vampire Weekend, Arcade Fire, Fiona Apple, Tori Amos.
Dance-punk: The Rapture, Death from Above 1979, Liars, LCD Soundsystem, !!!, Shitdisco, Datarock, Hot Chip, Wave Machines.

I'd like to know how to get "The Decemberists, Andrew Bird" and "The Rapture, Death from Above 1979" from Chamber pop and Dance-punk into lists that look like:

Chamber_Pop = ["The Decemberists", "Andrew Bird"]
Dance_punk = ["The Rapture", "Death from Above 1979"]

What I've tried so far is:

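import requests
from bs4 import BeautifulSoup

url = "https://es.wikipedia.org/wiki/Indie_rock"

answer = requests.get(url)
html = answer.content
soup = BeautifulSoup(html, "lxml")

# Locate the section heading and the first list that follows it
heading = soup.find(id='Subgéneros_y_características')
ul = heading.find_next('ul')
b_tag = ul.find_all('b')  # not used below

titles = []
for x in ul.find_all('li'):
    # Subgenre name, taken from the first link in the list item
    titles.append(x.a['title'])
    # Every link in the item: subgenre, citation and bands
    for y in x.find_all('a'):
        titles.append(y)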

This collects every link and puts the subgenre name in front of each group so I know which one it belongs to. Output:

['Chamber pop',
 <a href="/wiki/Chamber_pop" title="Chamber pop">Chamber pop</a>,
 <a href="#cite_note-5"><span class="corchete-llamada">[</span>5<span class="corchete-llamada">]</span></a>,
 <a href="/wiki/The_Decemberists" title="The Decemberists">The Decemberists</a>,
 <a href="/wiki/Andrew_Bird" title="Andrew Bird">Andrew Bird</a>,
 <a class="mw-redirect" href="/wiki/Belle_%26_Sebastian" title="Belle &amp; Sebastian">Belle &amp; Sebastian</a>,
 <a href="/wiki/Vampire_Weekend" title="Vampire Weekend">Vampire Weekend</a>,
 <a href="/wiki/Arcade_Fire" title="Arcade Fire">Arcade Fire</a>,
 <a href="/wiki/Fiona_Apple" title="Fiona Apple">Fiona Apple</a>,
 <a href="/wiki/Tori_Amos" title="Tori Amos">Tori Amos</a>,
 'Dance-punk',
 <a class="mw-redirect" href="/wiki/Dance-punk" title="Dance-punk">Dance-punk</a>,
 <a href="#cite_note-6"><span class="corchete-llamada">[</span>6<span class="corchete-llamada">]</span></a>,
 <a href="/wiki/The_Rapture" title="The Rapture">The Rapture</a>,
 <a class="mw-redirect" href="/wiki/Death_from_Above_1979" title="Death from Above 1979">Death from Above 1979</a>,
 <a href="/wiki/Liars_(banda)" title="Liars (banda)">Liars</a>,
 <a href="/wiki/LCD_Soundsystem" title="LCD Soundsystem">LCD Soundsystem</a>,
 <a href="/wiki/!!!" title="!!!">!!!</a>,
 <a href="/wiki/Shitdisco" title="Shitdisco">Shitdisco</a>,
 <a href="/wiki/Datarock" title="Datarock">Datarock</a>,
 <a href="/wiki/Hot_Chip" title="Hot Chip">Hot Chip</a>]...

Is there a better way of scraping only the first two bands of every subgenre?

Thank you

Upvotes: 1

Views: 66

Answers (1)

Andrej Kesely

Reputation: 195573

To print every genre together with its first two bands, you can use this example:

import requests
from bs4 import BeautifulSoup


url = 'https://es.wikipedia.org/wiki/Indie_rock'
soup = BeautifulSoup(requests.get(url).content, 'html.parser')

print('{:<30} {:<30} {}'.format('Genre', 'Band 1', 'Band 2'))
# Select the <li> items of the first <ul> that follows the section heading
for item in soup.select('h2:has(#Subgéneros_y_características) ~ ul:nth-of-type(1) li'):
    # Keep only direct children of the <li>; this skips the genre link and the citation link
    bands = [band.get_text(strip=True) for band in item.select('a, i') if band.parent == item]
    # The genre name is the text of the bold tag
    print('{:<30} {:<30} {}'.format(item.b.get_text(strip=True), bands[0], bands[1]))

Prints:

Genre                          Band 1                         Band 2
Chamber pop                    The Decemberists               Andrew Bird
Dance-punk                     The Rapture                    Death from Above 1979
Dream pop                      Cocteau Twins                  Spiritualized
Dunedin Sound                  Superette                      Garageland
Indie pop                      Enon                           The Hush Sound
Garage rock                    Los Growlers                   Count Five
Madchester                     The Stone Roses                Happy Mondays
Math rock                      Foals                          Don Caballero
Indie folk                     Beirut                         CocoRosie
Neo-psicodelia                 Tame Impala                    The Flaming Lips
No wave                        Glenn Branca                   Lydia Lunch
Noise rock                     Sonic Youth                    Butthole Surfers
Post-punk                      The Cure                       Siouxsie And The Banshees
Post-punk revival              Interpol                       The Walkmen
Post-rock                      Bowery Electric                Sigur Rós
Sadcoreo (Slowcore)            Copeland                       Pedro the Lion
Shoegazing                     My Bloody Valentine            Slowdive
Twee pop                       Camera Obscura                 The Flaming Lips
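
Since the question asks for lists rather than a printed table, the same idea can be collected into a dictionary keyed by subgenre. What follows is only a minimal sketch: it assumes each list item holds the genre link inside a bold tag, the citation link inside a sup tag, and the bands as direct children, which is what the filter above relies on. The sample HTML and the helper name first_bands are made up for illustration so the snippet runs without a network request.

```python
from bs4 import BeautifulSoup

# Hypothetical sample imitating the assumed structure of the Wikipedia list items.
SAMPLE_HTML = """
<ul>
  <li><b><a href="/wiki/Chamber_pop" title="Chamber pop">Chamber pop</a></b>:<sup><a href="#cite_note-5">[5]</a></sup>
      <a href="/wiki/The_Decemberists" title="The Decemberists">The Decemberists</a>,
      <a href="/wiki/Andrew_Bird" title="Andrew Bird">Andrew Bird</a>,
      <a href="/wiki/Vampire_Weekend" title="Vampire Weekend">Vampire Weekend</a>.</li>
  <li><b><a href="/wiki/Dance-punk" title="Dance-punk">Dance-punk</a></b>:<sup><a href="#cite_note-6">[6]</a></sup>
      <a href="/wiki/The_Rapture" title="The Rapture">The Rapture</a>,
      <a href="/wiki/Death_from_Above_1979" title="Death from Above 1979">Death from Above 1979</a>,
      <a href="/wiki/Liars_(banda)" title="Liars (banda)">Liars</a>.</li>
</ul>
"""


def first_bands(html, count=2):
    """Map each subgenre name to the names of its first `count` bands."""
    soup = BeautifulSoup(html, "html.parser")
    result = {}
    for item in soup.select("ul > li"):
        if item.b is None:
            # Skip list items that do not start with a bold genre name.
            continue
        genre = item.b.get_text(strip=True)
        # recursive=False keeps only direct children of the <li>, which skips
        # the genre link (inside <b>) and the citation link (inside <sup>).
        bands = [tag.get_text(strip=True)
                 for tag in item.find_all(["a", "i"], recursive=False)]
        result[genre] = bands[:count]
    return result


bands_by_genre = first_bands(SAMPLE_HTML)
print(bands_by_genre)
```

Slicing with bands[:count] also avoids an IndexError for a subgenre that lists fewer than two bands. For the live page, the items selected in the example above could be passed through the same loop instead of the sample.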

Upvotes: 1
