Aeox

Reputation: 21

BeautifulSoup: Extracting text from nested tags

Long-time lurker, first-time poster. I spent some time looking over related questions, but I still couldn't figure this out. I think it's easy enough, but please forgive me; I'm still a bit of a BeautifulSoup/Python n00b.

I have a text file of URLs saved from a previous webscraping exercise. For each URL I'd like to extract the text contents of a list item (<li>) based on a given keyword, then save a CSV file with the URL in one column and the corresponding list-item contents in the second. In this case the pages are album releases, and I'd like to build a table of who mastered each album, who produced it, etc.

Given a snippet of HTML from https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls:

...
<li>
<span class="entity_1XpR8">Recorded By</span>
" – " 
<a href="/label/394786-EMI-Studios-Paris" hreflang="en" class="link_1ctor">EMI Studios, Paris</a>
</li>

<li>
<span class="entity_1XpR8">Mastered At</span>
 " – " 
<a href="/label/264060-Sterling-Sound" hreflang="en" class="link_1ctor">Sterling Sound</a>
</li>


etc etc etc

...

My code so far is something like:

import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []
kw = "Mastered At"

with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        x = soup.find_all('span', string=kw)  # a list of the matching span tags themselves
        results.append((url, x))

print(results)
df = pd.DataFrame(results)
df.to_csv('mylist1.csv')
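
That gives me the matching span tags rather than the studio names next to them. From the markup above, I'd guess I need to walk from each matched span to the neighbouring a tag; a minimal sketch of what I mean (untested against the live pages, which may render these credits differently):

import requests
from bs4 import BeautifulSoup

kw = "Mastered At"
url = "https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls"
soup = BeautifulSoup(requests.get(url).text, "html.parser")

# walk from each label span to the <a> tag that follows it
for span in soup.find_all('span', string=kw):
    link = span.find_next('a')
    if link is not None:
        print(url, link.text.strip())  # hoping for: ... Sterling Sound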

With some modifications based on the comments below, I'm still having issues. As you can see, I'm trying to do this within a for loop, once for each link in the list.

The URL list is a simple text file with one URL per line. Since I'm scraping only one website, the page structure and class names should be the same everywhere, but the list-item contents will change from page to page.

ex URL list:

https://www.discogs.com/release/7896531-The-Rolling-Stones-Some-Girls
https://www.discogs.com/release/3872976-Pink-Floyd-The-Wall
... etc etc etc

updated code snippet:

import requests
import pandas as pd
from bs4 import BeautifulSoup

results = []

with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        print(url)
        source = requests.get(url).text
        soup = BeautifulSoup(source, "html.parser")
        for li in soup.select('li'):
            # class names taken from the page snippet above; skip any
            # list item that doesn't carry a label span and a link
            span = li.select_one('span.entity_1XpR8')
            link = li.select_one('a.link_1ctor')
            if span is None or link is None:
                continue
            if span.text.strip() == 'Mastered At':
                results.append((url, link.get('href'), link.text.strip()))

df = pd.DataFrame(results, columns=['Url', 'Studio Href', 'Studio'])

print(df)
df.to_csv('studios.csv')

I'm hoping the output in this case is Col 1: the URL from the txt file; Col 2: "Mastered At – Sterling Sound" (or just "Sterling Sound"), for each page in the list, since these items vary from page to page. I'll change the keyword to extract different list items accordingly. In the end I'd like one big spreadsheet with the URL and the corresponding item side by side, something like below:

example:
album url | Sterling Sound
album url | Abbey Road
album url | Abbey Road
album url | Sterling Sound
album url | Real World Studios
album url | EMI Studios, Paris
album url | Sterling Sound

etc etc etc

Thanks for your help! Cheers.

Upvotes: 2

Views: 1035

Answers (2)

Artur Chukhrai

Reputation: 11

The Beautiful Soup library is well suited to this task.

You can use the following code to extract data:

import lxml  # imported only to check that the lxml parser is installed
from bs4 import BeautifulSoup

# Note: this assumes the file contains saved HTML, not the list of URLs
# from the question (a name like urls.html would make that clearer)
with open("urls.txt") as file:
    src = file.read()

soup = BeautifulSoup(src, 'lxml')

for first, second in zip(soup.select("li span"), soup.select("li a")):
    print(first)
    print(second)

To find the desired elements, you can use the bs4 select() method. It accepts a CSS selector and returns a list of all matching HTML elements.

In this case, I use the built-in zip() function, which lets you iterate over two sequences at once in a single loop.

Then you can use the data for your tasks.
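
For example, a minimal sketch that pairs each label with its link text (this assumes, as the zip() approach already does, that every li holds exactly one span and one a):

credits = {}
for span, a in zip(soup.select("li span"), soup.select("li a")):
    credits[span.get_text(strip=True)] = a.get_text(strip=True)

# e.g. {'Recorded By': 'EMI Studios, Paris', 'Mastered At': 'Sterling Sound'}
print(credits)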

Upvotes: 1

Barry the Platipus

Reputation: 10460

BeautifulSoup can use different parsers for HTML. If you have issues with lxml, you can try others, such as html.parser; the parser is just the second argument to the BeautifulSoup constructor.
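
For instance (lxml must be installed separately, while html.parser ships with Python):

soup = BeautifulSoup(html, "lxml")         # third-party lxml parser
soup = BeautifulSoup(html, "html.parser")  # standard-library parser

The following code creates a dataframe from your data, which can then be saved to csv or other formats: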

from bs4 import BeautifulSoup
import pandas as pd

html = '''
<li>
<span class = "spClass">Breakfast</span> " — "
<a href="/examplepage/Pancakes" class="linkClass">Pancakes</a>
</li>

<li>
<span class = "spClass">Lunch</span> " — "
<a href="/examplepage/Sandwiches" class="linkClass">Sandwiches</a>
</li>

<li>
<span class = "spClass">Dinner</span> " — "
<a href="/examplepage/Stew" class="linkClass">Stew</a>
</li>

'''

soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in soup.select('li'):
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
    
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])
print(df) ## you can save the dataframe as csv like so: df.to_csv('foods.csv')

Result:

                       Url        Food       Type
0    /examplepage/Pancakes    Pancakes  Breakfast
1  /examplepage/Sandwiches  Sandwiches      Lunch
2        /examplepage/Stew        Stew     Dinner

EDIT: If you only want to extract specific li tags, as per your comment, you can do:

soup = BeautifulSoup(html, 'html.parser')
df_list = []
for x in [x for x in soup.select('li') if x.select_one('span.spClass').text.strip() == 'Dinner']:
    df_list.append((x.select_one('a.linkClass').get('href'), x.select_one('a.linkClass').text.strip(), x.select_one('span.spClass').text.strip()))
    
df = pd.DataFrame(df_list, columns=['Url', 'Food', 'Type'])

And this will return:

                 Url  Food    Type
0  /examplepage/Stew  Stew  Dinner
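
Adapting that filter to your pages is then mostly a matter of swapping in the class names from your Discogs snippet (entity_1XpR8 and link_1ctor; treat those as assumptions to verify against the live markup, since the site may change or render them dynamically) and carrying the page URL along. A rough sketch:

import requests
import pandas as pd
from bs4 import BeautifulSoup

kw = "Mastered At"
rows = []

with open("urls.txt") as file:
    for line in file:
        url = line.rstrip()
        soup = BeautifulSoup(requests.get(url).text, "html.parser")
        for li in soup.select("li"):
            span = li.select_one("span.entity_1XpR8")  # label, e.g. "Mastered At"
            link = li.select_one("a.link_1ctor")       # studio link
            if span and link and span.get_text(strip=True) == kw:
                rows.append((url, link.get_text(strip=True)))

df = pd.DataFrame(rows, columns=["Url", kw])
df.to_csv("studios.csv", index=False)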

Upvotes: 0
