Reputation: 353

BeautifulSoup use select multiple times

My problem is related to this answer.

I have the following code:

import urllib.request
from bs4 import BeautifulSoup

time = 0

html = urllib.request.urlopen("https://www.kramerav.com/de/Product/VM-2N").read()
html2 = urllib.request.urlopen("https://www.kramerav.com/de/Product/SDIA-IN2-F16").read()
try:
    div = str(BeautifulSoup(html).select("div.large-image")[0])
    if(str(BeautifulSoup(html).select("div.large-image")[1]) != ""):
        div += str(BeautifulSoup(html).select("div.large-image")[1])
    time = time + 1
except IndexError:
    div = ""
    time = time + 1
finally:
    print(str(time) + div)

The page loaded into html contains two div elements with the class "large-image". The page loaded into html2 contains only one. With html the program works as intended, but when I switch to html2 the variable div ends up completely empty.

I would like to save the single div rather than nothing. How can I achieve this?

Upvotes: 1

Answers (2)

furas

Reputation: 142889

You can concatenate all items inside a for loop:

    all_divs = soup.select("div.large-image")

    for item in all_divs:
        div += str(item)
        time += 1

or use join():

    time = len(all_divs)

    div = ''.join(str(item) for item in all_divs)
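As a minimal sketch of the two variants above, the snippet below runs both on a small inline document. The `sample_html` string is a hypothetical stand-in for a downloaded page with a single matching div, and the explicit `"html.parser"` argument is an assumption added here so the result does not depend on which parsers are installed.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for a product page that has only one matching div.
sample_html = (
    '<div class="large-image"><img src="a.png"/></div>'
    '<div class="other">ignored</div>'
)

soup = BeautifulSoup(sample_html, "html.parser")
all_divs = soup.select("div.large-image")

# Variant 1: build the string in an explicit loop.
div_loop = ""
for item in all_divs:
    div_loop += str(item)

# Variant 2: build the same string with join().
div_join = "".join(str(item) for item in all_divs)

# The number of matches is simply the length of the result list.
time = len(all_divs)
```

Neither variant indexes into the result, so a page with one match (or none) does not raise IndexError.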

You can also write to the file directly inside the for loop, so that each item ends up in its own row:

    for item in all_divs:
        csv_writer.writerow( [str(item).strip()] )
        time += 1
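The row-per-item idea can be tried without touching the disk. In this sketch an in-memory `io.StringIO` buffer stands in for the output file, and the `fragments` list is hypothetical sample data standing in for `str(item)` of each selected div.

```python
import csv
import io

# Hypothetical fragments, standing in for str(item) of each selected div.
fragments = [
    '<div class="large-image"><img src="a.png"/></div>\n',
    '<div class="large-image"><img src="b.png"/></div>',
]

# In-memory stand-in for open('output.csv', 'w', newline='').
buffer = io.StringIO(newline="")
csv_writer = csv.writer(buffer)

# One row per fragment: a running index plus the stripped HTML.
for index, fragment in enumerate(fragments, start=1):
    csv_writer.writerow([index, fragment.strip()])

# Read the rows back to check that quoting survived the round trip.
rows = list(csv.reader(io.StringIO(buffer.getvalue(), newline="")))
```

The csv module quotes the embedded double quotes of the HTML attributes on its own, so the fragments need no manual escaping.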

Working example

import urllib.request
from bs4 import BeautifulSoup
import csv

div = ""
time = 0

f = open('output.csv', 'w', newline='')  # newline='' is recommended for csv files
csv_writer = csv.writer(f)

all_urls = [
  "https://www.kramerav.com/de/Product/VM-2N",
  "https://www.kramerav.com/de/Product/SDIA-IN2-F16",
]

for url in all_urls:
    print('url:', url)

    html = urllib.request.urlopen(url).read()

    try:
        soup = BeautifulSoup(html, "html.parser")  # name the parser explicitly
        all_divs = soup.select("div.large-image")

        for item in all_divs:
            div += str(item)
            time += 1

        # or
        time = len(all_divs)
        div = ''.join(str(item) for item in all_divs)

        # or

        for item in all_divs:
            #div += str(item)
            #time += 1
            csv_writer.writerow( [time, str(item).strip()] )

    except IndexError as ex:
        print('Error:', ex)
        time += 1
    finally:
        print(time, div)

f.close()

Upvotes: 0

J_H

Reputation: 20540

"the variable div is going to be completely empty."

That's because your error handler assigned it the empty string.
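The effect can be reproduced without any network access. In this sketch a plain list (the hypothetical `matches`) stands in for the result of select() on a page with a single match.

```python
# A plain list stands in for select() results on a page with one match.
matches = ['<div class="large-image">only one</div>']

try:
    div = matches[0]        # succeeds: the first match exists
    if matches[1] != "":    # raises IndexError: there is no second element
        div += matches[1]
except IndexError:
    div = ""                # the handler throws away the first match as well

div_after_handler = div

# Iterating never indexes past the end, so the single match survives.
div_iterated = "".join(matches)
```

The first assignment did succeed, but the exception raised on the second lookup sends control to the handler, which overwrites the value.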

Please don't use subscripts, conditionals, and handlers in that way. It would be more natural to iterate over the results of select() with for, building up a result list (or string).

Also, you should create soup = BeautifulSoup(html) just once, as that can be a fairly expensive operation, since it carefully parses a potentially long web page. With that, you could build up a list of HTML fragments with:

    images = [str(image)
              for image in soup.select('div.large-image')]

Or if for some reason you're not fond of list comprehensions, you could equivalently write:

    images = []
    for image in soup.select('div.large-image'):
        images.append(str(image))

and then get the required html with div = '\n'.join(images).
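Put together, the approach might look like the sketch below. The helper name `large_image_html` and the inline test pages are hypothetical, and `"html.parser"` is an assumed parser choice. Each result of select() is a Tag object, so it is converted with str() before joining, because str.join() accepts only strings.

```python
from bs4 import BeautifulSoup


def large_image_html(html):
    """Return every div.large-image fragment, joined by newlines ('' if none)."""
    soup = BeautifulSoup(html, "html.parser")  # parse the page only once
    images = [str(image) for image in soup.select("div.large-image")]
    return "\n".join(images)


# Hypothetical pages with one, two, and zero matching divs.
one_div = '<div class="large-image">A</div>'
two_divs = one_div + '<div class="large-image">B</div>'
no_divs = "<p>no images here</p>"

result_one = large_image_html(one_div)
result_two = large_image_html(two_divs)
result_none = large_image_html(no_divs)
```

The same function handles all three cases without subscripts or an exception handler.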

Upvotes: 1
