Scraping images using Beautiful Soup

Question

I'm trying to scrape the image from an article using Beautiful Soup. The script runs without errors, but I can't open the saved image: I get a file format error every time I try to open it from my desktop. Any insights?

timestamp = time.asctime() 

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")

# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')

txt.close()

user764357 · Accepted Answer

You are prepending a newline and the text "Image(s): " to the data for every image, and appending two more newlines after it, essentially corrupting it.

Also, you are writing every image into the same file, so the images run together and corrupt one another.
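To see why the extra text matters: JPEG data starts with the bytes FF D8 FF, and image viewers typically check for that signature. The sketch below is only an illustration in Python 3; `looks_like_jpeg` and `fake_image` are made-up names, and `fake_image` stands in for `download_img.read()` rather than being a real image.

```python
# JPEG data begins with the bytes FF D8 FF (the start-of-image marker).
JPEG_SIGNATURE = b"\xff\xd8\xff"


def looks_like_jpeg(data):
    """Return True if data starts with the JPEG signature (hypothetical helper)."""
    return data.startswith(JPEG_SIGNATURE)


# Stand-in for download_img.read(); only the opening bytes, not a full image.
fake_image = JPEG_SIGNATURE + b"\xe0" + b"\x00" * 16

# What the question's code writes: text before the bytes and newlines after.
corrupted = b"\n" + b"Image(s): " + fake_image + b"\n" + b"\n"

print(looks_like_jpeg(fake_image))  # True
print(looks_like_jpeg(corrupted))   # False
```

The file written by the question's code starts with a newline instead of the signature, which is consistent with the file format error.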

Put the logic for opening and writing the file inside the loop so that each image gets its own file, and don't add any extra data to the images. It should then work fine.

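# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for index, link in enumerate(links):
    # Open a separate file per image; the index keeps names unique
    # when several images are saved within the same second
    timestamp = time.asctime()
    txt = open('%s-%d.jpg' % (timestamp, index), "wb")
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    # Write only the raw image bytes, nothing else
    txt.write(download_img.read())
    txt.close()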
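For anyone on Python 3, where `urllib2` no longer exists and its functionality lives in `urllib.request`, the same idea could look roughly like the sketch below. It is not part of the original answer: `fetch_bytes`, `save_images`, the `fetch` parameter and the file naming scheme are all assumptions made for illustration. `time.strftime` is used because `time.asctime()` produces spaces and colons, and colons are not allowed in Windows filenames.

```python
import os
import time
from urllib.request import urlopen


def fetch_bytes(url):
    """Download a URL and return the raw response body (hypothetical helper)."""
    with urlopen(url) as response:
        return response.read()


def save_images(urls, directory, fetch=fetch_bytes):
    """Write each image to its own file and return the paths created."""
    # A compact timestamp without spaces or colons.
    stamp = time.strftime("%Y%m%d-%H%M%S")
    paths = []
    for index, url in enumerate(urls):
        # The index keeps names unique when several images share a timestamp.
        path = os.path.join(directory, "%s-%d.jpg" % (stamp, index))
        # Write only the downloaded bytes; the with block closes the file.
        with open(path, "wb") as image_file:
            image_file.write(fetch(url))
        paths.append(path)
    return paths
```

With Beautiful Soup, the `urls` list could be built from the `src` attributes of the tags returned by `soup.find('figure').find_all('img', src=True)`. The `fetch` argument is only there so the function can be tried without network access.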
