user3285763

Reputation: 149

Scraping images using Beautiful Soup

I'm trying to scrape the main image from an article using Beautiful Soup. The script runs without errors, but I can't open the resulting file: I get a file format error every time I try to open the image from my desktop. Any insights?

timestamp = time.asctime() 

# Parse HTML of article, aka making soup
soup = BeautifulSoup(urllib2.urlopen(url).read())

# Create a new file to write content to
txt = open('%s.jpg' % timestamp, "wb")

# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for link in links:
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    txt.write('\n' + "Image(s): " + download_img.read() + '\n' + '\n')

txt.close()

Upvotes: 0

Views: 13580

Answers (1)

user764357


You are prepending a newline and some text to the data for every image (and appending newlines after it), which corrupts the file.

You are also writing every image into the same file, which corrupts the result again.
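The effect of the extra text can be checked without downloading anything. The sketch below assumes the target is a JPEG, which starts with the two-byte marker FF D8; `looks_like_jpeg` and the stand-in `image_bytes` are hypothetical names used only for illustration.

```python
# JPEG files begin with the two-byte start-of-image marker FF D8.
JPEG_SIGNATURE = b"\xff\xd8"


def looks_like_jpeg(data):
    """Return True if the bytes start with the JPEG signature."""
    return data.startswith(JPEG_SIGNATURE)


# Stand-in for download_img.read(); not a complete image, just a header.
image_bytes = JPEG_SIGNATURE + b"\xff\xe0" + b"\x00" * 16

# What the question's loop writes: label text wrapped around the bytes.
written = b"\n" + b"Image(s): " + image_bytes + b"\n" + b"\n"

print(looks_like_jpeg(image_bytes))  # True
print(looks_like_jpeg(written))      # False
```

Once the file no longer starts with the signature, an image viewer has no reason to treat it as a JPEG, which matches the file format error in the question.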

Open and write each file inside the loop, and don't add any extra data to the image bytes; then it should work fine.

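# Scrape article main img
links = soup.find('figure').find_all('img', src=True)
for index, link in enumerate(links):
    # Give each image its own file; the index keeps names unique
    # even when several images are saved within the same second.
    timestamp = time.asctime()
    link = link["src"].split("src=")[-1]
    download_img = urllib2.urlopen(link)
    # Write only the raw image bytes, nothing else.
    with open('%s-%d.jpg' % (timestamp, index), "wb") as img_file:
        img_file.write(download_img.read())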
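For readers on Python 3, where `urllib2` is no longer available, the same idea might look roughly like the sketch below. The helper names `fetch_bytes` and `save_images`, the `directory` argument, and the `image-000.jpg` naming pattern are assumptions for illustration, not part of the original answer. Numbering the files instead of using `time.asctime()` avoids two images saved in the same second ending up with the same name.

```python
import os
import urllib.request


def fetch_bytes(url):
    """Download a URL and return the raw response body as bytes."""
    with urllib.request.urlopen(url) as response:
        return response.read()


def save_images(urls, directory, fetch=fetch_bytes):
    """Save each URL to its own numbered file and return the paths written."""
    paths = []
    for index, url in enumerate(urls):
        path = os.path.join(directory, "image-%03d.jpg" % index)
        # Binary mode, and only the downloaded bytes go into the file.
        with open(path, "wb") as img_file:
            img_file.write(fetch(url))
        paths.append(path)
    return paths
```

With Beautiful Soup, the list of URLs could come from something like `[img["src"] for img in soup.find('figure').find_all('img', src=True)]`, and the call would then be `save_images(urls, ".")`. The `fetch` parameter exists only so the function can be tried without network access.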

Upvotes: 2
