Reputation: 81
I've built a web scraper that extracts all images on a website. My code is supposed to print each img URL to standard output and write a CSV file containing all of them, but right now the CSV file contains only the last image found and the number of that result.
Here's the code I'm currently using:
# This program prints a list of all images contained in a web page
#imports library for url/html recognition
from urllib.request import urlopen
from HW_6_CSV import writeListToCSVFile
#imports library for regular expressions
import re
#imports for later csv writing
import csv
#gets user input
address = input("Input a url for a page to get your list of image urls ex. https://www.python.org/: ")
#opens Web Page for processing
webPage = urlopen(address)
#defines encoding
encoding = "utf-8"
#defines resultList variable
resultList=[]
#sets i for later printing
i=0
#defines logic flow
for line in webPage :
    line = str(line, encoding)
    #defines imgTag
    imgTag = '<img '
    #goes to next piece of logical flow
    if imgTag in line :
        i = i+1
        srcAttribute = 'src="'
        if srcAttribute in line:
            #parses the html retrieved from user input
            m = re.search('src="(.+?)"', line)
            if m:
                reline = m.group(1)
                #prints results
                print("[ ",[i], reline , " ]")
data = [[i, reline]]
output_file = open('examp_output.csv', 'w')
datawriter = csv.writer(output_file)
datawriter.writerows(data)
output_file.close()
webPage.close()
How do I get this program to write all of the images found to a CSV file?
Upvotes: 0
Views: 405
Reputation: 13459
You're only seeing the last result in your CSV because data is never properly updated within the scope of the for-loop: you're only writing to it once, after you've exited the loop. To get all the relevant pieces of the HTML added to your list data, you should indent that line so it runs inside the loop, and use the append or extend method of the list.
So if you'd rewrite the loop as:
img_nbr = 0  # prefer a descriptive name over `i`; it makes a later find-and-replace of the identifier much easier
data = []
imgTag = '<img '  # no need to redefine this variable each time in the loop
srcAttribute = 'src="'  # same comment applies here
for line in webPage:
    line = str(line, encoding)
    if imgTag in line:
        img_nbr += 1  # += saves you typing a few keystrokes and a possible future find-replace.
        #if srcAttribute in line:  # this check and the next do nearly the same: get rid of one
        m = re.search('src="(.+?)"', line)
        if m:
            reline = m.group(1)
            print("[{}: {}]".format(img_nbr, reline))  # `format` is the suggested way to build strings. It's been around since Python 2.6.
            data.append((img_nbr, reline))  # This is what you really missed.
you'll get better results. I've added a few comments to give some suggestions for your coding skills and removed your comments to make the new ones stand out.
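The loop above only fills data; the rows still have to be written once the loop has finished. Here is a minimal sketch of that last step, assuming the same regex approach as above. The helper names extract_image_rows and write_rows_to_csv and the sample lines are made up for illustration, and the demo writes into a temporary directory instead of your working directory so that it leaves nothing behind.

```python
import csv
import os
import re
import tempfile


def extract_image_rows(lines):
    """Collect (image number, src value) pairs from an iterable of HTML text lines."""
    rows = []
    img_nbr = 0
    for line in lines:
        if '<img ' in line:
            img_nbr += 1
            m = re.search('src="(.+?)"', line)
            if m:
                # Append inside the loop so every match is kept, not just the last one.
                rows.append((img_nbr, m.group(1)))
    return rows


def write_rows_to_csv(rows, path):
    """Write all collected rows in one go, after the loop has finished."""
    # newline='' is what the csv module documentation recommends for output files.
    with open(path, 'w', newline='') as output_file:
        csv.writer(output_file).writerows(rows)


# Hypothetical sample input standing in for the decoded lines of webPage.
sample_lines = [
    '<p>hello</p>',
    '<img src="a.png" alt="a">',
    '<div><img class="x" src="/static/b.jpg"></div>',
]

rows = extract_image_rows(sample_lines)

with tempfile.TemporaryDirectory() as tmp_dir:
    csv_path = os.path.join(tmp_dir, 'examp_output.csv')
    write_rows_to_csv(rows, csv_path)
    # Read the file back to confirm that every row was written.
    with open(csv_path, newline='') as csv_file:
        read_back = list(csv.reader(csv_file))
```

In your own script you would pass the decoded lines of webPage to the first function and 'examp_output.csv' to the second. Opening the file with a with block also closes it for you, even if writing fails.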
However, your code still has a few problems: HTML should not be parsed with regular expressions unless the source code is extremely well-structured (and even then...). Because you are asking the user for input, they could give any URL, and the web page will more often than not be poorly structured. For example, an img tag that spans several lines or uses single quotes around its src value will be missed by the line-by-line regex. I suggest you have a look at BeautifulSoup if you'd like to build more robust web scrapers.
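To give an idea of what parser-based extraction looks like without installing anything, here is a rough sketch that uses the standard library's html.parser module rather than BeautifulSoup. It is not BeautifulSoup's API, just the same idea in plain Python. The class name ImgSrcCollector and the sample HTML are invented for this example.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Collects the src attribute of every img tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names, so <IMG SRC=...> matches too.
        if tag == 'img':
            src = dict(attrs).get('src')
            if src:
                self.sources.append(src)


# Hypothetical sample page: one tag spans two lines and uses single quotes,
# one is upper case, and one has no src attribute at all.
sample_html = """<html><body>
<img
   src='one.png' alt="first">
<p>text <IMG SRC="two.jpg"><img alt="no source"></p>
</body></html>"""

collector = ImgSrcCollector()
collector.feed(sample_html)
collector.close()

# Number the results the same way the original script does.
image_rows = list(enumerate(collector.sources, start=1))
```

Because the parser works on tags and not on lines, it picks up the multi-line and upper-case img tags that the regex version would skip. The resulting image_rows list can be handed to csv.writer(...).writerows in the same way as data.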
Upvotes: 1