Reputation: 1346
I'm fairly new to Python/BeautifulSoup and was wondering if I could get some direction on how to accomplish the following.
I have HTML from a webpage that is structured as follows:
1) a block of code contained within a tag that contains all image names (Name1, Name2, Name3).
2) a block of code contained within a tag that has the image URLs.
3) a date that appears once on the webpage. I put it into the 'date' variable (this has already been extracted).
From the code, I'm trying to extract a list of lists of the form [['image1','url1', 'date'], ['image2','url2','date']], which I will later convert into a dictionary (via dict(zip(labels, values))) and insert into a MySQL table.
All I can come up with is how to extract two lists, one containing all the image names and one containing all the URLs. Any idea how to accomplish what I'm trying to do?
A few things to keep in mind:
1) the number of images always changes, along with the names (1:1)
2) the date always appears once.
P.S. Also, if there is a more elegant way to extract the data via bs4, please let me know!
from bs4 import BeautifulSoup
name = []
url = []
date = '2017-10-12'
text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>'
soup = BeautifulSoup(text, 'lxml')
#print soup.prettify()
# get image URLs
for imgz in soup.find_all('div', attrs={'class':'img-wrapper'}):
    for imglinks in imgz.find_all('img', src = True):
        #print imgz
        url.append((imglinks['src']).encode("utf-8"))
# get image names
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'):
        name.append((litag.text).encode("utf-8")) # dump all names into a list
print url
print name
Upvotes: 2
Views: 756
Reputation: 40918
Here's another possible route to pulling the urls and names:
url = [tag.get('src') for tag in soup.find_all('img')]
name = [tag.text.strip() for tag in soup.find_all('li')]
print(url)
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
print(name)
# ['NAME1', 'NAME2', 'NAME3']
As for building the final list, here's something that's functionally similar to what @t.m.adam has suggested:
print([pair + [date] for pair in list(map(list, zip(url, name)))])
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
# ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
# ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
Note that map is used fairly infrequently nowadays, and its use is outright discouraged in some places.
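If you'd rather avoid map altogether, a plain list comprehension should give the same result. This is only a sketch: the url, name and date values are hard-coded from the output shown above so that it runs on its own, and the length check is an added assumption based on the question's statement that names and images are 1:1.

```python
# Sample values copied from the output shown earlier in this answer.
url = ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
name = ['NAME1', 'NAME2', 'NAME3']
date = '2017-10-12'

# zip() silently stops at the shorter input, so verify the 1:1 assumption first.
if len(url) != len(name):
    raise ValueError('expected exactly one name per image url')

# Build one [url, name, date] row per image, without map().
rows = [[u, n, date] for u, n in zip(url, name)]
print(rows)
```

Swapping u and n inside the brackets gives the [name, url, date] order used in the question.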
Or:
n = len(url)
print(list(map(list, zip(url, name, [date] * n))))
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'], ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'], ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
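Since the question mentions converting each row with dict(zip(labels, values)) before inserting into MySQL, here is a rough sketch of that step. The labels list is hypothetical (the actual column names aren't given in the question), and the sample values are again hard-coded from the output above so the snippet runs on its own.

```python
# Sample values copied from the output shown earlier in this answer.
url = ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']
name = ['NAME1', 'NAME2', 'NAME3']
date = '2017-10-12'

# Hypothetical column names; replace with the real table columns.
labels = ['name', 'url', 'date']

# One dict per image, pairing each label with the matching value.
records = [dict(zip(labels, (n, u, date))) for n, u in zip(name, url)]

for record in records:
    print(record)
```

Each record maps a label to its value, so the order of the values no longer matters once the dictionaries are built.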
Upvotes: 1