FlyingZebra1

Reputation: 1346

Using Python + BeautifulSoup to extract tags in tandem, creating a list of lists

I'm fairly new to Python and BeautifulSoup, and was wondering if I could get some direction on how to accomplish the following.

I have HTML from a webpage that is structured as follows:

1) a block of code contained within a tag that holds all the image names (Name1, Name2, Name3).

2) a block of code contained within a tag that holds the image URLs.

3) a date that appears once on the webpage. I put it into the 'date' variable (this has already been extracted).

From the code, I'm trying to extract a list of lists of the form [['image1','url1','date'], ['image2','url2','date']], which I will later convert into dictionaries (via dict(zip(labels, values))) and insert into a MySQL table.

All I can come up with is how to extract two separate lists, one containing all the image names and one containing all the URLs. Any idea how to accomplish what I'm trying to do?

A few things to keep in mind:

1) The number of images always changes, and the number of names changes with it (1:1).

2) The date always appears once.

P.S. Also, if there is a more elegant way to extract the data via bs4, please let me know!

from bs4 import BeautifulSoup
name = []
url = []
date = '2017-10-12'

text = '<div class="tabs"> <ul><li> NAME1</li><li> NAME2</li><li> NAME3</li> </ul> <div><div><div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/1.jpg); w.print();"> Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></img> </div> <center><a class="button print" href="javascript: w=window.open("www.image2.com/2.jpg"); w.print();">Print</a> </center></div><div> <div class="img-wrapper"><img alt="" src="www.image1.com/3.jpg" title="image3.jpg"></img></div> <center><a class="button print" href="javascript: w=window.open("www.image1.com/3.jpg"); w.print();"> Print</a> </center></div> </div></div>'
soup = BeautifulSoup(text, 'lxml')
#print soup.prettify()
# 1) collect the image URLs from each img-wrapper div
for imgz in soup.find_all('div', attrs={'class':'img-wrapper'}):
    for imglinks in imgz.find_all('img', src = True): 
        #print imgz
        url.append((imglinks['src']).encode("utf-8"))
# 2) collect the image names from the <li> items
for ultag in soup.find_all('ul'):
    for litag in ultag.find_all('li'): 
        name.append((litag.text).encode("utf-8")) # append each name to the list
print url
print name

Upvotes: 2

Views: 756

Answers (1)

Brad Solomon

Reputation: 40918

Here's another possible route to pulling the URLs and names:

url = [tag.get('src') for tag in soup.find_all('img')]
name = [tag.text.strip() for tag in soup.find_all('li')]

print(url)
# ['www.image1.com/1.jpg', 'www.image2.com/2.jpg', 'www.image1.com/3.jpg']

print(name)
# ['NAME1', 'NAME2', 'NAME3']
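
If the real page contains other img or li tags outside this block, the lookups above would pick those up as well. A minimal sketch of a more tightly scoped variant using CSS selectors follows. The HTML here is a simplified, cleaned-up stand-in for the question's markup (an assumption), and 'html.parser' is used only so the sketch does not depend on lxml:

```python
from bs4 import BeautifulSoup

# Simplified stand-in for the question's markup (assumption: the real
# page has the same "tabs" / "img-wrapper" class names).
html = """
<div class="tabs">
  <ul><li> NAME1</li><li> NAME2</li></ul>
  <div>
    <div class="img-wrapper"><img alt="" src="www.image1.com/1.jpg" title="image1.jpg"></div>
    <div class="img-wrapper"><img alt="" src="www.image2.com/2.jpg" title="image2.jpg"></div>
  </div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Only <li> items directly inside the tab list, with whitespace stripped.
names = [li.get_text(strip=True) for li in soup.select("div.tabs > ul > li")]

# Only <img> tags that sit inside an img-wrapper div and have a src attribute.
urls = [img["src"] for img in soup.select("div.img-wrapper img[src]")]

print(names)
print(urls)
```

Whether the extra scoping is worth it depends on how much unrelated markup the real page has around this block.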

As for building the final list of lists, here's something that's functionally similar to what @t.m.adam has suggested (note that each inner list comes out ordered as url, name, date):

print([pair + [date] for pair in list(map(list, zip(url, name)))])
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
#  ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
#  ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]

Note that map is used less often in modern Python than list comprehensions, and some style guides discourage it in favor of comprehensions.
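
For comparison, here is a minimal sketch that builds the rows with a plain list comprehension instead of map, in the [name, url, date] order the question asks for. The helper name build_rows and the length check are assumptions added for illustration; the check is there because zip silently truncates to the shorter input if the two lists ever get out of step:

```python
def build_rows(names, urls, date):
    """Pair each name with its URL and attach the shared date (hypothetical helper)."""
    if len(names) != len(urls):
        # zip() would silently drop the unmatched items, so fail loudly instead.
        raise ValueError(
            "expected one URL per name, got %d names and %d URLs"
            % (len(names), len(urls))
        )
    return [[n, u, date] for n, u in zip(names, urls)]


date = "2017-10-12"
names = ["NAME1", "NAME2", "NAME3"]
urls = ["www.image1.com/1.jpg", "www.image2.com/2.jpg", "www.image1.com/3.jpg"]

rows = build_rows(names, urls, date)
print(rows)
# [['NAME1', 'www.image1.com/1.jpg', '2017-10-12'],
#  ['NAME2', 'www.image2.com/2.jpg', '2017-10-12'],
#  ['NAME3', 'www.image1.com/3.jpg', '2017-10-12']]
```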

Or:

n = len(url)
print(list(map(list, zip(url, name, [date] * n))))
# [['www.image1.com/1.jpg', 'NAME1', '2017-10-12'],
#  ['www.image2.com/2.jpg', 'NAME2', '2017-10-12'],
#  ['www.image1.com/3.jpg', 'NAME3', '2017-10-12']]
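
The question also mentions turning each row into a dictionary via dict(zip(labels, values)) before the database insert. A minimal sketch of that step follows, assuming hypothetical column labels 'url', 'name' and 'date' (the actual table columns are not given in the question). The labels must be listed in the same order as the values within each row:

```python
# Hypothetical column labels; they must match the order of values in each row.
labels = ["url", "name", "date"]

rows = [
    ["www.image1.com/1.jpg", "NAME1", "2017-10-12"],
    ["www.image2.com/2.jpg", "NAME2", "2017-10-12"],
    ["www.image1.com/3.jpg", "NAME3", "2017-10-12"],
]

# One dictionary per row, keyed by label.
records = [dict(zip(labels, row)) for row in rows]

for record in records:
    print(record)
```

Each record can then be passed to whatever database driver is in use, ideally through a parameterized query rather than string formatting.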

Upvotes: 1
