Reputation: 71
So I would like to get a list of download links from a page:
import re
from bs4 import BeautifulSoup
soup = BeautifulSoup(driver.page_source, 'html.parser')
linky = soup.find_all(name='a', href=re.compile(r'download\.php'))
This returns a list of all the matching links:
[<a href="download.php/947983/adam.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Adam"/></a>,
<a href="download.php/947981/barb.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Barb"/></a>,
<a href="download.php/947972/chris.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Chris"/></a>,
<a href="download.php/947971/dan.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Dan"/></a>]
I would like to extract the href link, and the img title after the "Download", and then put them into tuples.
So I would have a list like the following:
[(download.php/947983/adam.zip, Adam),
(download.php/947981/barb.zip, Barb),
(download.php/947972/chris.zip, Chris),
(download.php/947971/dan.zip, Dan)]
I thought I could split the text between href=" and "><img for each item, but I have no idea how to do that. And how would I extract the title as well?
Upvotes: 1
Views: 30
Reputation: 990
Here is one solution to your problem. Let's say list_of_names holds your links as strings (for example, str(tag) for each tag in your result list). The links and names can then be extracted with the code below:
#!/usr/bin/python
import re
list_of_names= ['<a href="download.php/947983/adam.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Adam"/></a>',
'<a href="download.php/947981/barb.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Barb"/></a>',
'<a href="download.php/947972/chris.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Chris"/></a>',
'<a href="download.php/947971/dan.zip"><img "="" alt="Download" src="browse_dl.png" style="style=" title="Download Dan"/></a>']
links = []
names = []
for row in list_of_names:
    # re.split with one capturing group returns [before, captured, after],
    # so index 1 is the captured text
    links.append(re.split(r'href="(.*)"><img', row)[1].strip())
    names.append(re.split(r'title="Download (.*)"/>', row)[1].strip())
# Pair each link with its name
desired_list = list(zip(links, names))
print(desired_list)
If you run this script, you get your desired output:
python -i code_for_desired_output.py
[('download.php/947983/adam.zip', 'Adam'), ('download.php/947981/barb.zip', 'Barb'), ('download.php/947972/chris.zip', 'Chris'), ('download.php/947971/dan.zip', 'Dan')]
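Since the question already has BeautifulSoup tags rather than strings, it may be simpler to read the attributes directly instead of running regular expressions over the markup. Below is a minimal sketch of that idea. The html sample string, the prefix variable, and the extra non-download link are my own assumptions for illustration (in the real script, soup would come from driver.page_source), and the img attributes are simplified compared to the page in the question.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for driver.page_source
html = '''
<a href="download.php/947983/adam.zip"><img alt="Download" src="browse_dl.png" title="Download Adam"/></a>
<a href="download.php/947981/barb.zip"><img alt="Download" src="browse_dl.png" title="Download Barb"/></a>
<a href="other.php">Not a download</a>
'''

soup = BeautifulSoup(html, "html.parser")
prefix = "Download "  # assumed fixed prefix of every img title

pairs = []
for a in soup.find_all("a", href=re.compile(r"download\.php")):
    img = a.find("img")
    # Skip links that have no img child or no title attribute
    if img is None or not img.get("title"):
        continue
    title = img["title"]
    # Drop the leading "Download " if present, otherwise keep the title as-is
    name = title[len(prefix):] if title.startswith(prefix) else title
    pairs.append((a["href"], name))

print(pairs)
```

With the sample above this prints [('download.php/947983/adam.zip', 'Adam'), ('download.php/947981/barb.zip', 'Barb')]. Reading a["href"] and img["title"] avoids depending on the exact order of attributes in the serialized tag, which the split-based patterns rely on.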
Upvotes: 1