Reputation: 451
I am trying to scrape urls from the html format website. I use beautiful soup. Here's a part of the html.
<li style="display: block;">
<article itemscope itemtype="http://schema.org/Article">
<div class="col-md-3 col-sm-3 col-xs-12" >
<a href="/stroke?p=3083" class="article-image">
<img itemprop="image" src="/FileUploads/Post/3083.jpg?w=300&h=160&mode=crop" alt="Banana" title="Good for health">
</a>
</div>
<div class="col-md-9 col-sm-9 col-xs-12">
<div class="article-content">
<a href="/stroke">
<img src="/assets/home/v2016/img/icon/stroke.png" style="float:left;margin-right:5px;width: 4%;">
</a>
<a href="/stroke?p=3083" class="article-title">
<div>
<h4 itemprop="name" id="playground">
Banana Good for health </h4>
</div>
</a>
<div>
<div class="clear"></div>
<span itemprop="dateCreated" style="font-size:10pt;color:#777;">
<i class="fa fa-clock-o" aria-hidden="true"></i>
09/10 </span>
</div>
<p itemprop="description" class="hidden-phone">
<a href="/stroke?p=3083">
I love Banana.
</a>
</p>
</div>
</div>
</article>
</li>
My code:
from bs4 import BeautifulSoup
re=requests.get('http://xxxxxx')
bs=BeautifulSoup(re.text.encode('utf-8'), "html.parser")
for link in bs.find_all('a') :
if link.has_attr('href'):
print (link.attrs['href'])
The result will print out all the urls from this page, but this is not what I am looking for, I only want a particular one like "/stroke?p=3083" in this example how can I set the condition in python? (I know there are totally three "/stroke?p=3083" in this, but I just need one)
Another question. This url is not complete, I need to combine them with "http://www.abcde.com" so the result will be "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how to do this in Python? Thanks in advance! :)
Upvotes: 1
Views: 2075
Reputation: 22440
Just put there a link in the scraper replacing some_link
and give it a go. I suppose you will have your desired link along with it's full form.
import requests
from bs4 import BeautifulSoup
from urllib.parse import urljoin
res = requests.get(some_link).text
soup = BeautifulSoup(res,"lxml")
for item in soup.select(".article-image"):
print(urljoin(some_link,item['href']))
Upvotes: 2
Reputation: 166
You are getting most of it right already. Collect the links as follows (just a list comprehension version of what you are doing already)
urls = [url for url in bs.findall('a') if url.has_attr('href')]
This will give you the urls. To get one of them, and append it to the abcde url you could simply do the following:
if urls:
new_url = 'http://www.abcde.com{}'.format(urls[0])
Upvotes: 0
Reputation: 81
Another question. This url is not complete, I need to combine them with "http://www.abcde.com" so the result will be "http://www.abcde.com/stroke?p=3083". I know I can use paste in R, but how to do this in Python? Thanks in advance! :)
link = 'http://abcde.com' + link
Upvotes: 0