Reputation: 69
I have written code that extracts certain text from a specified URL, but it prints the same value two or three times (depending on the webpage) on consecutive lines. I only need the first one. How should I do that? This is my code:
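import re
import requests
from bs4 import BeautifulSoup

url = "http://www.barneys.com/raf-simons-%22boys%22-poplin-shirt-504182589.html#start=2"
r = requests.get(url)
# Name the parser explicitly so BeautifulSoup does not have to guess one
soup = BeautifulSoup(r.content, "html.parser")
links = soup.find_all("a")
g_d4 = soup.find_all("ol", {"class": "breadcrumb"})
for item in g_d4:
    # Note: this searches the whole page (soup), not just the current breadcrumb (item)
    links_2 = soup.find_all('a', href=re.compile('^http://www.barneys.com/barneys-new-york/men/'))
    # Raw string so that \w reaches the regex engine unchanged
    pattern_2 = re.compile(r"clothing/(\w+)")
    for link in links_2:
        match_1 = pattern_2.search(link["href"])
        if match_1:
            print(match_1.group(1))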
My output is:
shirts
shirts
shirts
I want the output to be just:
shirts
What should I do?
Upvotes: 0
Views: 42
Reputation: 27861
I'm not sure which of these you need, so I'll answer both.
If you want unique results from across the page, you can use sets to do something like:
for item in g_d4:
    links_2 = soup.find_all('a', href=re.compile('^http://www.barneys.com/barneys-new-york/men/'))
    pattern_2 = re.compile(r"clothing/(\w+)")
    # A set stores each matched value only once
    matches = set()
    for link in links_2:
        match_1 = pattern_2.search(link["href"])
        if match_1:
            matches.add(match_1.group(1))
    print(matches)
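One caveat with the set approach: a set does not keep insertion order, so if several different categories match, they may print in any order. Below is a minimal sketch of an order-preserving variant, assuming Python 3.7+ (where dict keeps insertion order). The hrefs list and the unique_categories helper are hypothetical stand-ins for the link["href"] values on the real page, so the example runs without a network request.

```python
import re

# Hypothetical hrefs standing in for the link["href"] values found on the page
hrefs = [
    "http://www.barneys.com/barneys-new-york/men/clothing/shirts",
    "http://www.barneys.com/barneys-new-york/men/clothing/shirts",
    "http://www.barneys.com/barneys-new-york/men/clothing/jackets",
    "http://www.barneys.com/barneys-new-york/men/shoes",
]

pattern = re.compile(r"clothing/(\w+)")


def unique_categories(urls):
    """Return each captured category once, in first-seen order."""
    found = (m.group(1) for m in map(pattern.search, urls) if m)
    # dict.fromkeys drops duplicates while keeping insertion order
    return list(dict.fromkeys(found))


print(unique_categories(hrefs))
```

With the sample list above, the duplicate "shirts" entry is collapsed and the URL without a "clothing/" segment is skipped.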
If you want just the first result in each iteration, you can break within the inner loop:
for item in g_d4:
    links_2 = soup.find_all('a', href=re.compile('^http://www.barneys.com/barneys-new-york/men/'))
    pattern_2 = re.compile(r"clothing/(\w+)")
    for link in links_2:
        match_1 = pattern_2.search(link["href"])
        if match_1:
            print(match_1.group(1))
            # Stop at the first match for this breadcrumb
            break
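Note that the break only leaves the inner loop, so if the page has more than one breadcrumb list, the value is still printed once per list. If a single value for the whole page is wanted, one option is to move the search into a function and return on the first match. The following is only a sketch: first_category and the sample hrefs are hypothetical names and data, not part of the original code.

```python
import re

# Hypothetical hrefs standing in for the link["href"] values found on the page
hrefs = [
    "http://www.barneys.com/barneys-new-york/men/shoes",
    "http://www.barneys.com/barneys-new-york/men/clothing/shirts",
    "http://www.barneys.com/barneys-new-york/men/clothing/shirts",
]

pattern = re.compile(r"clothing/(\w+)")


def first_category(urls):
    """Return the first captured category, or None if nothing matches."""
    for url in urls:
        match = pattern.search(url)
        if match:
            # Returning here ends the search after the first hit
            return match.group(1)
    return None


print(first_category(hrefs))
```

In the original script, the same function could be called as first_category(link["href"] for link in links_2), outside the loop over g_d4.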
Upvotes: 1