Reputation: 35
I want to extract text from a website whose markup looks like this:
<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>
How do I select only the 'a' tags whose href starts with "#N"? There are many more of them on the page.
I tried creating a list of href values to iterate through, but when I run the code it selects only one element.
loc= ['#N0', '#N1', '#N2', '#N3', '#N4', '#N5'.....'#N100']
for i in loc:
    name=soup.find('a', attrs={'href':i})
print(name)
I get
<a href="#N44">Avalon</a>
not
<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>
And how would I get just the text, like this?
Avalon
Avondale
Bacon Park Area
Thanks in advance!
Upvotes: 0
Views: 175
Reputation: 169304
You're iterating over the items but not storing them anywhere, so when the loop finishes, all that's left in name is the last item.
You can collect them in a list as shown below, and access the .text attribute to get just the name from each tag:
names = []
for i in loc:
    tag = soup.find('a', attrs={'href': i})
    if tag is not None:  # find() returns None when no tag has this href
        names.append(tag.text)
Result:
In [15]: names
Out[15]: ['Bacon Park Area', 'Avondale', 'Avalon']
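For anyone who wants to run the loop end to end, here is a minimal sketch. It assumes the page contains only the three tags quoted in the question, adds one made-up link (/about) to show that other anchors are skipped, and builds loc programmatically for #N0 through #N100, which is my guess at what the elided list contains.

```python
from bs4 import BeautifulSoup

# Sample markup from the question; the "/about" link is a hypothetical
# extra anchor that should NOT be selected.
html = """
<a href="#N44">Avalon</a>
<a href="#N36">Avondale</a>
<a href="#N4">Bacon Park Area</a>
<a href="/about">About</a>
"""

soup = BeautifulSoup(html, "html.parser")

# Assumed candidate hrefs: '#N0', '#N1', ..., '#N100'.
loc = ["#N%d" % n for n in range(101)]

names = []
for href in loc:
    tag = soup.find("a", attrs={"href": href})
    if tag is not None:  # most candidates will not be on the page
        names.append(tag.text)

print(names)  # ['Bacon Park Area', 'Avondale', 'Avalon']
```

The names come out in the order of loc (#N4, #N36, #N44), not in the order the tags appear on the page.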
If you want to skip building the loc list altogether, you can match the href against a regular expression instead:
import re
names = [tag.text for tag in soup.find_all('a',href=re.compile(r'#N\d+'))]
In a regular expression, \d means a digit and + means one or more instances of the preceding item.
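To see what the pattern accepts, here is a small sketch using only the standard re module. The href strings are made-up test values, not taken from the real page. As far as I know, Beautiful Soup applies a compiled pattern with search(), so an unanchored pattern can match anywhere inside the attribute value; the anchored variant is a stricter alternative, not something the original answer used.

```python
import re

# Hypothetical href values for illustration only.
hrefs = ['#N44', '#N4', '#N', '#top', 'page.html#N12', '#N7b']

# Pattern from the answer: '#N' followed by one or more digits, anywhere.
pattern = re.compile(r'#N\d+')
matched = [h for h in hrefs if pattern.search(h)]
print(matched)  # ['#N44', '#N4', 'page.html#N12', '#N7b']

# Stricter variant: the whole value must be '#N' plus digits.
strict = re.compile(r'^#N\d+$')
strict_matched = [h for h in hrefs if strict.search(h)]
print(strict_matched)  # ['#N44', '#N4']
```

If the page might contain hrefs like the last two test values, passing the anchored pattern to find_all should avoid picking them up.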
Upvotes: 1