Reputation: 745
I have the following code:
import re
from bs4 import BeautifulSoup
f = open('AIDNIndustrySearchAll.txt', 'r')
g = open('AIDNurl.txt', 'w')
t = f.read()
soup = BeautifulSoup(t)
list = []
counter = 0
for link in soup.find_all("a"):
    a = link.get('href')
    if re.search("V", a) != None:
        list.append(a)
        counter = counter + 1
new_list = ['http://www.aidn.org.au/{0}'.format(i) for i in list]
output = "\n".join(i for i in new_list)
g.write(output)
print output
print counter
f.close()
g.close()
It is basically going through a saved HTML page and pulling the links I am interested in. I am new to Python, so I am sure the code is terrible but it is (almost) working ;)
The current issue is that it is returning two copies of each link, not one. I am sure it has something to do with the way the loop is set up but am a bit stuck.
I welcome any help on this question (I can provide more details if required - such as HTML and more information on links I am looking for) as well as any general code improvements so I can learn as much as possible.
Upvotes: 1
Views: 457
Reputation: 86974
As others have pointed out in comments, your loop looks OK so the repetition is likely to be in the HTML itself. If you can share a link to the HTML file perhaps we could be of more assistance.
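One way to test that theory is to count how often each href occurs in the page. The following is only a minimal sketch: the helper name repeated_hrefs and the sample hrefs are made up for illustration, and in the real script the list would come from something like [a.get('href') for a in soup.find_all('a')].

```python
from collections import Counter

def repeated_hrefs(hrefs):
    """Return {href: count} for every href that occurs more than once."""
    # Skip anchors that have no href attribute (get() returns None for those).
    counts = Counter(h for h in hrefs if h)
    return {h: n for h, n in counts.items() if n > 1}

# Hypothetical sample data, standing in for the hrefs pulled from the page.
sample_hrefs = ['ViewMember.aspx?id=1', 'ViewMember.aspx?id=1', 'About.aspx', None]
repeats = repeated_hrefs(sample_hrefs)
print(repeats)
```

If the result is non-empty for the saved page, the duplicates are already in the HTML (for example, the same link wrapped around both an image and a caption) rather than being produced by the loop.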
As for general code improvements, here's how I might approach this:
from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AIDNIndustrySearchAll.txt', 'r'))
# create a generator that returns actual href entries
links = (x.get('href') for x in soup.find_all('a'))
# filter the links to only those that contain "V" (skipping anchors
# that have no href, for which get() returns None) and store them as
# a set to remove duplicates
selected = set(a for a in links if a and "V" in a)
# build output string using selected links
output = "\n".join('http://www.aidn.org.au/{0}'.format(a) for a in selected)
# write the string to file
with open('AIDNurl.txt', 'w') as f:
    f.write(output)
print output
print len(selected) # print number of selected links
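One thing to keep in mind with the set approach: a set does not keep the links in the order they appear on the page. If the order matters, a small helper can drop repeats while preserving it. This is a sketch only; unique_in_order and the sample hrefs are hypothetical names and data, not part of the original script.

```python
def unique_in_order(items):
    """Drop repeated items, keeping the first occurrence of each."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Hypothetical hrefs, made up for illustration.
hrefs = ['V2.html', 'V1.html', 'V2.html', 'V1.html']
ordered = unique_in_order(hrefs)
print(ordered)
```

The set is still used for fast membership checks, but the output list is built in page order.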
Upvotes: 2
Reputation: 4348
Since you've asked for code optimizations too, I will post my suggestions as an answer. Feel free!
from bs4 import BeautifulSoup
f = open('AIDNIndustrySearchAll.txt', 'r')
t = f.read()
f.close()
soup = BeautifulSoup(t)
results = [] ## 'list' is a built-in type and shouldn't be used as variable name
for link in soup.find_all('a'):
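    a = link.get('href')
    # keep only links that contain "V", as in the question;
    # 'a' is None for anchors without an href, so check that first
    if a and 'V' in a:
        results.append(a)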
formatted_results = ['http://www.aidn.org.au/{0}'.format(i) for i in results]
output = "\n".join(formatted_results)
g = open('AIDNurl.txt', 'w')
g.write(output)
g.close()
print output
print len(results)
This still doesn't fix your original problem; see my comments and other people's comments on the question.
Upvotes: 2
Reputation: 6891
find_all returns a list of all matching elements. If you just want the first one, you could do this: for link in soup.find_all("a")[:1]:
It's not quite clear why the list contains two copies of each link. You can use print statements to get better insight into the code: print the list, the length of the list, and so on. Or you can use pdb.
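As a rough illustration of that kind of print debugging, the sketch below prints the size of the list, how many distinct entries it holds, and the first few items. The function name describe and the sample links are assumptions made for this example.

```python
def describe(links):
    """Print a quick summary of a list of links and return (total, unique)."""
    total = len(links)
    unique = len(set(links))
    print("total: {0}, unique: {1}".format(total, unique))
    # Show the first few entries so repeated values are easy to spot.
    for link in links[:5]:
        print(repr(link))
    return total, unique

# Hypothetical data standing in for the list built in the loop.
links = ['V1.html', 'V1.html', 'V2.html', 'V2.html']
total, unique = describe(links)

# To step through interactively instead, uncomment the next line:
# import pdb; pdb.set_trace()
```

If total is twice unique, every link really is in the list twice, which points back at the source HTML.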
Upvotes: 0