Reputation: 745

Why is this loop returning twice?

I have the following code:

import re
from bs4 import BeautifulSoup

f = open('AIDNIndustrySearchAll.txt', 'r')
g = open('AIDNurl.txt', 'w')
t = f.read()
soup = BeautifulSoup(t)

list = []
counter = 0

for link in soup.find_all("a"):
    a = link.get('href')
    if re.search("V", a) != None:
        list.append(a)
        counter = counter + 1

new_list = ['http://www.aidn.org.au/{0}'.format(i) for i in list]
output = "\n".join(i for i in new_list)

g.write(output)

print output
print counter

f.close()
g.close()

It is basically going through a saved HTML page and pulling the links I am interested in. I am new to Python, so I am sure the code is terrible but it is (almost) working ;)

The current issue is that it is returning two copies of each link, not one. I am sure it has something to do with the way the loop is set up but am a bit stuck.

I welcome any help on this question (I can provide more details if required - such as HTML and more information on links I am looking for) as well as any general code improvements so I can learn as much as possible.

Upvotes: 1

Answers (3)

Shawn Chin

Reputation: 86974

As others have pointed out in comments, your loop looks OK so the repetition is likely to be in the HTML itself. If you can share a link to the HTML file perhaps we could be of more assistance.
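
One way to check whether the repetition comes from the HTML is to count how often each href occurs before any filtering. This is only a minimal sketch: it assumes the hrefs have already been collected into a list, and the sample values are made up for illustration.

```python
from collections import Counter

# Hypothetical hrefs, standing in for
# [link.get('href') for link in soup.find_all('a')]
hrefs = ['V1.html', 'V1.html', 'V2.html', 'about.html', None]

# Count each href, skipping anchors that have no href attribute
counts = Counter(h for h in hrefs if h is not None)

# Any href counted more than once appears more than once in the page itself
duplicates = sorted(h for h, n in counts.items() if n > 1)
print(duplicates)
```

If every selected link shows up with a count of 2, the page most likely contains each link twice (for example once around an image and once around the text), and the loop is behaving correctly.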

As for general code improvements, here's how I might approach this:

from bs4 import BeautifulSoup
soup = BeautifulSoup(open('AIDNIndustrySearchAll.txt', 'r'))

# create a generator that returns actual href entries
links = (x.get('href') for x in soup.find_all('a'))

# filter the links to only those that contain "V" (skipping anchors
# that have no href) and store them as a set to remove duplicates
selected = set(a for a in links if a and "V" in a)

# build output string using selected links
output = "\n".join('http://www.aidn.org.au/{0}'.format(a) for a in selected)

# write the string to file
with open('AIDNurl.txt', 'w') as f:
  f.write(output)

print output
print len(selected)  # print number of selected links
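
Note that a set does not keep the links in the order they appear on the page. If the order matters, the duplicates can be dropped while keeping the first occurrence of each link. A minimal sketch, where `unique_in_order` is a hypothetical helper name and the sample links are invented:

```python
def unique_in_order(items):
    """Return items with duplicates dropped, keeping first-seen order."""
    seen = set()
    result = []
    for item in items:
        if item not in seen:
            seen.add(item)
            result.append(item)
    return result

# Hypothetical sample links, each repeated as in the question
links = ['V2.html', 'V1.html', 'V2.html', 'V3.html', 'V1.html']
print(unique_in_order(links))
```

The result could then be used in place of the set when building the output string.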

Upvotes: 2

Fabian

Reputation: 4348

Since you've asked for code optimizations too, I will post my suggestions as an answer.

from bs4 import BeautifulSoup

f = open('AIDNIndustrySearchAll.txt', 'r')
t = f.read()
f.close()

soup = BeautifulSoup(t)
results = []   ## 'list' is a built-in type and shouldn't be used as variable name

for link in soup.find_all('a'):
    a = link.get('href')
    if a and 'V' in a:   ## keep links containing "V"; skip anchors without an href
        results.append(a)

formatted_results = ['http://www.aidn.org.au/{0}'.format(i) for i in results]
output = "\n".join(formatted_results)

g = open('AIDNurl.txt', 'w')
g.write(output)
g.close()

print output
print len(results)

This still doesn't fix your original problem; see my comment and other people's comments on the question.

Upvotes: 2

RParadox

Reputation: 6891

find_all returns a list of all matching elements. If you just want the first one, you could do this: for link in soup.find_all("a")[:1]: . It's not quite clear why the list contains each link twice. You can use print statements to get better insight into the code: print the list, the length of the list, and so on. Or you can use pdb.

Upvotes: 0
