Reputation: 11
I have a sitemap with about 21 URLs in it, and each of those URLs contains about 2,000 more URLs. I'm trying to write something that will let me parse each of the original 21 URLs, grab the 2,000 URLs each one contains, and append them to a list.
I've been bashing my head against a wall for a few days now trying to get this to work, but it keeps returning a list of None. I've only been working with Python for about 3 weeks, so I might be missing something really obvious. Any help would be great!
# imports needed by the code below (Python 3; on Python 2 use
# `from urllib2 import urlopen` instead)
from urllib.request import urlopen
import xml.etree.ElementTree as ET

storage = []
storage1 = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)

storage2 = [parser(x) for x in storage]
I also tried using a while loop with a counter, but it always stopped after the first 2,000 URLs.
Upvotes: 1
Views: 126
Reputation: 1271
If I understand your problem correctly, you have two stages in your program: first, build the list of 21 sitemap URLs; then, fetch each of those pages and collect the URLs they contain.
Your first step could look like this:
initial_urls = [('http://...%s...' % x) for x in range(21)]
Then, to populate the large list of URLs from the pages, you could do something like this:
big_list = []

def extract_urls(source):
    tree = ET.parse(urlopen(source))
    for link in get_links(tree):
        big_list.append(link.attrib['href'])

def get_links(tree):
    ...  # define the logic for link extraction here

for url in initial_urls:
    extract_urls(url)

print(big_list)
Note that you'll have to write the procedure that extracts the links from the document yourself.
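For instance, here is a minimal sketch of what get_links might look like, assuming (as the link.attrib['href'] access in extract_urls above does) that the links show up as elements carrying an href attribute. A standard sitemap keeps its URLs in the text of <loc> elements instead, so adjust the extraction logic to match your actual XML:

def get_links(tree):
    # Walk every element in the parsed document and yield the ones
    # that carry an href attribute, which is what extract_urls reads.
    for elem in tree.iter():
        if 'href' in elem.attrib:
            yield elem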
Hope this helps!
Upvotes: 1
Reputation: 2476
You have to return storage1 from the parser function:
def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1
I think this is what you want.
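Keep in mind, though, that parser here returns the shared, module-level storage1, so every entry of storage2 ends up pointing at the same ever-growing list. A small standalone sketch of that aliasing effect, using a hypothetical collect function:

shared = []

def collect(x):
    shared.append(x)   # mutate the shared list...
    return shared      # ...and hand back a reference to that same list

results = [collect(x) for x in range(3)]
print(results[0] is results[1])  # True: every entry aliases one list
print(results)                   # [[0, 1, 2], [0, 1, 2], [0, 1, 2]]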
Upvotes: 0
Reputation: 2311
If you don't declare a return for a function in Python, it automatically returns None. Inside parser you're adding elements to storage1, but you aren't returning anything. I would give this a shot instead.
storage = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    storage1 = []
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1

storage2 = [parser(x) for x in storage]
EDIT: As Amber said, you should also check that all your elements were actually being stored in storage1.
Upvotes: 1
Reputation: 526583
parser() never returns anything, so it defaults to returning None, hence why storage2 contains a list of Nones. Perhaps you want to look at what's in storage1?
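A tiny standalone example of that rule (hypothetical names, nothing to do with the parsing code):

def appender(items):
    items.append(1)    # mutates the list, then falls off the end

result = appender([])
print(result)          # None: no return statement means the call yields None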
Upvotes: 1