user2494409

Reputation: 11

How to apply a function on each item in a list

I have a sitemap with about 21 urls on it and each of those urls contains about 2000 more urls. I'm trying to write something that will allow me to parse each of the original 21 urls and grab their containing 2000 urls then append it to a list.

I've been bashing my head against a wall for a few days now trying to get this to work, but it keeps returning a list of 'None'. I've only been working with Python for about 3 weeks now, so I might be missing something really obvious. Any help would be great!

import xml.etree.ElementTree as ET
from urllib import urlopen

storage = []
storage1 = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)

storage2 = [parser(x) for x in storage]

I also tried using a while loop with a counter, but it always stopped after the first 2000 urls.

Upvotes: 1

Views: 126

Answers (4)

orlenko

Reputation: 1271

If I understand your problem correctly, you have two stages in your program:

  1. You generate an initial list of the 21 URLs.
  2. You fetch the page at each of those URLs, and extract additional URLs from the page.

Your first step could look like this:

initial_urls = [('http://...%s...' % x) for x in range(21)]

Then, to populate the large list of URLs from the pages, you could do something like this:

big_list = []

def extract_urls(source):
    tree = ET.parse(urlopen(source))
    for link in get_links(tree):
        big_list.append(link.attrib['href'])

def get_links(tree):
    ...  # define the logic for link extraction here

for url in initial_urls:
    extract_urls(url)

print big_list

Note that you'll have to write the procedure that extracts the links from the document yourself.
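As a sketch of that missing piece: assuming the fetched documents are standard sitemap XML (not HTML), the URLs live as the *text* of namespaced `<loc>` elements rather than in `href` attributes, so `get_links` would yield strings and `extract_urls` would append them directly. The namespace URI below is the usual sitemaps.org one, but check it against your actual documents:

```python
import xml.etree.ElementTree as ET

# The standard sitemap namespace; verify against your own XML.
SITEMAP_NS = '{http://www.sitemaps.org/schemas/sitemap/0.9}'

def get_links(tree):
    # In sitemap XML each URL is the text of a namespaced <loc> element,
    # so we yield plain strings, not elements with an 'href' attribute.
    for loc in tree.iter(SITEMAP_NS + 'loc'):
        yield loc.text

# A tiny inline sitemap to demonstrate the extraction:
sample = '''<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>http://example.com/page1</loc></url>
  <url><loc>http://example.com/page2</loc></url>
</urlset>'''

tree = ET.ElementTree(ET.fromstring(sample))
urls = list(get_links(tree))
# urls is ['http://example.com/page1', 'http://example.com/page2']
```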

Hope this helps!

Upvotes: 1

rajpy

Reputation: 2476

You have to return storage1 from the parser function:

def parser(any):
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    for i in range(len(storage)):
        x = (root[i][0]).text
        storage1.append(x)
    return storage1

I think this is what you want.

Upvotes: 0

FastTurtle

Reputation: 2311

If you don't write a return statement in a Python function, it automatically returns None. Inside parser you're adding elements to storage1 but aren't returning anything. I would give this a shot instead:

storage = []

for x in range(21):
    url = 'first part of the url' + str(x) + '.xml'
    storage.append(url)

def parser(any):
    storage1 = []
    tree = ET.parse(urlopen(any))
    root = tree.getroot()
    # Iterate over the children of this page's root, not over the 21
    # sitemap URLs -- range(len(storage)) would keep only 21 entries per page.
    for child in root:
        storage1.append(child[0].text)
    return storage1

storage2 = [parser(x) for x in storage]

EDIT: As Amber said, you should also check storage1 — your elements were actually being stored there all along.
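One thing to note with this version: each call to parser returns its own list, so storage2 ends up as a list of lists, one per sitemap page. If you want one flat list of URLs instead, a nested comprehension does it (the two short lists below are stand-in data for illustration):

```python
# storage2 as produced above is a list of lists (one per sitemap page);
# these short lists stand in for the real ~2000-entry ones.
storage2 = [['http://a/1', 'http://a/2'], ['http://b/1']]
flat = [u for page in storage2 for u in page]
# flat is ['http://a/1', 'http://a/2', 'http://b/1']
```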

Upvotes: 1

Amber

Reputation: 526583

parser() never returns anything, so it defaults to returning None, hence why storage2 contains a list of Nones. Perhaps you want to look at what's in storage1?
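A quick way to see this behaviour in isolation (a minimal demo, not code from the question):

```python
def appends_only(acc, item):
    # No return statement, so every call evaluates to None.
    acc.append(item)

results = []
returned = [appends_only(results, x) for x in range(3)]
# returned is [None, None, None]; the data actually landed in results
```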

Upvotes: 1
