Ian Spitz

Reputation: 311

Appending an item to a list using multiprocessing in Python

I've got this block of code:

def get_spain_accomodations():
    pool = Pool()
    links = soup.find_all('a', class_="hotel_name_link url")
    pool.map(get_page_links, links)

    #for a in soup.find_all('a', class_="hotel_name_link url"):
    #    hotel_url = "https://www.booking.com" + a['href'].strip()
    #    hotels_url_list.append(hotel_url)

def get_page_links(link):
    hotel_url = "https://www.booking.com" + link['href'].strip()
    hotels_url_list.append(hotel_url)

For some reason the hotel_url is not being appended to the list. If I try the commented loop it actually works, but not with the map() function. I also printed hotel_url on each get_page_links call and it worked. I have no idea what is going on. Below are the function calls.

init_BeautifulSoup()
get_spain_accomodations()
#get_hotels_wifi_rating()

for link in hotels_url_list:
    print link

The code runs without errors, but the list of links is never printed.

Upvotes: 0

Views: 3272

Answers (1)

Savir

Reputation: 18418

It's important to understand that processes run in isolated areas of memory. Each process has its own instance of hotels_url_list and there's no (easy) way of "sticking" those values into the parent process' list. The list you create in the parent process is not the same one the subprocesses use: when you .fork() (a.k.a. create a subprocess), the memory of the parent process is cloned into the child process. So, if the parent had a list instance in the hotels_url_list variable, the child process will also have a list (also called hotels_url_list), BUT they will not be the same object (they'll occupy different areas of memory).
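
Here's a minimal sketch of that isolation (not your code; the URL is made up): the child appends to its own copy of the list, and the parent never sees it.

from multiprocessing import Process

hotels_url_list = []


def append_url(url):
    # This mutates the child process' copy of the list only
    hotels_url_list.append(url)
    print("inside the child: %s" % hotels_url_list)  # ['https://example.com']


if __name__ == "__main__":
    p = Process(target=append_url, args=("https://example.com",))
    p.start()
    p.join()
    print("in the parent: %s" % hotels_url_list)  # still [] -- the append was lost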

This doesn't happen with Threads. They do share memory.
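
For contrast, here's the same sketch with a Thread instead of a Process (again just an illustration with a made-up URL). This time the parent does see the append, because threads share the parent's memory:

from threading import Thread

hotels_url_list = []


def append_url(url):
    hotels_url_list.append(url)  # threads share memory, so this is visible to the parent


t = Thread(target=append_url, args=("https://example.com",))
t.start()
t.join()
print(hotels_url_list)  # ['https://example.com']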

I would say (it's not like I'm much of an expert here) that the canonical way of communicating between processes in this case would be a Queue: the child processes put things in the queue, and the parent process grabs them:

from multiprocessing import Process, Queue


def get_spain_accomodations():
    q = Queue()
    processes = []
    links = ['http://foo.com', 'http://bar.com', 'http://baz.com']
    hotels_url_list = []
    for link in links:
        # one worker process per link; each child gets a handle to the shared queue
        p = Process(target=get_page_links, args=(link, q,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
        hotels_url_list.append(q.get())  # collect one result per child
    print("Collected: %s" % hotels_url_list)


def get_page_links(link, q):
    print("link==%s" % link)
    hotel_url = "https://www.booking.com" + link
    q.put(hotel_url)


if __name__ == "__main__":
    get_spain_accomodations()

This outputs each link prepended with https://www.booking.com, with the prepending happening in independent processes:

link==http://foo.com
link==http://bar.com
link==http://baz.com
Collected: ['https://www.booking.comhttp://foo.com', 'https://www.booking.comhttp://bar.com', 'https://www.booking.comhttp://baz.com']

I don't know if it will help you, but to me it helps to see the Queue as a "shared file" that both processes know about. Imagine you have two completely different programs, and one of them knows it has to write things into a file called /tmp/foobar.txt while the other one knows it has to read from a file called /tmp/foobar.txt. That way they can "communicate" with each other. This paragraph is just a metaphor (although that's pretty much how Unix pipes work)... it's not like queues work exactly like that, but maybe it helps to understand the concept.

Another way would be using Threads and collecting their return values, as explained here.
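
For example (just a sketch of that idea, not necessarily what the linked answer does), concurrent.futures lets you map over the links with threads and get the return values back directly:

from concurrent.futures import ThreadPoolExecutor


def get_page_link(link):
    return "https://www.booking.com" + link


links = ['http://foo.com', 'http://bar.com', 'http://baz.com']

with ThreadPoolExecutor() as executor:
    hotels_url_list = list(executor.map(get_page_link, links))

print(hotels_url_list)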

Upvotes: 1
