Reputation: 311
I got this block of code:
def get_spain_accomodations():
    pool = Pool()
    links = soup.find_all('a', class_="hotel_name_link url")
    pool.map(get_page_links, links)
    #for a in soup.find_all('a', class_="hotel_name_link url"):
    #    hotel_url = "https://www.booking.com" + a['href'].strip()
    #    hotels_url_list.append(hotel_url)

def get_page_links(link):
    hotel_url = "https://www.booking.com" + link['href'].strip()
    hotels_url_list.append(hotel_url)
For some reason the hotel_url is not being appended to the list. If I try it with the commented-out loop it actually works, but not with the map() function. I also printed hotel_url on each get_page_links call and it worked. I have no idea what is going on. Below are the function calls.
init_BeautifulSoup()
get_spain_accomodations()
#get_hotels_wifi_rating()

for link in hotels_url_list:
    print link
The code executes without errors, but the link list is never printed.
Upvotes: 0
Views: 3272
Reputation: 18418
It's important to understand that processes run in isolated areas of memory. Each process has its own instance of hotels_url_list, and there's no (easy) way of "sticking" those values into the parent process's list: if you create an instance of list in the parent process, that instance is not the same one the subprocesses use. When you do a fork() (a.k.a. create a subprocess), the memory of the parent process is cloned into the child process. So if the parent had a list instance in the hotels_url_list variable, the child process will also have a list instance (also called hotels_url_list), BUT they will not be the same object: they occupy different areas of memory.
This doesn't happen with Threads. They do share memory.
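To see this concretely, here is a minimal sketch (the names here are illustrative, not taken from the question) contrasting the two: the child process appends to its own cloned copy of the list, while the thread appends to the shared one:

from multiprocessing import Process
from threading import Thread

results = []

def append_item(item):
    results.append(item)

if __name__ == "__main__":
    p = Process(target=append_item, args=("from-process",))
    p.start()
    p.join()
    print(results)  # [] -- the child appended to its own copy of the list

    t = Thread(target=append_item, args=("from-thread",))
    t.start()
    t.join()
    print(results)  # ['from-thread'] -- threads share the parent's memory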
I would say (it's not like I'm much of an expert here) that the canonical way of communicating between processes in this case would be a Queue: the child processes put things in the queue, and the parent process grabs them:
from multiprocessing import Process, Queue

def get_spain_accomodations():
    q = Queue()
    processes = []
    links = ['http://foo.com', 'http://bar.com', 'http://baz.com']
    hotels_url_list = []
    for link in links:
        p = Process(target=get_page_links, args=(link, q,))
        p.start()
        processes.append(p)
    for p in processes:
        p.join()
        hotels_url_list.append(q.get())
    print("Collected: %s" % hotels_url_list)

def get_page_links(link, q):
    print("link==%s" % link)
    hotel_url = "https://www.booking.com" + link
    q.put(hotel_url)

if __name__ == "__main__":
    get_spain_accomodations()
This outputs each link prepended with https://www.booking.com, the prepending happening on independent processes:
link==http://foo.com
link==http://bar.com
link==http://baz.com
Collected: ['https://www.booking.comhttp://foo.com', 'https://www.booking.comhttp://bar.com', 'https://www.booking.comhttp://baz.com']
I don't know if it will help you, but to me it helps to see the Queue as a "shared file" that both processes know about. Imagine you have two completely different programs, and one of them knows it has to write things into a file called /tmp/foobar.txt while the other one knows it has to read from a file called /tmp/foobar.txt. That way they can "communicate" with each other. This paragraph is just a metaphor (although that's pretty much how Unix pipes work)... It's not like queues work exactly like that, but maybe it helps in understanding the concept? Dunno, really, maybe I made it more confusing...
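In that same spirit, here is a minimal sketch using multiprocessing.Pipe, which is a thin wrapper over that idea (this example is mine, not part of the original code): the child writes into one end of the pipe and the parent reads from the other:

from multiprocessing import Process, Pipe

def child(conn):
    # The child writes into its end of the pipe...
    conn.send("https://www.booking.com/some-hotel")
    conn.close()

if __name__ == "__main__":
    parent_conn, child_conn = Pipe()
    p = Process(target=child, args=(child_conn,))
    p.start()
    print(parent_conn.recv())  # ...and the parent reads from the other end
    p.join()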
Another way would be using Threads and collecting their return values, as explained here.
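For example, a minimal sketch using a thread pool (one possible way to do it; this is not the code from the linked answer, and get_page_link is a made-up name): since threads share memory, map() can simply collect the return values:

from multiprocessing.pool import ThreadPool

def get_page_link(link):
    # Runs in a thread, so returning (or appending) works directly.
    return "https://www.booking.com" + link

if __name__ == "__main__":
    links = ['http://foo.com', 'http://bar.com', 'http://baz.com']
    pool = ThreadPool(3)
    # map() blocks until all threads finish and collects their return values
    hotels_url_list = pool.map(get_page_link, links)
    pool.close()
    pool.join()
    print(hotels_url_list)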
Upvotes: 1