Reputation: 3
I am attempting to write a multithreaded Python web crawler. The main issue I am having is running the code concurrently with ThreadPoolExecutor.
def crawl(self, url):
    for link in self.get_links(url):
        if link in self.visited:
            continue
        print("Scraping URL: {}".format(link))
        # if not visited, add to visited set (O(1) time)
        self.visited.add(link)
        info = self.extract_info(link)
    return("word")
For now, my crawl function just returns a placeholder string.
I have a starter function which starts the pool with a max of 2 workers:
def start(self):
    job = self.pool.submit(self.crawl(self.startingUrl))
    job.add_done_callback(self.appendText)
The problem arises in the appendText function, where I am trying to convert the future object back to a string so I can write it to a file:
def appendText(self, res):
    print("HELLO!")
    print("res = ", res.result())
    with open("Crawled.txt", "w") as file:
        des = "Description: {}".format(res.result())
        key = "Keywords:{}".format(res.result())
        file.write(des)
        file.write(key)
I ended up getting a TypeError and have been trying to understand how to convert the future object to a string:
HELLO!
Traceback (most recent call last):
  File "crawler/crawler.py", line 78, in <module>
    crawler.start()
  File "crawler/crawler.py", line 73, in start
    job.add_done_callback(self.appendText)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 403, in add_done_callback
    fn(self)
  File "crawler/crawler.py", line 53, in appendText
    print("res = ", res.result())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
TypeError: 'str' object is not callable
Where am I going wrong in this? Thank you!
Upvotes: 0
Views: 1136
Reputation: 45816
crawl returns a string. As your code stands, you're calling crawl yourself and then passing the string it returned to submit. The pool then attempts to execute that string as a function, hence the error.
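The failure described above can be reproduced in isolation. This is a minimal sketch, assuming a free-standing crawl function and a placeholder URL in place of the question's class; neither is taken from the original crawler.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(url):
    # Hypothetical stand-in for the question's crawl method: it only returns a string.
    return "word"


with ThreadPoolExecutor(max_workers=2) as pool:
    # crawl(...) runs right here in the calling thread, so submit receives "word".
    future = pool.submit(crawl("http://example.com"))
    # The worker tried to call the string; the exception is stored on the future.
    error = future.exception()

print(type(error).__name__, error)
```

Note that submit itself does not raise. The TypeError only surfaces later, when result() or exception() is called on the future, which is why the traceback in the question points into appendText.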
You want to pass the function uncalled to submit, and have it call crawl for you:
self.pool.submit(self.crawl, self.startingUrl)
The first argument is the function you want the pool to call, and any remaining arguments are passed to that function when it runs. (The target and args keywords belong to threading.Thread, not to submit.)
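Put together with the done callback, the pattern might look like the sketch below. The Crawler class here is a stripped-down, hypothetical version of the one in the question: crawl returns a placeholder string, and the callback collects results in a list rather than writing to Crawled.txt.

```python
from concurrent.futures import ThreadPoolExecutor


class Crawler:
    def __init__(self, starting_url):
        self.starting_url = starting_url
        self.pool = ThreadPoolExecutor(max_workers=2)
        self.collected = []  # stands in for the output file

    def crawl(self, url):
        # Placeholder for the real link-following logic.
        return "word from {}".format(url)

    def append_text(self, future):
        # The callback receives the finished Future; result() unwraps the string.
        self.collected.append(future.result())

    def start(self):
        # Pass the function and its argument separately; the pool makes the call.
        job = self.pool.submit(self.crawl, self.starting_url)
        job.add_done_callback(self.append_text)
        return job


crawler = Crawler("http://example.com")
job = crawler.start()
crawler.pool.shutdown(wait=True)  # wait for the worker to finish
print(job.result())
```

Once the future holds a real return value, res.result() in the callback is already the string, so no further conversion is needed.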
You could also use this roughly equivalent way:
self.pool.submit(lambda: self.crawl(self.startingUrl))
By wrapping the call in a lambda, you delay execution until the pool invokes it. Prefer the first way though, as the lambda adds a small amount of overhead. I'm including it here for reference.
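To see that the lambda really does defer the call, here is a small sketch that records each invocation. The calls list, the free-standing crawl function, and the URL are illustrative assumptions, not part of the original crawler.

```python
from concurrent.futures import ThreadPoolExecutor

calls = []  # records every URL that crawl was actually invoked with


def crawl(url):
    calls.append(url)
    return "word"


# Creating the lambda does not run crawl; it only packages the call for later.
deferred = lambda: crawl("http://example.com")
calls_before_submit = list(calls)  # still empty at this point

with ThreadPoolExecutor(max_workers=2) as pool:
    via_lambda = pool.submit(deferred).result()
    via_args = pool.submit(crawl, "http://example.com").result()

print(calls_before_submit, via_lambda, via_args)
```

Both submissions hand the pool something callable, so both futures resolve to the same string.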
Upvotes: 1