Simi Kareem

Reputation: 3

TypeError: 'str' object is not callable in multithreaded web crawler

I am writing a Python web crawler and trying to make it multithreaded. The main issue I am having is running the code concurrently using ThreadPoolExecutor from concurrent.futures.

def crawl(self, url):
    for link in self.get_links(url):
      if link in self.visited:
        continue
      print("Scraping URL: {}".format(link))
      #if not visited add to visited set O(1) time
      self.visited.add(link)
      info = self.extract_info(link)
      return("word")

For now, my crawl function just returns a placeholder string.

I have a starter function which starts the pool with a max of 2 workers:

  def start(self):
    job = self.pool.submit(self.crawl(self.startingUrl))
    job.add_done_callback(self.appendText)

The problem arises in the appendText function, where I am trying to get a string back out of the future object so I can write it to a file:

def appendText(self,res):
    print("HELLO!")
    print("res = ", res.result())

    with open("Crawled.txt","w") as file:
      des = "Description: {}".format(res.result())
      key = "Keywords:{}".format(res.result())
      file.write(des)
      file.write(key)

I end up getting a TypeError, and I am trying to understand how to get the future's result as a string:

HELLO!
Traceback (most recent call last):
  File "crawler/crawler.py", line 78, in <module>
    crawler.start()
  File "crawler/crawler.py", line 73, in start
    job.add_done_callback(self.appendText)
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 403, in add_done_callback
    fn(self)
  File "crawler/crawler.py", line 53, in appendText
    print("res = ", res.result())
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 425, in result
    return self.__get_result()
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/_base.py", line 384, in __get_result
    raise self._exception
  File "/Library/Frameworks/Python.framework/Versions/3.7/lib/python3.7/concurrent/futures/thread.py", line 57, in run
    result = self.fn(*self.args, **self.kwargs)
TypeError: 'str' object is not callable

Where am I going wrong in this? Thank you!

Upvotes: 0

Views: 1136

Answers (1)

Carcigenicate

Reputation: 45816

crawl returns a string. As your code is written now, you're calling crawl yourself and then giving the string it returned to submit. The pool then attempts to call that string as if it were a function, hence the error.
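To see this in isolation, here is a minimal sketch that reproduces the same error outside the crawler. The free function crawl and the example URL are hypothetical stand-ins for the code in the question, not the original class.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(url):
    # Hypothetical stand-in for the question's crawl method.
    return "word"


with ThreadPoolExecutor(max_workers=2) as pool:
    # crawl(...) runs right here, in the calling thread.
    # The pool only ever receives its return value, the string "word".
    future = pool.submit(crawl("http://example.com"))
    # exception() waits for the worker and returns whatever it raised.
    error = future.exception()

print(type(error).__name__, error)
```

The exception is raised inside the worker thread when it tries to call the string, and it only surfaces later when result() or exception() is called on the future, which is why the traceback points into appendText.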

You want to pass the function uncalled to submit, and have it call crawl for you:

self.pool.submit(self.crawl, self.startingUrl)

The first argument is the function you want the pool to call; any further arguments are passed on to that function when it runs.

You could also use this roughly equivalent way:

self.pool.submit(lambda: self.crawl(self.startingUrl))

By wrapping the call in a lambda, you delay its execution until the pool runs it. Prefer the first way though, as the lambda adds a small amount of overhead. I'm including it here for reference.
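Putting it together, a minimal sketch of the working pattern might look like the following. The names crawl, append_text, and results and the example URL are assumptions for illustration; the real crawler would keep these as methods on its class and write to a file instead of a list.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl(url):
    # Hypothetical stand-in: pretend to scrape the page and return some text.
    return "word from {}".format(url)


results = []


def append_text(future):
    # A done callback receives the Future itself;
    # result() gives back whatever crawl returned.
    results.append(future.result())


with ThreadPoolExecutor(max_workers=2) as pool:
    # Pass the function and its argument separately; the pool makes the call.
    job = pool.submit(crawl, "http://example.com")
    job.add_done_callback(append_text)

# Leaving the with block waits for the submitted work to finish.
print(results)
```

Since result() already returns the string that crawl produced, no conversion from the future is needed inside the callback.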

Upvotes: 1
