user3199535

Reputation: 65

Usage of python-readability

(https://github.com/buriy/python-readability)

I am struggling to use this library and can't find any documentation for it. (Is there any?)

Calling help(Document) gives some usable information, but something is still going wrong.

My code so far:

from readability.readability import Document
import requests

url = 'http://www.somepage.com'

html = requests.get(url, verify=False).content
readable_article = Document(html, negative_keywords='test_keyword').summary()

with open('test.html', 'w', encoding='utf-8') as test_file:
    test_file.write(readable_article)

According to the help(Document) output, it should be possible to pass a list as negative_keywords:

readable_article = Document(html, negative_keywords=['test_keyword1', 'test-keyword2']).summary()

This gives me a bunch of errors I don't understand:

Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'

Traceback (most recent call last):
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 163, in summary
    candidates = self.score_paragraphs()
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 300, in score_paragraphs
    candidates[parent_node] = self.score_node(parent_node)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 360, in score_node
    content_score = self.class_weight(elem)
  File "/usr/lib/python3.4/site-packages/readability/readability.py", line 348, in class_weight
    if self.negative_keywords and self.negative_keywords.search(feature):
AttributeError: 'list' object has no attribute 'search'

Could someone please give me a hint about the error or how to deal with it?

Upvotes: 1

Views: 1254

Answers (1)

Rushy Panchal

Reputation: 17532

There's an error in the library code. If you look at compile_pattern:

def compile_pattern(elements):
    if not elements:
        return None
    elif isinstance(elements, (list, tuple)):
        return list(elements)
    elif isinstance(elements, regexp_type):
        return elements
    else:
        # assume string or string like object
        elements = elements.split(',')
        return re.compile(u'|'.join([re.escape(x.lower()) for x in elements]), re.U)

You can see that it only compiles a regex when elements is a string (or string-like object). A list or tuple is returned as a plain list, and an already-compiled regular expression is returned unchanged.

Later on, though, class_weight assumes that self.negative_keywords is a regular expression and calls .search() on it, which a list does not have. So I suggest you pass your keywords as a single comma-separated string of the form "test_keyword1,test_keyword2". compile_pattern then returns a regular expression, which should fix the error.
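As a rough sketch of that workaround, the list can be converted before it reaches Document. The helper names keywords_to_csv and keywords_to_pattern are my own and not part of the library. The second helper only mirrors the string branch of compile_pattern quoted above; passing a pre-compiled regex relies on the regexp_type branch returning it unchanged, which I am assuming from that snippet rather than from any documentation.

```python
import re


def keywords_to_csv(keywords):
    # Hypothetical helper: join the list into the comma-separated string
    # that compile_pattern splits on. Keywords containing commas would be
    # split apart again, so this assumes none of them do.
    return ",".join(keywords)


def keywords_to_pattern(keywords):
    # Hypothetical helper: build the same kind of regex as the string
    # branch of compile_pattern (escaped, lower-cased, joined with "|").
    return re.compile("|".join(re.escape(k.lower()) for k in keywords), re.U)


keywords = ["test_keyword1", "test_keyword2"]
csv_keywords = keywords_to_csv(keywords)
pattern = keywords_to_pattern(keywords)

# Either value has what class_weight needs once compile_pattern is done
# with it; a plain list does not have a search() method.
print(csv_keywords)
print(hasattr(pattern, "search"), hasattr(keywords, "search"))

# Usage with the library (not executed here; needs readability and html):
# Document(html, negative_keywords=csv_keywords).summary()
# Document(html, negative_keywords=pattern).summary()
```

The comma-separated string is the safer of the two, since it goes through the code path the library clearly intends for strings.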

Upvotes: 1
