alapalak

Reputation: 147

"TypeError: expected string or buffer" when using tika to parse documents in Python

I am trying to parse some documents (of the types listed in filetypes) using Apache Tika. This is my code in Python:

import json
import urllib2

from nltk import ne_chunk, pos_tag, word_tokenize
from nltk.tree import Tree
from tika import parser

auth = urllib2.HTTPPasswordMgrWithDefaultRealm()
auth.add_password(None, url, user, password)
urllib2.install_opener(urllib2.build_opener(urllib2.HTTPBasicAuthHandler(auth)))

outpage = urllib2.urlopen(url)
data = json.loads(outpage.read().decode('utf-8'))
dictitems = data.values()
flattened_list = [y for x in dictitems for y in x]

filetypes = [".pdf", ".doc", ".docx", ".txt"]

def tikiparse(fi):
    for i in filetypes:
        if fi.endswith(i):
            text = parser.from_file(fi, "http://localhost:9998/")
            extractedcontent = text["content"]

            chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
            current_chunk = []
            cont_chunk = []

            for j in chunked:
                if type(j) == Tree:
                    current_chunk.append(" ".join([token for token, pos in j.leaves()]))
                elif current_chunk:
                    named_entity = " ".join(current_chunk)
                    if named_entity not in cont_chunk:
                        cont_chunk.append(named_entity)
                        current_chunk = []
                else:
                    continue
            return cont_chunk

The loop runs fine for a while and extracts named entities from several documents, but then it fails abruptly with the following error. What is going wrong with the code?

Traceback (most recent call last):
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 74, in <module>
    tikiparse(f)
  File "C:/Users/Kalapala/PycharmProjects/Attachments/DownloadFiles.py", line 57, in tikiparse
    chunked = ne_chunk(pos_tag(word_tokenize(extractedcontent)))
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 130, in word_tokenize
    sentences = [text] if preserve_line else sent_tokenize(text, language)
  File "C:\Python27\lib\site-packages\nltk\tokenize\__init__.py", line 97, in sent_tokenize
    return tokenizer.tokenize(text)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1235, in tokenize
    return list(self.sentences_from_text(text, realign_boundaries))
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1283, in sentences_from_text
    return [text[s:e] for s, e in self.span_tokenize(text, realign_boundaries)]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1274, in span_tokenize
    return [(sl.start, sl.stop) for sl in slices]
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1314, in _realign_boundaries
    for sl1, sl2 in _pair_iter(slices):
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 312, in _pair_iter
    prev = next(it)
  File "C:\Python27\lib\site-packages\nltk\tokenize\punkt.py", line 1287, in _slices_from_text
    for match in self._lang_vars.period_context_re().finditer(text):
TypeError: expected string or buffer

Process finished with exit code 1

Upvotes: 0

Views: 974

Answers (1)

Haifeng Zhang

Reputation: 31905

The issue is that word_tokenize() expects a string, but you are passing it a value of another type. You have to make sure your extractedcontent is a string.

Based on your UnicodeDecodeError comment, the values of the dictionary text contain some characters that cannot be encoded/decoded. You can call encode('utf-8').strip() on the value, for instance extractedcontent.encode('utf-8').strip(), to resolve it.
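As a minimal sketch of the first point (written for Python 3 here; the helper name safe_content is hypothetical, not part of your code): Tika's parser.from_file() returns a dict whose "content" value can be None when nothing could be extracted from a file, and passing that None to word_tokenize() raises exactly this TypeError, so guard the value before tokenizing.

```python
def safe_content(parsed):
    # parsed is the dict returned by tika's parser.from_file().
    # Its "content" value may be None when extraction fails; passing
    # None to word_tokenize() raises "expected string or buffer".
    content = parsed.get("content")
    if not isinstance(content, str):
        return ""
    return content.strip()

print(safe_content({"content": None}))        # empty string
print(safe_content({"content": "  Hello "}))  # "Hello"
```

In your tikiparse() you would then tokenize safe_content(text) and skip the file when it comes back empty.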

Hope it helps.

Upvotes: 1
