Reputation: 1
I found this code on GitHub and I can't get it to work. The point is to convert the PDF to .txt and then list all the words with their frequency of occurrence. I'm beginning to learn Python but I'm not very good yet. I am getting this traceback:
Traceback (most recent call last):
File "C:\python38\Keyword-Extracter-master\keyword_extract_with_weight.py", line 31, in <module>
keywords = re.findall(r'[a-zA-Z]\w+',text)
File "C:\python_3.9\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
Here's the code.
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re

filename = 'test.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process('words.txt', method='tesseract', language='eng')
text = text.encode('ascii', 'ignore').lower()
keywords = re.findall(r'[a-zA-Z]\w+', text)
df = pd.DataFrame(list(set(keywords)), columns=['keywords'])

def weightage(word, text, number_of_documents=1):
    word_list = re.findall(word, text)
    number_of_times_word_appeared = len(word_list)
    tf = number_of_times_word_appeared / float(len(text))
    idf = np.log((number_of_documents) / float(number_of_times_word_appeared))
    tf_idf = tf * idf
    return number_of_times_word_appeared, tf, idf, tf_idf

df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x, text)[0])
df['tf'] = df['keywords'].apply(lambda x: weightage(x, text)[1])
df['idf'] = df['keywords'].apply(lambda x: weightage(x, text)[2])
df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x, text)[3])
df = df.sort_values('tf_idf', ascending=True)
df.head(25)
Any help is appreciated, thanks.
Upvotes: 0
Views: 1586
Reputation: 1
Here's how I got it to work:
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re

filename = 'test.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process('words.txt', method='tesseract', language='eng')
text = text.encode('ascii', 'ignore').lower()
# decode the bytes back into a str so re can apply a string pattern to it
text_decoded = text.decode()
keywords = re.findall(r'[a-zA-Z]\w+', text_decoded)
df = pd.DataFrame(list(set(keywords)), columns=['keywords'])

def weightage(word, text, number_of_documents=1):
    word_list = re.findall(word, text)
    number_of_times_word_appeared = len(word_list)
    tf = number_of_times_word_appeared / float(len(text))
    idf = np.log((number_of_documents) / float(number_of_times_word_appeared))
    tf_idf = tf * idf
    return number_of_times_word_appeared, tf, idf, tf_idf

df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[0])
df['tf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[1])
df['idf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[2])
df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[3])
df = df.sort_values('tf_idf', ascending=True)
df.head(25)
df.to_csv('out_put.csv', index=False)
print(df)
Upvotes: 0
Reputation: 1089
Python 3 has two different string types, bytes and str, and different APIs work on different types. The error is telling you that re wants to deal with str, not bytes. The line text = text.encode('ascii','ignore').lower() turns your text into bytes, so you will need to turn it back into a string with decode(), or never convert it to a byte-string in the first place.
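For example, here's a minimal sketch of both options (the sample string is just a stand-in for whatever your PDF extraction produced):
import re

text = "Sample Text pulled from the PDF"  # stand-in for the extracted text (a str)

# Option 1: keep the ASCII-stripping step, but decode back to str before re sees it
cleaned = text.encode('ascii', 'ignore').lower().decode('ascii')
print(re.findall(r'[a-zA-Z]\w+', cleaned))

# Option 2: never convert to bytes at all and just lowercase the str
cleaned = text.lower()
print(re.findall(r'[a-zA-Z]\w+', cleaned))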
Upvotes: 1