Reputation: 1
I found this code on GitHub and I can't get it to work. The point is to convert the PDF to .txt and then list all the words with their frequency of occurrence. I'm beginning to learn Python but I'm not very good yet. I am getting this traceback:
Traceback (most recent call last):
File "C:\python38\Keyword-Extracter-master\keyword_extract_with_weight.py", line 31, in <module>
keywords = re.findall(r'[a-zA-Z]\w+',text)
File "C:\python_3.9\lib\re.py", line 241, in findall
return _compile(pattern, flags).findall(string)
TypeError: cannot use a string pattern on a bytes-like object
Here's the code.
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re

filename = 'test.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process('words.txt', method='tesseract', language='eng')
text = text.encode('ascii', 'ignore').lower()
keywords = re.findall(r'[a-zA-Z]\w+', text)
df = pd.DataFrame(list(set(keywords)), columns=['keywords'])

def weightage(word, text, number_of_documents=1):
    word_list = re.findall(word, text)
    number_of_times_word_appeared = len(word_list)
    tf = number_of_times_word_appeared / float(len(text))
    idf = np.log((number_of_documents) / float(number_of_times_word_appeared))
    tf_idf = tf * idf
    return number_of_times_word_appeared, tf, idf, tf_idf

df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x, text)[0])
df['tf'] = df['keywords'].apply(lambda x: weightage(x, text)[1])
df['idf'] = df['keywords'].apply(lambda x: weightage(x, text)[2])
df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x, text)[3])
df = df.sort_values('tf_idf', ascending=True)
df.head(25)
Any help is appreciated, thanks.
Upvotes: 0
Views: 1586
Reputation: 1
Here's how I got it to work:
import pandas as pd
import numpy as np
import PyPDF2
import textract
import re

filename = 'test.pdf'
pdfFileObj = open(filename, 'rb')
pdfReader = PyPDF2.PdfFileReader(pdfFileObj)
num_pages = pdfReader.numPages
count = 0
text = ""
while count < num_pages:
    pageObj = pdfReader.getPage(count)
    count += 1
    text += pageObj.extractText()
if text != "":
    text = text
else:
    text = textract.process('words.txt', method='tesseract', language='eng')
text = text.encode('ascii', 'ignore').lower()
# decode the bytes back into a str so re can apply a string pattern to it
text_decoded = text.decode()
keywords = re.findall(r'[a-zA-Z]\w+', text_decoded)
df = pd.DataFrame(list(set(keywords)), columns=['keywords'])

def weightage(word, text, number_of_documents=1):
    word_list = re.findall(word, text)
    number_of_times_word_appeared = len(word_list)
    tf = number_of_times_word_appeared / float(len(text))
    idf = np.log((number_of_documents) / float(number_of_times_word_appeared))
    tf_idf = tf * idf
    return number_of_times_word_appeared, tf, idf, tf_idf

df['number_of_times_word_appeared'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[0])
df['tf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[1])
df['idf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[2])
df['tf_idf'] = df['keywords'].apply(lambda x: weightage(x, text_decoded)[3])
df = df.sort_values('tf_idf', ascending=True)
df.head(25)
df.to_csv('out_put.csv', index=False)
print(df)
Upvotes: 0
Reputation: 1089
Python 3 has two different string types, bytes and str, and different APIs work on different types. The error is telling you that re wants to deal with str, not bytes. The line text = text.encode('ascii','ignore').lower() turns your text into bytes, so you will need to turn it back into a string with decode(), or never convert it to a byte-string in the first place.
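For example, here's a minimal sketch of both options (the sample string is just a stand-in for whatever your PDF extraction produced):
import re

text = "Sample Text pulled from the PDF"  # stand-in for the extracted text (a str)

# Option 1: keep the ASCII-stripping step, but decode back to str before re sees it
cleaned = text.encode('ascii', 'ignore').lower().decode('ascii')
print(re.findall(r'[a-zA-Z]\w+', cleaned))

# Option 2: never convert to bytes at all and just lowercase the str
cleaned = text.lower()
print(re.findall(r'[a-zA-Z]\w+', cleaned))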
Upvotes: 1