Reputation: 105
I have written the code to extract the numbers and the company name from the extracted pdf file.
sample pdf content:
#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G
The above example is what my pdf file contains. I wanted to extract the App number which is after # (i.e) five numbers(88876) and App name which is after the (-) (i.e) Sample1. An write that to an excel file as separate columns which is App_number and App_name.
Please refer the below code which I have tried.
import PyPDF2, re
import csv
for k in range(1,100):
pdfObj = open(r"C:\\Users\merge.pdf",'rb')
object = PyPDF2.PdfFileReader("C:\\Users\merge.pdf")
pdfReader = PyPDF2.PdfFileReader(pdfObj)
NumPages = object.getNumPages()
pdfReader.numPages
for i in range(0, NumPages):
pdfPageObj = pdfReader.getPage(i)
text = pdfPageObj.extractText()
x=re.findall('(?<=#).[0-9]+', text)
y=re.findall("(?<=\- )(.*?)(?=,)", text)
print(x)
print(y)
with open("out.csv", "w", newline="") as f:
writer = csv.writer(f)
writer.writerows(x)
Please pour some suggestions.
Upvotes: 1
Views: 223
Reputation: 792
Try this:
text = '#88876 - Sample1, GTRHEUSKYTH'
App_number = re.search('(?<=#).[0-9]+', text).group()
App_name = re.search("(?<=\- )(.*?)(?=,)", text).group()
In the first regex you get the first consecutive digits after #
, in the second one you get everything between -
and ,
Hope it helped
Upvotes: 1