George

Reputation: 105

How to extract specific numbers or text using regex in Python?

I have written code to extract the numbers and the company name from text extracted from a PDF file.

Sample PDF content:

#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G

The example above shows what my PDF file contains. I want to extract the App number, which is the five digits after # (e.g. 88876), and the App name, which follows the "- " (e.g. Sample1), and then write them to an Excel file as separate columns named App_number and App_name.

Please refer to the code below, which I have tried.

import PyPDF2, re
import csv
for k in range(1,100):
    pdfObj = open(r"C:\\Users\merge.pdf",'rb')
    object = PyPDF2.PdfFileReader("C:\\Users\merge.pdf")
    pdfReader = PyPDF2.PdfFileReader(pdfObj)
    NumPages = object.getNumPages()
    pdfReader.numPages

    for i in range(0, NumPages):
        pdfPageObj = pdfReader.getPage(i)
        text = pdfPageObj.extractText()
        x=re.findall('(?<=#).[0-9]+', text)
        y=re.findall("(?<=\- )(.*?)(?=,)", text)
        print(x)
        print(y) 

    with open("out.csv", "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerows(x)

Please share some suggestions.

Upvotes: 1

Views: 223

Answers (1)

imburningbabe

Reputation: 792

Try this:

import re

text = '#88876 - Sample1, GTRHEUSKYTH'

# Digits immediately after '#'
App_number = re.search(r'(?<=#)[0-9]+', text).group()
# Everything between '- ' and the next comma
App_name = re.search(r'(?<=- )(.*?)(?=,)', text).group()

The first regex matches the consecutive digits immediately after #; the second matches everything between "- " and the next comma.
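If the page text contains several such lines, the same idea can probably be extended with re.findall and a single pattern with two groups, then written out as two columns. This is only a rough sketch: the sample text stands in for whatever extractText() returns, and the names ROW_PATTERN, extract_rows and write_rows are made up for illustration.

```python
import csv
import re

# Hypothetical sample standing in for the text extracted from the PDF page.
text = """#88876 - Sample1, GTRHEUSKYTH, -99WED,-0098B
#99945 - SAMPLE2, DJWHVDFWHEF, -8876D,-3445G"""

# Group 1: digits right after '#'. Group 2: the name between '-' and the first comma.
ROW_PATTERN = re.compile(r"#(\d+)\s*-\s*([^,]+),")


def extract_rows(page_text):
    """Return a list of (App_number, App_name) tuples found in page_text."""
    return [(num, name.strip()) for num, name in ROW_PATTERN.findall(page_text)]


def write_rows(rows, f):
    """Write a header plus one CSV row per (App_number, App_name) pair to f."""
    writer = csv.writer(f)
    writer.writerow(["App_number", "App_name"])
    writer.writerows(rows)


if __name__ == "__main__":
    rows = extract_rows(text)
    print(rows)
    # Excel can open the resulting CSV; each tuple becomes one two-column row.
    with open("out.csv", "w", newline="") as f:
        write_rows(rows, f)
```

Passing tuples to writerows is what gives separate columns; passing a list of plain strings, as in the question, makes csv split each string into single characters.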

Hope this helps.

Upvotes: 1
