saurabh dhokale

Reputation: 3

Optimize string split to get a pandas dataframe

I am extracting data from a large PDF file with regex, using Python in Databricks. The data comes out as one long string, and I use string split functions to turn it into a pandas DataFrame, since I want the final data as a CSV file. The loop that splits the lines takes about 5 hours to run, and I am looking for ways to optimize this. I am new to Python and I am not sure which part of the code I should look at to reduce the running time.

import os
import re

import pandas as pd
import PyPDF2

for pdf in os.listdir(data_directory):
  # open the PDF and create a file reader object
  with open(os.path.join(data_directory, pdf), 'rb') as file:
    fileReader = PyPDF2.PdfFileReader(file)

    num_pages = fileReader.numPages
    #print("total pages = " + str(num_pages))

    # concatenate the text of every page into one long string
    extracted_string = "start of file"
    current_page = 0
    while current_page < num_pages:
      #print("adding page " + str(current_page) + " to the file")
      extracted_string += fileReader.getPage(current_page).extract_text()
      current_page = current_page + 1

  # keep only the lines that start with a dd/mm/yyyy date
  regex_date = r"\d{2}/\d{2}/\d{4}[^\n]*"

  table_lines = re.findall(regex_date, extracted_string)

The above code extracts the data from the PDF.

#create dataframe out of extracted string and load into a single dataframe
table = pd.DataFrame()  # accumulator for all rows
for line in table_lines:
    df = pd.DataFrame([x.split(' ') for x in line.split('\n')])
    df.rename(columns={0: 'date_of_import', 1: 'entry_num', 2: 'warehouse_code_num',
                       3: 'declarant_ref_num', 4: 'declarant_EORI_num', 5: 'VAT_due'},
              inplace=True)
    table = pd.concat([table, df], sort=False)

This part of the code is what takes up most of the time. I have tried different ways to get a DataFrame out of this data, but the above has worked best for me. I am looking for a faster way to run this code.

PDF file for reference: https://drive.google.com/file/d/1ew3Fw1IjeToBA-KMbTTD_hIINiQm0Bkg/view?usp=share_link

Upvotes: 0

Views: 68

Answers (1)

Serge Ballesta

Reputation: 149085

There are 2 immediate optimization steps in your code.

  1. Pre-compile regexes if they are used many times. It may or may not be relevant here, because I could not guess how many times table_lines = re.findall(regex_date, extracted_string) is executed. But this is often more efficient:

     # before any loop
     regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")
     ...
         # inside the loop
         table_lines = regex_date.findall(extracted_string)
    
  2. Do not repeatedly append to a dataframe. A dataframe is a rather complex container, and appending rows is a costly operation: each pd.concat copies the whole table built so far, so the loop is quadratic in the number of rows. It is generally much more efficient to build a plain Python container (list or dict) first and then convert it as a whole to a dataframe (a combined sketch follows this list):

     # each matched line contains no newline, so one split(' ') per line is enough
     data = [line.split(' ') for line in table_lines]
     table = pd.DataFrame(data, columns=['date_of_import', 'entry_num',
                                         'warehouse_code_num', 'declarant_ref_num',
                                         'declarant_EORI_num', 'VAT_due'])
    
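Putting the two together, a minimal end-to-end sketch could look like the one below. It assumes your data_directory variable and the same PyPDF2 calls as in the question, and the output.csv path is a placeholder to adapt:

import os
import re

import pandas as pd
import PyPDF2

# compile the pattern once, outside every loop
regex_date = re.compile(r"\d{2}/\d{2}/\d{4}[^\n]*")

rows = []  # plain Python list: cheap to append to
for pdf in os.listdir(data_directory):
    with open(os.path.join(data_directory, pdf), 'rb') as file:
        fileReader = PyPDF2.PdfFileReader(file)
        # join page texts in one pass instead of repeated string +=
        extracted_string = "".join(
            fileReader.getPage(i).extract_text()
            for i in range(fileReader.numPages))

    # one row of fields per matched line, accumulated across all PDFs
    rows.extend(line.split(' ') for line in regex_date.findall(extracted_string))

# build the dataframe once, at the very end
table = pd.DataFrame(rows, columns=['date_of_import', 'entry_num',
                                    'warehouse_code_num', 'declarant_ref_num',
                                    'declarant_EORI_num', 'VAT_due'])
table.to_csv('output.csv', index=False)  # placeholder output path

With a single DataFrame construction at the end, the total cost grows linearly with the number of rows instead of quadratically.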

Upvotes: 1
