Chipmunk_da

Reputation: 507

Pandas: load a quote-qualified .txt file whose column values are split across multiple rows

I've got multiple .txt files (c. 2.5 GB each) that I want to load into a SQL database. I'm trying to use pandas for this, by first reading each file into a dataframe (possibly in chunks to avoid memory issues) and then loading the dataframe into the SQL database.

Each file has only 2 columns (the column delimiter is ¬), but the values in the second column can be split across multiple lines. The column values are qualified/wrapped in double quotes. When I read the file, pandas splits these values into multiple rows every time it encounters a newline character, and the data then ends up in the wrong columns. I want everything enclosed in double quotes to be read as a single text stream/value. How can I correct this?

PS: sometimes the column delimiter also appears within the data, but always inside the double quotes.

Sample Data

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

My Code

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
df = pd.read_csv(path, quotechar='"', engine='python', sep = "\¬")


# Check
df

# Add metadata
df['loaded_by'] = 'xxxx'
df['file_line_number'] = range(2, len(df) + 2)
df['load_date'] = pd.Timestamp.now()
df['source_file'] = path
df['loading_script'] = r'xxxx.ipynb'

db = DBConnection(server='XXXX', database='XXXX')
df.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')

df_from_sql = pd.read_sql('select count(*) from src.table_name', db.engine)
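
For reference, the chunked variant mentioned at the top could look roughly like this. It is only a sketch: the chunk size is illustrative, it reuses the db object from above, and it assumes a read_csv call that actually parses the file correctly (shown here with a plain '¬' separator for illustration).

# Minimal sketch of chunked loading; chunk size is illustrative
reader = pd.read_csv(path, sep='¬', quotechar='"', engine='python', chunksize=100_000)
for chunk in reader:
    chunk['loaded_by'] = 'xxxx'
    chunk['load_date'] = pd.Timestamp.now()
    chunk['source_file'] = path
    # (file_line_number would need to be tracked across chunks)
    chunk.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')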

Upvotes: 0

Views: 937

Answers (1)

blondelg

Reputation: 986

Here is a script you can use as a starting point for processing your data:

Load the text file
with open('data.txt') as file:
    data = file.read()
Build a list of records, using '"\n"' (closing quote, newline, opening quote) as a custom record separator
data = data.split('"\n"')
Split each record on '"¬"' as the column separator
data = [line.split('"¬"') for line in data]
Clean up the remaining double quotes
data = [[e.replace('"', '') for e in line] for line in data]
Load the data into a dataframe
df = pd.DataFrame(data)
Reformat the dataframe
df.columns = df.iloc[0]  # Set first row as column index
df = df.iloc[1:].reset_index(drop=True)  # Drop first row and reset index
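
Since the real files are around 2.5 GB, reading the whole file into one string may not be practical. The same idea can be applied record by record; the sketch below is only illustrative and relies on the same assumption as the split above, namely that a record ends with a closing quote at the end of a line and the next record starts with an opening quote.

import itertools
import pandas as pd

def iter_records(path, encoding='utf-8'):
    # Yield one logical record at a time, using the same closing-quote /
    # opening-quote boundary as the split('"\n"') above.
    buffer = []
    with open(path, encoding=encoding) as fh:
        for line in fh:
            if buffer and line.startswith('"') and buffer[-1].rstrip().endswith('"'):
                yield ''.join(buffer)
                buffer = []
            buffer.append(line)
    if buffer:
        yield ''.join(buffer)

def parse_record(record):
    # Same cleanup as above: split on '"¬"', then drop the remaining quotes
    return [part.replace('"', '').strip('\n') for part in record.split('"¬"')]

records = (parse_record(r) for r in iter_records('data.txt'))
header = next(records)              # first record holds the column names

# Process in batches so the whole file never sits in memory at once
while True:
    batch = list(itertools.islice(records, 50_000))
    if not batch:
        break
    df = pd.DataFrame(batch, columns=header)
    # ... write each batch to SQL with df.to_sql(...), as in the question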

UPDATE: in one script

data = """"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"
"""

data = data.split('"\n"')
data = [line.split('"¬"') for line in data]
data = [[e.replace('"', '') for e in line] for line in data]
df = pd.DataFrame(data)
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
df
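
A quick sanity check on the result for the sample above: the frame should have two rows and two columns, with each multi-line value kept in a single LINE_TEXT cell.

print(df.shape)                # expect (2, 2)
print(df.loc[0, 'LINE_TEXT'])  # the full multi-line text of the first record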

Upvotes: 1
