Chipmunk_da

Reputation: 507

Pandas: load a quote-qualified .txt file whose column values are split across multiple rows

I've got multiple .txt files (c. 2.5 GB each) that I want to load into a SQL database. I'm trying to use pandas for this, by first reading each file into a dataframe (possibly in chunks to avoid memory issues) and then loading the dataframe into the SQL database.

Each file has only 2 columns (the column delimiter is ¬), but the values in the second column can be split across multiple lines. The column values are qualified/wrapped in double quotes. When I read the file, pandas splits these values into multiple rows every time it encounters a newline character, and the data then ends up in the wrong columns. I want everything enclosed in double quotes to be read as a single text stream/value. How can I correct this?

PS: sometimes the column delimiter also appears within the data, but always inside the double quotes.

Sample Data

"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"

My Code

import pandas as pd
import os
from dbconnection import DBConnection

path = r'C:\Sample_load_file.txt'
df = pd.read_csv(path, quotechar='"', engine='python', sep = "\¬")


# Check
df

# Add metadata
df['loaded_by'] = 'xxxx'
df['file_line_number'] = range(2, len(df) + 2)
df['load_date'] = pd.Timestamp.now()
df['source_file'] = path
df['loading_script'] = r'xxxx.ipynb'

db = DBConnection(server='XXXX', database='XXXX')
df.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')

df_from_sql = pd.read_sql('select count(*) from src.table_name', db.engine)
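
For reference, the chunked variant mentioned at the top could look roughly like this. It is only a sketch: the chunk size is illustrative, it reuses the db object from above, and it assumes a read_csv call that actually parses the file correctly (shown here with a plain '¬' separator for illustration).

# Minimal sketch of chunked loading; chunk size is illustrative
reader = pd.read_csv(path, sep='¬', quotechar='"', engine='python', chunksize=100_000)
for chunk in reader:
    chunk['loaded_by'] = 'xxxx'
    chunk['load_date'] = pd.Timestamp.now()
    chunk['source_file'] = path
    # (file_line_number would need to be tracked across chunks)
    chunk.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')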

Upvotes: 0

Views: 937

Answers (1)

blondelg

Reputation: 986

Here is a script you can use as a starting point for processing your data:

Load the text file
with open('data.txt') as file:
    data = file.read()
Build a list of records, using '"\n"' (closing quote, newline, opening quote) as a custom record separator
data = data.split('"\n"')
Split each record on '"¬"' as the column separator
data = [line.split('"¬"') for line in data]
Clean up the remaining double quotes
data = [[e.replace('"', '') for e in line] for line in data]
Load the data into a dataframe
df = pd.DataFrame(data)
Reformat the dataframe
df.columns = df.iloc[0]  # Set first row as column index
df = df.iloc[1:].reset_index(drop=True)  # Drop first row and reset index
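
Since the real files are around 2.5 GB, reading the whole file into one string may not be practical. The same idea can be applied record by record; the sketch below is only illustrative and relies on the same assumption as the split above, namely that a record ends with a closing quote at the end of a line and the next record starts with an opening quote.

import itertools
import pandas as pd

def iter_records(path, encoding='utf-8'):
    # Yield one logical record at a time, using the same closing-quote /
    # opening-quote boundary as the split('"\n"') above.
    buffer = []
    with open(path, encoding=encoding) as fh:
        for line in fh:
            if buffer and line.startswith('"') and buffer[-1].rstrip().endswith('"'):
                yield ''.join(buffer)
                buffer = []
            buffer.append(line)
    if buffer:
        yield ''.join(buffer)

def parse_record(record):
    # Same cleanup as above: split on '"¬"', then drop the remaining quotes
    return [part.replace('"', '').strip('\n') for part in record.split('"¬"')]

records = (parse_record(r) for r in iter_records('data.txt'))
header = next(records)              # first record holds the column names

# Process in batches so the whole file never sits in memory at once
while True:
    batch = list(itertools.islice(records, 50_000))
    if not batch:
        break
    df = pd.DataFrame(batch, columns=header)
    # ... write each batch to SQL with df.to_sql(...), as in the question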

UPDATE: in one script

data = """"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
  *******************************************************************
  This line also includes the column delimiter within text qualifier
  *******************************************************************
  # !¬!¬!¬|
"
"""

data = data.split('"\n"')
data = [line.split('"¬"') for line in data]
data = [[e.replace('"', '') for e in line] for line in data]
df = pd.DataFrame(data)
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
df
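
A quick sanity check on the result for the sample above: the frame should have two rows and two columns, with each multi-line value kept in a single LINE_TEXT cell.

print(df.shape)                # expect (2, 2)
print(df.loc[0, 'LINE_TEXT'])  # the full multi-line text of the first record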

Upvotes: 1
