Reputation: 507
I've got multiple .txt
files (c.2.5GB each) which I want to load into a SQL
db. I'm trying to use Pandas
for this, by first reading each file into a dataframe
(maybe in chunks to avoid memory issue) and then loading the dataframe
into a SQL
db.
Each file only has 2 columns (column delimiter is ¬
), however the values in the second column could be split across multiple lines. The column values are qualified/wrapped in double quotes
. When I read the file, pandas
chops these lines into multiple lines every time it encounters newline characters - with data then going into the wrong columns. I want to read everything enclosed in double quotes
as a single text-stream/value. How can I correct this?
PS - sometimes the column delimiter is also contained within the data, but within double quotes
.
Sample Data
"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
*******************************************************************
This line also includes the column delimiter within text qualifier
*******************************************************************
# !¬!¬!¬|
"
My Code
import pandas as pd
import os
from dbconnection import DBConnection
path = r'C:\Sample_load_file.txt'
df = pd.read_csv(path, quotechar='"', engine='python', sep = "\¬")
#Check
df
#Add meta data
df['loaded_by'] = 'xxxx'
df['file_line_number'] = range(2,len(df)+2)
df['load_date'] = pd.datetime.now()
df['source_file'] = path
df['loading_script'] = r'xxxx.ipynb'
#db = DBConnection(server ='XXXX', database='XXXX')
df.to_sql('table_name', db.engine, schema='src', index=False, if_exists='append')
df_from_sql = pd.read_sql('select count(*) from src.table_name', db.engine)
Upvotes: 0
Views: 937
Reputation: 986
Here is a script from which you can start to process your data
Load text filewith open('data.txt') as file:
data = file.read()
Build a list of lines based on ' "\n" ' as custom separator
data = data.split('"\n"')
Split each line based on ' "¬" ' as custom separator
data = [line.split('"¬"') for line in data]
Cleanup remaining double quotes
data = [[e.replace('"', '') for e in line] for line in data]
Load data into a dataframe
df = pd.DataFrame(data)
Reformat dataframe
df.columns = df.iloc[0] # Set first row as column index
df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
data = """"LINE_ID"¬"LINE_TEXT"
"C1111-G00-BC222"¬"this line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
"
"C22-f0-333"¬"2nd row. This line is
split into
multiple lines
% All needs to be read into 1 line
% Currently that's not happening
*******************************************************************
This li
ne also includes the column delimiter within text qualifier
*******************************************************************
# !¬!¬!¬|
"
"""
data = data.split('"\n"')
data = [line.split('"¬"') for line in data]
data = [[e.replace('"', '') for e in line] for line in data]
df = pd.DataFrame(data)
df.columns = df.iloc[0] # Set first row as column index
df = df.iloc[1:].reset_index(drop=True) # Drop first line and reset index
df
Upvotes: 1