Mohit
Mohit

Reputation: 137

Need to split one Columns data into different columns in Pandas Data frame

I have a Csv file which has merged columns in it and pandas dataframe is showing it in same way. need to split column as required out put

Csv Sample

my current input from csv file is:

"Date InformationIdNo.","Date out","Dr.","Cr."
"01 FEB Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0","250 - PQRT14225","","",""
"01 FEB Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0","20000 - DEFG12","","",""
"01 FEB TransferFT22032FQWE3","01 FEB 21","0","7,436.93","","","",""

also need to Index-1 with information column of 0th index

Required output needed :

|   |  Date     |  Information                  | IdNo.           | Date out  | Dr.       | Cr.       | Balance    |
|---|-----------|-------------------------------|-----------------|-----------|-----------|-----------|------------|
| 0 | 01 FEB 21 | Mart Purchase 250 - PQRT14225 |  MATRSC203255H  | 30 DEC 21 | -3,535.61 |  0        | -3,978.61  |
| 1 | 01 FEB 21 | Cash Sales 20000 - DEFG1220   |  MATRSC203255H  | 30 DEC 21 | -34.33    |  0        | -3,944.29  |
| 2 | 01 FEB 21 | Transfer                      |  FT22032FQWE3   | 01 FEB 21 | 0         |  7,426.93 |  3,482.65  |

Input CSV file Screenshot

CSV file when opened in notepad

Output required

Upvotes: 2

Views: 795

Answers (1)

gremur
gremur

Reputation: 1690

I believe the idea of the code below is quite clear. First we need to correct data in the csv file to valid csv (comma separated value) format. After that we can create dataframe.

'data.csv' file content

"Date InformationIdNo."," Date out ","Dr."," Cr."
"01 FEB 21 Mart Purchase MATRSC203255H","30 DEC 21","-3,535.61","0"
"250 - PQRT14225","","",""
"01 FEB 21 Cash Sales CCTR220307AXCDV","30 DEC 21","-34.33","0"
"20000 - DEFG12","","",""
"01 FEB 21 TransferFT22032FQWE3"," 01 FEB 21","0","7,426.93"
"","","",""
"","","",""
"","","",""
"","","",""
"","","",""

Possible (quick) solution is the following:

#pip install pandas

import re
import pandas as pd
from io import StringIO

with open("data.csv", "r", encoding='utf-8') as file:
    raw_data = file.read()
    
# convert txt to valid csv (comma separated values) format
raw_data = raw_data.replace(' - ', '-')
raw_data = raw_data.replace('Date InformationIdNo.', 'Date","Information","IdNo.')
raw_data = raw_data.replace('" Cr."', '"Cr","Information_add"')
raw_data = re.sub('(\d{2} [A-Z]{3} \d{2})', r'\1","', raw_data)
raw_data = re.sub('\n"([A-Z0-9-]+)","","",""\n', r',"\1"\n', raw_data)
raw_data = re.sub(r',""{2,}', '', raw_data)
raw_data = re.sub('([A-Z0-9]{3,}",")', r'","\1","', raw_data)
raw_data = re.sub(',""+', r'', raw_data)
raw_data = re.sub('\n""+', r'', raw_data)

# # create dataframe and replace NaN with ""
df = pd.read_csv(StringIO(raw_data), sep=",")
df.fillna("", inplace=True)

# merge columns and drop temporary column
df['Information'] = df['Information'] + df['Information_add']
df.drop(['Information_add'], axis=1, inplace=True)

# cleanup column headers
df.columns = [name.strip() for name in df.columns]

# convert date to datetime format
df['Date'] = pd.to_datetime(df['Date'].str.title().str.strip(), format="%d %b %y", dayfirst=True)
df['Date out'] = pd.to_datetime(df['Date out'].str.title().str.strip(), format="%d %b %y", dayfirst=True)

df

Returns

enter image description here

Upvotes: 2

Related Questions