Reputation: 1247
So I have a .txt file as below:
IndentNo|Date|Short_Desc|PurchaseGroup|IndentType
100223|23.09.2020|"6"3 bendend pipe dia 4.5 m"|GRP_1||
100223|14.05.2021|"IM_13#22D FEMALE PLUG|6#|GRP_2||
I am converting the above into a pandas dataframe. So I am using the following:
lines = []
with open(os.path.join(dir_path_input,f),'r',encoding='utf-8') as f_m:
f_r = f_m.readlines()
for l in f_r:
l = l.replace('\n','')
f_r_s = l.split('|')
lines.append(f_r_s)
f_m.close()
df = pd.DataFrame(lines)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
Using the above method, I am getting the datframe as follows:
IndentNo Date Short_Desc PurchaseGroup IndentType
100223 23.09.2020 "6"3 bendend pipe dia 4.5 m" GRP_1
100223 14.05.2021 "IM_13#22D FEMALE PLUG 6# GRP_2 ##<----Wrong entry.
As you can see that in the last row, pandas have inserted data not as per expectation. The last row should be like
IndentNo Date Short_Desc PurchaseGroup IndentType
100223 23.09.2020 "6"3 bendend pipe dia 4.5 m" GRP_1
100223 14.05.2021 "IM_13#22D FEMALE PLUG|6# GRP_2
Is there any better way to read pipe delimited files. Or can the above code snippet be modified to consider the desired output?
Upvotes: 0
Views: 488
Reputation: 18426
That's because the columns in the data also contain |
, and there is no way you can know which is the delimiter and which is the part of the data. If the values containing |
was quoted, then it'd have been easy to parse them, but that's not the case. However, looking at your data, the problem seems to be on Short_Desc
column values only, so, you can try splitting on both direction with maxsplit
paramter:
from io import StringIO
f_m = StringIO('''IndentNo|Date|Short_Desc|PurchaseGroup|IndentType
100223|23.09.2020|"6"3 bendend pipe dia 4.5 m"|GRP_1||
100223|14.05.2021|"IM_13#22D FEMALE PLUG|6#|GRP_2||''')
lines = []
f_r = f_m.readlines()
for l in f_r:
l = l.strip('\n')
f_r_s = l.split('|', maxsplit=2) # maximum two splits from right
f_r_s = f_r_s[:-1]+f_r_s[-1].rsplit('|', maxsplit=3) # maximum 3 splits from left
lines.append(f_r_s)
f_m.close()
df = pd.DataFrame(lines)
df.columns = df.iloc[0]
df = df.drop(df.index[0])
OUTPUT:
0 IndentNo Date Short_Desc PurchaseGroup IndentType NaN
1 100223 23.09.2020 "6"3 bendend pipe dia 4.5 m" GRP_1
2 100223 14.05.2021 "IM_13#22D FEMALE PLUG|6# GRP_2
Notice that it creates a NaN
column because of double delimiter at the end i.e. ||
, you can just later drop this NaN
column, either by df[df.columns[:-1]]
, or by df[df.columns[df.columns.notna()]]
.
df[df.columns[df.columns.notna()]]
0 IndentNo Date Short_Desc PurchaseGroup IndentType
1 100223 23.09.2020 "6"3 bendend pipe dia 4.5 m" GRP_1
2 100223 14.05.2021 "IM_13#22D FEMALE PLUG|6# GRP_2
Upvotes: 1