pythondumb
pythondumb

Reputation: 1247

Ignoring unnecessary pipes while reading a pipe delimited data using pandas

So I have a .txt file as below:

IndentNo|Date|Short_Desc|PurchaseGroup|IndentType
100223|23.09.2020|"6"3 bendend pipe dia 4.5 m"|GRP_1||
100223|14.05.2021|"IM_13#22D FEMALE PLUG|6#|GRP_2||

I am converting the above into a pandas dataframe. So I am using the following:

lines = []
with open(os.path.join(dir_path_input,f),'r',encoding='utf-8') as f_m:
  f_r = f_m.readlines()
    for l in f_r:
      l = l.replace('\n','')
      f_r_s = l.split('|')
      lines.append(f_r_s)
  f_m.close()   
  df = pd.DataFrame(lines)
  df.columns = df.iloc[0]
  df = df.drop(df.index[0])

Using the above method, I am getting the datframe as follows:

IndentNo  Date       Short_Desc                  PurchaseGroup  IndentType
100223  23.09.2020  "6"3 bendend pipe dia 4.5 m"   GRP_1
100223  14.05.2021  "IM_13#22D FEMALE PLUG           6#          GRP_2 ##<----Wrong entry. 

As you can see that in the last row, pandas have inserted data not as per expectation. The last row should be like

IndentNo  Date       Short_Desc                  PurchaseGroup  IndentType
100223  23.09.2020  "6"3 bendend pipe dia 4.5 m"   GRP_1
100223  14.05.2021  "IM_13#22D FEMALE PLUG|6#      GRP_2 

Is there any better way to read pipe delimited files. Or can the above code snippet be modified to consider the desired output?

Upvotes: 0

Views: 488

Answers (1)

ThePyGuy
ThePyGuy

Reputation: 18426

That's because the columns in the data also contain |, and there is no way you can know which is the delimiter and which is the part of the data. If the values containing | was quoted, then it'd have been easy to parse them, but that's not the case. However, looking at your data, the problem seems to be on Short_Desc column values only, so, you can try splitting on both direction with maxsplit paramter:

from io import StringIO
f_m = StringIO('''IndentNo|Date|Short_Desc|PurchaseGroup|IndentType
100223|23.09.2020|"6"3 bendend pipe dia 4.5 m"|GRP_1||
100223|14.05.2021|"IM_13#22D FEMALE PLUG|6#|GRP_2||''')

lines = []
f_r = f_m.readlines()
for l in f_r:
    l = l.strip('\n')
    f_r_s = l.split('|', maxsplit=2) # maximum two splits from right
    f_r_s = f_r_s[:-1]+f_r_s[-1].rsplit('|', maxsplit=3) # maximum 3 splits from left
    lines.append(f_r_s)
f_m.close()
df = pd.DataFrame(lines)
df.columns = df.iloc[0]
df = df.drop(df.index[0])

OUTPUT:

0 IndentNo        Date                    Short_Desc PurchaseGroup IndentType  NaN
1   100223  23.09.2020  "6"3 bendend pipe dia 4.5 m"         GRP_1                 
2   100223  14.05.2021     "IM_13#22D FEMALE PLUG|6#         GRP_2               

Notice that it creates a NaN column because of double delimiter at the end i.e. ||, you can just later drop this NaN column, either by df[df.columns[:-1]], or by df[df.columns[df.columns.notna()]].

df[df.columns[df.columns.notna()]]

0 IndentNo        Date                    Short_Desc PurchaseGroup IndentType
1   100223  23.09.2020  "6"3 bendend pipe dia 4.5 m"         GRP_1           
2   100223  14.05.2021     "IM_13#22D FEMALE PLUG|6#         GRP_2   

Upvotes: 1

Related Questions