Reputation: 863
I need to parse a BOT output and convert to a table format. Below is the link to the raw file and also it looks like this
https://www.dropbox.com/s/ab7sdl74krwltrd/raw_file.txt?dl=0
Invoice: 13613289,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 5: NM1~40~2~GEHC~~~~~46~GEHC,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 32: CLM~03X13613289-7~1968~~~24>B>1~Y~A~Y~Y,
HEAD Segment 40: REF~BDE~229~BROWARD COUNTY AABC~219:BRD,
HEAD Segment 51: HI~ABK>N471,
HEAD Segment 54: HI~BP>54150,
"HEAD Segment 58: NM1~77~2~OUTPATIENT SURGICAL SERVICES, LTD~~~~~XX~1427012202",
HEAD Segment 60: N4~PLANTATION~FL~333241811,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00920>QZ>P2>X4>>~1968~MJ~65~~~1,
BODY Segment 3: DTP~472~D8~20210406,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828170,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 8: NM1~85~2~JLR MEDICAL GROUP~~~~~XX~1518910520,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 57: PRV~PE~PXC~207L00000X,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QK>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828171,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 6: HL~1~~20~1,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 41: REF~DDE~6~JLR MEDICAL GROUP ANESTHESIA~200:JLRAA,
HEAD Segment 57: PRV~PE~PXC~367H00000X,
HEAD Segment 70: OI~~~Y~B~~Y,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QX>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,
It has an Invoice and then variable HEAD Segments from 1 to 100 and variable BODY Segments from 1 to 10. My thought is first create an empty pandas dataframe with columns Invoice, HEAD Segment 1, HEAD Segment 2,..... HEAD Segment 100, BODY Segment 1, BODY Segment 2,.....BODY Segment 10 and create a list for each parsed element and then map the list values based on column name of df. Lastly remove any df columns that have all null values. But I am having hard time mapping the content of list value based on df column name and inserting to df. I got it so far.
import pandas as pd
filename = 'Z:\\Projects\\Compliance BOT\\raw_file.txt'
a_list = ['Invoice']
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])
df = pd.DataFrame(columns = a_list)
l = []
with open(filename, 'r', encoding='utf-8-sig') as input:
for line in input:
print(line.strip())
line = line.strip()
if line.startswith('"') and line.endswith('",'):
line = line[1:-1]
l.append(line.strip())
if '------------------------,' in l: l.remove('------------------------,')
if ',' in l: l.remove(',')
# l = [x[:-1] for x in l]
l1 = [x[:-1] for x in l]
if line.strip() == ',':
break
print(l)
The expected output should be like this
Also please let me know if I am complicating this and there is any other simple way to do it?
Upvotes: 0
Views: 387
Reputation: 401
Try this code
import pandas as pd
filename = 'Completed_07202021.csv'
a_list = ['Invoice']
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])
df = pd.DataFrame(columns = a_list)
l = []
l.append(df)
out={}
with open(filename, 'r', encoding='utf-8-sig') as input:
for line in input:
print(line)
if line.strip() == ',':
continue
line = line.strip()
if line.startswith('"') and line.endswith('",'):
line = line[1:-1]
if 'END,' in line:
l.append(pd.DataFrame.from_dict(out))
out={}
data = line.strip().split(':')
if len(data)!=2:
continue
out_str = ':'.join(data[1:])
out[data[0].strip()]=[out_str[:-1].strip() if out_str.endswith(',') else out_str.strip()]
df_out = pd.concat(l)
df_out
let me know if you have any issue with this code
Upvotes: 1