Phani
Phani

Reputation: 863

Parse output to dataframe

I need to parse a BOT output and convert to a table format. Below is the link to the raw file and also it looks like this

https://www.dropbox.com/s/ab7sdl74krwltrd/raw_file.txt?dl=0

Invoice: 13613289,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 5: NM1~40~2~GEHC~~~~~46~GEHC,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 32: CLM~03X13613289-7~1968~~~24>B>1~Y~A~Y~Y,
HEAD Segment 40: REF~BDE~229~BROWARD COUNTY   AABC~219:BRD,
HEAD Segment 51: HI~ABK>N471,
HEAD Segment 54: HI~BP>54150,
"HEAD Segment 58: NM1~77~2~OUTPATIENT SURGICAL SERVICES, LTD~~~~~XX~1427012202",
HEAD Segment 60: N4~PLANTATION~FL~333241811,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00920>QZ>P2>X4>>~1968~MJ~65~~~1,
BODY Segment 3: DTP~472~D8~20210406,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828170,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 8: NM1~85~2~JLR MEDICAL GROUP~~~~~XX~1518910520,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 57: PRV~PE~PXC~207L00000X,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QK>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,
Invoice: 13828171,
------------------------,
HEAD Segment 1: ST~ST~~005010X222A1,
HEAD Segment 6: HL~1~~20~1,
HEAD Segment 16: HL~2~1~22~0,
HEAD Segment 41: REF~DDE~6~JLR MEDICAL GROUP ANESTHESIA~200:JLRAA,
HEAD Segment 57: PRV~PE~PXC~367H00000X,
HEAD Segment 70: OI~~~Y~B~~Y,
HEAD Segment 74: NM1~PR~2~TRICARE EAST REGION~~~~~PI~GET93776P,
BODY Segment 1: LX~1,
BODY Segment 2: SV1~HC>00402>QX>X4>>>~1722~MJ~108~~~1>2,
BODY Segment 3: DTP~472~D8~20210426,
BODY Segment 7: REF~6R~1,
,
END,

It has an Invoice and then variable HEAD Segments from 1 to 100 and variable BODY Segments from 1 to 10. My thought is first create an empty pandas dataframe with columns Invoice, HEAD Segment 1, HEAD Segment 2,..... HEAD Segment 100, BODY Segment 1, BODY Segment 2,.....BODY Segment 10 and create a list for each parsed element and then map the list values based on column name of df. Lastly remove any df columns that have all null values. But I am having hard time mapping the content of list value based on df column name and inserting to df. I got it so far.

import pandas as pd

filename = 'Z:\\Projects\\Compliance BOT\\raw_file.txt'

a_list = ['Invoice']  
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])    
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])   
df = pd.DataFrame(columns = a_list)

l = []
with open(filename, 'r', encoding='utf-8-sig') as input:
    for line in input:
        print(line.strip())
        line = line.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-1]
        l.append(line.strip())        
        if '------------------------,' in l: l.remove('------------------------,')
        if ',' in l: l.remove(',')
        # l = [x[:-1] for x in l]
        l1 = [x[:-1] for x in l]
        if line.strip() == ',':
            break  
print(l)

The expected output should be like this

https://www.dropbox.com/scl/fi/j7mpwy84ne37mtyaprwp3/output.xlsx?dl=0&rlkey=lluo9lsc2rgcd4lxepwk0hb1u

Also please let me know if I am complicating this and there is any other simple way to do it?

Upvotes: 0

Views: 387

Answers (1)

Tejas
Tejas

Reputation: 401

Try this code

import pandas as pd

filename = 'Completed_07202021.csv'

a_list = ['Invoice']  
a_list.extend(['HEAD Segment '+ str(x) for x in range(1,100 + 1)])    
a_list.extend(['BODY Segment '+ str(x) for x in range(1,10 + 1)])   
df = pd.DataFrame(columns = a_list)

l = []
l.append(df)
out={}
with open(filename, 'r', encoding='utf-8-sig') as input:
    for line in input:
        print(line)
        if line.strip() == ',':
            continue  
        line = line.strip()
        if line.startswith('"') and line.endswith('",'):
            line = line[1:-1]
        if 'END,' in line:
            l.append(pd.DataFrame.from_dict(out))
            out={}
        data = line.strip().split(':')
        if len(data)!=2:
            continue
        out_str = ':'.join(data[1:])
        out[data[0].strip()]=[out_str[:-1].strip() if out_str.endswith(',') else out_str.strip()]
        
df_out = pd.concat(l)
df_out

let me know if you have any issue with this code

Upvotes: 1

Related Questions