Pandas dataframe from a messy list of lists

Question

I have a very ugly data import coming from a client, in a .net file. I have managed to transform this into a list of lists. An example of one list is given:

['* Table: Movement one\n',
 '* \n',
 '$TSYS:CODE;NAME;TYPE;PCU\n',
 'A;Car;PrT;1.000\n',
 'Air_Bus;Airport Bus;PuT;1.000\n',
 'B;Bus;PuT;1.000\n',
 'C;Company Bus;PrT;2.000\n',
 'CB;City Bus;PuT;1.000\n',
 'FE;Ferry;PuT;1.000\n',
 'GV1;2-Axle Rigid Goods Vehicle;PrT;1.500\n',
 'GV2;3/4 Axle Rigid Goods Vehicle;PrT;2.000\n',
 'GV3;3/4 Axle Artic Goods Vehicle;PrT;3.000\n',
 'GV4;5+ Axle Artic Goods Vehicle;PrT;3.000\n',
 'IB;Intercity Bus;PuT;1.000\n',
 'IN;Industry Bus;PuT;1.000\n',
 'Loc;Local Bus;PuT;1.000\n',
 'LR;Light Rail;PuT;1.000\n',
 'R;Rail;PuT;1.000\n',
 'S;School Bus;PrT;2.000\n',
 'T;Taxi;PrT;1.100\n',
 'TR;Tram;PuT;1.000\n',
 'W;Walk;PrT;0.000\n',
 'WB;WaterBus;PuT;1.000\n',
 'WT;Water Taxi;PuT;1.000\n',
 'W_PuT;Walk_PuT;PuTWalk;1.000\n',
 '\n',
 '* \n']

I wish to load this into a pandas dataframe.

The top two lines and bottom two lines may be discarded. Each list contains string records with ; separators. I know that read_csv has a separator option, but this won't work here as I am not reading from a file at this point. The column headings are also awkward: the leading $TSYS: part must be discarded and the remaining fields used as column names. I can use strip to remove the \n in each record.

I have tried to simply load as a dataframe:

results_df = pd.DataFrame(results[2:-2])
print(results_df.head())

                                 0
0       $TSYS:CODE;NAME;TYPE;PCU\n
1                A;Car;PrT;1.000\n
3  Air_Bus;Airport Bus;PuT;1.000\n
4                B;Bus;PuT;1.000\n

Since I have many of these lists, how do I programmatically take the third line, remove the first string, and create column headers from the rest? How do I correctly split each subsequent record on the ; separator?

jezrael · Accepted Answer

You can use a list comprehension that strips each record and splits it on ;, then take the column names from the third line in the same way:

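import pandas as pd

# records: skip the two comment lines and the header, drop the two trailing lines
results_df = pd.DataFrame([x.strip().split(';') for x in results[3:-2]])
# header: the third line, stripped and split on ';'
results_df.columns = results[2].strip().split(';')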

print(results_df)

   $TSYS:CODE                          NAME     TYPE    PCU
0           A                           Car      PrT  1.000
1     Air_Bus                   Airport Bus      PuT  1.000
2           B                           Bus      PuT  1.000
3           C                   Company Bus      PrT  2.000
4          CB                      City Bus      PuT  1.000
5          FE                         Ferry      PuT  1.000
6         GV1    2-Axle Rigid Goods Vehicle      PrT  1.500
7         GV2  3/4 Axle Rigid Goods Vehicle      PrT  2.000
8         GV3  3/4 Axle Artic Goods Vehicle      PrT  3.000
9         GV4   5+ Axle Artic Goods Vehicle      PrT  3.000
10         IB                 Intercity Bus      PuT  1.000
11         IN                  Industry Bus      PuT  1.000
12        Loc                     Local Bus      PuT  1.000
13         LR                    Light Rail      PuT  1.000
14          R                          Rail      PuT  1.000
15          S                    School Bus      PrT  2.000
16          T                          Taxi      PrT  1.100
17         TR                          Tram      PuT  1.000
18          W                          Walk      PrT  0.000
19         WB                      WaterBus      PuT  1.000
20         WT                    Water Taxi      PuT  1.000
21      W_PuT                      Walk_PuT  PuTWalk  1.000
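
The first column above is still named $TSYS:CODE, and the question mentions having many such lists. A minimal sketch of one way to handle both, assuming every list follows the same layout as the example (two comment lines, one prefixed header line, the records, two trailing lines). The helper name table_to_frame and the shortened sample list are illustrative, not part of the original answer:

```python
import pandas as pd


def table_to_frame(lines):
    """Build a DataFrame from one raw table (hypothetical helper).

    Assumes the layout from the question: two comment lines, a header
    line such as '$TSYS:CODE;NAME;...', the records, two trailing lines.
    """
    header_text = lines[2].strip()
    # Drop everything up to and including the first ':' if there is one.
    _, sep, rest = header_text.partition(':')
    header = (rest if sep else header_text).split(';')
    # Strip the trailing newline from each record, then split on ';'.
    rows = [line.strip().split(';') for line in lines[3:-2]]
    return pd.DataFrame(rows, columns=header)


# Shortened sample in the same shape as the list in the question.
sample = [
    '* Table: Movement one\n',
    '* \n',
    '$TSYS:CODE;NAME;TYPE;PCU\n',
    'A;Car;PrT;1.000\n',
    'T;Taxi;PrT;1.100\n',
    '\n',
    '* \n',
]

df = table_to_frame(sample)
# Every value is still a string; convert PCU explicitly if numbers are needed.
df['PCU'] = pd.to_numeric(df['PCU'])
```

With a collection of such lists, something like [table_to_frame(t) for t in all_tables] would give one frame per table (all_tables being whatever holds the lists). The fixed slice positions are the fragile part: a table with a different number of leading or trailing lines would need its own handling.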

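As an alternative, read_csv can be used even though the data is not in a file, because it accepts any file-like object. The sketch below joins the header and record lines into an in-memory buffer with io.StringIO. It assumes the same layout as the example in the question, and the shortened sample list is illustrative only:

```python
import io

import pandas as pd

# Shortened sample in the same shape as the list in the question.
sample = [
    '* Table: Movement one\n',
    '* \n',
    '$TSYS:CODE;NAME;TYPE;PCU\n',
    'A;Car;PrT;1.000\n',
    'T;Taxi;PrT;1.100\n',
    '\n',
    '* \n',
]

# Each kept line already ends with '\n', so a plain join yields CSV-like text.
buffer = io.StringIO(''.join(sample[2:-2]))
df = pd.read_csv(buffer, sep=';')

# Remove the '$TSYS:' prefix; names without a ':' are left unchanged.
df = df.rename(columns=lambda name: name.split(':', 1)[-1])
```

Unlike the list-comprehension approach, read_csv infers column types, so PCU comes back numeric here. That inference also applies to the code column: a code that happens to be spelled NA would be read as a missing value by default, so passing dtype=str or keep_default_na=False may be safer if such codes can occur.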