MLhacker
MLhacker

Reputation: 1512

Using pandas to read a text file with the headers in multiple rows

I have the following file:

        OBJ             OBJ     OBJ                                                                                           
       DATE        OBJ  CLASS  DATE        OBJ                                       OBJ                         OBJ          
CLASS  ENTERED1    TAG  REPL   ENTERED2    NAME                                      TSTMP                       USERID       
-----  ----------  ---  -----  ----------  ----------------------------------------  --------------------------  --------     
EZ4    2002-02-06  D           2002-02-06  abc random                                2002-02-06-10.05.30.242768  2342342      
000    2001-09-27  D           2001-09-27  none                                      2001-09-27-10.23.31.121606  2343242      
011    1974-07-09              0001-01-01  board COMPENSATON                         1997-01-22-09.10.23.145853  

The columns are defined with underscores and the header texts are broken up across multiple rows. I'd like to read this into a pandas dataframe such that the column texts are intact and the underscores are removed. What can I do to get the dataframe in the format I'm seeking?

Upvotes: 2

Views: 9409

Answers (3)

piRSquared
piRSquared

Reputation: 294488

I'd use pd.read_fwf for fixed width but specify a separator of '\s{2,}'

g = lambda x: '' if x.startswith('Unnamed') else x

pd.read_fwf(
    'file.txt',
    sep='\s{2,}',
    header=[0, 1, 2],
    skiprows=[3]
).rename(columns=g)

                OBJ        OBJ                                                                      
               DATE  OBJ CLASS        DATE                OBJ                                       
  CLASS    ENTERED1  TAG  REPL    ENTERED2               NAME                       TSTMP     USERID
0   EZ4  2002-02-06    D   NaN  2002-02-06         abc random  2002-02-06-10.05.30.242768  2342342.0
1   000  2001-09-27    D   NaN  2001-09-27               none  2001-09-27-10.23.31.121606  2343242.0
2   011  1974-07-09  NaN   NaN  0001-01-01  board COMPENSATON  1997-01-22-09.10.23.145853        NaN

Upvotes: 2

cs95
cs95

Reputation: 402812

It looks like the first two rows and the third are unnecessary, so skip them by specifying a skiprows attribute.

df = pd.read_csv('file.txt', skiprows=(0, 1, 3), sep='\s{2,}', engine='python')

Upvotes: 2

Ken Schumack
Ken Schumack

Reputation: 719

Store the text information before you get to the underscores. When you get to the underscores save the start and end positions of each "column" then go back and concatenate that text as appropriate. Consider writing as a reusable function ... you will need to do something like this again someday :-)

Upvotes: 0

Related Questions