Reputation: 1512
I have the following file:
OBJ OBJ OBJ
DATE OBJ CLASS DATE OBJ OBJ OBJ
CLASS ENTERED1 TAG REPL ENTERED2 NAME TSTMP USERID
----- ---------- --- ----- ---------- ---------------------------------------- -------------------------- --------
EZ4 2002-02-06 D 2002-02-06 abc random 2002-02-06-10.05.30.242768 2342342
000 2001-09-27 D 2001-09-27 none 2001-09-27-10.23.31.121606 2343242
011 1974-07-09 0001-01-01 board COMPENSATON 1997-01-22-09.10.23.145853
The columns are defined with underscores and the header texts are broken up across multiple rows. I'd like to read this into a pandas dataframe such that the column texts are intact and the underscores are removed. What can I do to get the dataframe in the format I'm seeking?
Upvotes: 2
Views: 9409
Reputation: 294488
I'd use pd.read_fwf
for fixed width but specify a separator of '\s{2,}'
g = lambda x: '' if x.startswith('Unnamed') else x
pd.read_fwf(
'file.txt',
sep='\s{2,}',
header=[0, 1, 2],
skiprows=[3]
).rename(columns=g)
OBJ OBJ
DATE OBJ CLASS DATE OBJ
CLASS ENTERED1 TAG REPL ENTERED2 NAME TSTMP USERID
0 EZ4 2002-02-06 D NaN 2002-02-06 abc random 2002-02-06-10.05.30.242768 2342342.0
1 000 2001-09-27 D NaN 2001-09-27 none 2001-09-27-10.23.31.121606 2343242.0
2 011 1974-07-09 NaN NaN 0001-01-01 board COMPENSATON 1997-01-22-09.10.23.145853 NaN
Upvotes: 2
Reputation: 402812
It looks like the first two rows and the third are unnecessary, so skip them by specifying a skiprows
attribute.
df = pd.read_csv('file.txt', skiprows=(0, 1, 3), sep='\s{2,}', engine='python')
Upvotes: 2
Reputation: 719
Store the text information before you get to the underscores. When you get to the underscores save the start and end positions of each "column" then go back and concatenate that text as appropriate. Consider writing as a reusable function ... you will need to do something like this again someday :-)
Upvotes: 0