Marcelo Soares
Marcelo Soares

Reputation: 187

Turning incomplete lines into new columns on Pandas

Folks,

I have converted a PDF using tabula-py and, due to the formatting (there are two lines with names in each name cell) I get this:

col1  name      doc  col4  col5  col6
XXX   John Doe  XXX  XXX   XXX   XXX
NaN   Jane Way  YYY  NaN   NaN   NaN

I want to turn the two filled columns on the rows with NaN into new columns on the full rows before (name2 and doc2, maybe). Like this:

col1  name      doc  col4  col5  col6  name2     doc2
XXX   John Doe  XXX  XXX   XXX   XXX   Jane Way  YYY

I thought of separating the filled and unfilled rows into different dataframes, then merging, but I don't think it applies, since the number of columns with NaNs is different of the number of columns without them.

The best way would be, when there is a NaN on the first column, to attribute the filled values to new columns in the previous row.

One thing I tried, that didn't work (Python didn't recognize the new column name as something valid):

df['col1'] = df['col1'].replace(np.NaN, '|')
for i in range(len(df)):
  if df['col1'][i] == '|':
    df['name2'][i-1] = df['name'][i]
    df['doc2'][i-1] = df['doc'][i]

Upvotes: 0

Views: 91

Answers (1)

BeRT2me
BeRT2me

Reputation: 13242

Given:

  col1      name  doc col4 col5 col6
0  XXX  John Doe  XXX  XXX  XXX  XXX
1  NaN  Jane Way  YYY  NaN  NaN  NaN

If you do:

pd.concat([df[0::2], df[1::2]], axis=1).dropna(axis=1, how='all').bfill().dropna()

Output:

  col1      name  doc col4 col5 col6      name  doc
0  XXX  John Doe  XXX  XXX  XXX  XXX  Jane Way  YYY

You can rename the columns as needed~

Upvotes: 0

Related Questions