Reputation: 187
Folks,
I have converted a PDF using tabula-py and, due to the formatting (there are two lines with names in each name cell) I get this:
col1 name doc col4 col5 col6
XXX John Doe XXX XXX XXX XXX
NaN Jane Way YYY NaN NaN NaN
I want to turn the two filled columns on the rows with NaN into new columns on the full rows before (name2 and doc2, maybe). Like this:
col1 name doc col4 col5 col6 name2 doc2
XXX John Doe XXX XXX XXX XXX Jane Way YYY
I thought of separating the filled and unfilled rows into different dataframes, then merging, but I don't think it applies, since the number of columns with NaNs is different of the number of columns without them.
The best way would be, when there is a NaN on the first column, to attribute the filled values to new columns in the previous row.
One thing I tried, that didn't work (Python didn't recognize the new column name as something valid):
df['col1'] = df['col1'].replace(np.NaN, '|')
for i in range(len(df)):
if df['col1'][i] == '|':
df['name2'][i-1] = df['name'][i]
df['doc2'][i-1] = df['doc'][i]
Upvotes: 0
Views: 91
Reputation: 13242
Given:
col1 name doc col4 col5 col6
0 XXX John Doe XXX XXX XXX XXX
1 NaN Jane Way YYY NaN NaN NaN
If you do:
pd.concat([df[0::2], df[1::2]], axis=1).dropna(axis=1, how='all').bfill().dropna()
Output:
col1 name doc col4 col5 col6 name doc
0 XXX John Doe XXX XXX XXX XXX Jane Way YYY
You can rename the columns as needed~
Upvotes: 0