Reputation: 656
I have this kind of data that is driving me crazy. The source is a PDF file that I read with tabula to extract tables. The problem is that some rows of the table are multiline in the document, and this is how my output looks.
> sub_df.iloc[85:95]
1 Acronym Meaning
86 ABC Aaaaa Bbbbb Ccccc
87 CDE Ccccc Ddddd Eeeee
88 NaN Fffff Ggggg
89 FGH NaN
90 NaN Hhhhh
91 IJK Iiiii Jjjjj Kkkkk
92 LMN Lllll Mmmmm Nnnnn
93 OPQ Ooooo Ppppp Qqqqq
94 RST Rrrrr Sssss Ttttt
95 UVZ Uuuuu Vvvvv Zzzzz
What I would like to get is something like this.
> sub_df.iloc[85:95]
1 Acronym Meaning
86 ABC Aaaaa Bbbbb Ccccc
87 CDE Ccccc Ddddd Eeeee
88 FGH Fffff Ggggg Hhhhh
91 IJK Iiiii Jjjjj Kkkkk
92 LMN Lllll Mmmmm Nnnnn
93 OPQ Ooooo Ppppp Qqqqq
94 RST Rrrrr Sssss Ttttt
95 UVZ Uuuuu Vvvvv Zzzzz
I am struggling with combine_first like this:
sub_df.iloc[[88]].combine_first(sub_df.iloc[[87]])
but the result is not what I expect.
A solution with groupby would also be appreciated.
Note: the index is not important and can be reset. I just want to merge the consecutive rows where one of the columns is NaN and then dump the result to CSV, so I don't need them.
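For reference, the frame above can be rebuilt like this (a minimal sketch, assuming tabula produced columns named '1', 'Acronym' and 'Meaning' with a default integer index):
import numpy as np
import pandas as pd

# Minimal reproduction of the extracted table, including the FGH record split over three rows
df = pd.DataFrame({
    '1': range(86, 96),
    'Acronym': ['ABC', 'CDE', np.nan, 'FGH', np.nan, 'IJK', 'LMN', 'OPQ', 'RST', 'UVZ'],
    'Meaning': ['Aaaaa Bbbbb Ccccc', 'Ccccc Ddddd Eeeee', 'Fffff Ggggg', np.nan, 'Hhhhh',
                'Iiiii Jjjjj Kkkkk', 'Lllll Mmmmm Nnnnn', 'Ooooo Ppppp Qqqqq',
                'Rrrrr Sssss Ttttt', 'Uuuuu Vvvvv Zzzzz'],
})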
Upvotes: 4
Views: 595
Reputation: 153460
Let's try this:
# Fill the Meaning down, then drop rows that only repeat a Meaning and have no Acronym
df = df.assign(Meaning=df['Meaning'].ffill())
mask = ~(df.Meaning.duplicated(keep='last') & df.Acronym.isnull())
df = df[mask]
# Fill the Acronym down so the pieces of a split record share one key, then re-join the words
df = df.assign(Acronym=df['Acronym'].ffill())
df_out = (df.groupby('Acronym')
            .apply(lambda x: ' '.join(x['Meaning'].str.split(r'\s').sum()))
            .reset_index())
Output:
Acronym 0
0 ABC Aaaaa Bbbbb Ccccc
1 CDE Ccccc Ddddd Eeeee
2 FGH Fffff Ggggg Hhhhh
3 IJK Iiiii Jjjjj Kkkkk
4 LMN Lllll Mmmmm Nnnnn
5 OPQ Ooooo Ppppp Qqqqq
6 RST Rrrrr Sssss Ttttt
7 UVZ Uuuuu Vvvvv Zzzzz
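Since the goal is to dump the result to CSV, a possible follow-up (a sketch; the file name is just a placeholder):
df_out.columns = ['Acronym', 'Meaning']     # the groupby/apply result column comes out named 0
df_out.to_csv('acronyms.csv', index=False)  # hypothetical output path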
Upvotes: 2
Reputation: 18647
Here is an approach using numpy.where to do a conditional fill: a missing Acronym is back-filled from the acronym row below it, unless the previous row's Meaning is also missing, in which case the row continues the record above and is forward-filled.
import numpy as np

# Forward-fill where the row continues the record above, otherwise back-fill
df['Acronym'] = np.where(df[['Acronym']].assign(Meaning=df.Meaning.shift()).isna().all(1),
                         df.Acronym.ffill(),
                         df.Acronym.bfill())

# Re-join the Meaning pieces per Acronym, then attach them back to a single row each
clean_meaning = df.dropna().groupby('Acronym')['Meaning'].apply(lambda x: ' '.join(x)).to_frame()

df_new = (df[['1', 'Acronym']]
          .drop_duplicates(subset=['Acronym'])
          .merge(clean_meaning, left_on='Acronym', right_index=True))
[out]
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
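Since the index is not needed, it can be reset before export (a sketch; the file name is a placeholder):
df_new.reset_index(drop=True).to_csv('acronyms.csv', index=False)  # hypothetical output path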
Upvotes: 2
Reputation: 323226
This is a pretty tricky question; neither ffill nor bfill alone will do the job here.
import pandas as pd

s1 = ~(df.Acronym.isnull() | df.Meaning.isnull())  # True for complete (good) rows
s = s1.astype(int).diff().ne(0).cumsum()           # one group id per run of consecutive rows
bad = df[~s1]    # only the bad rows need to change
good = df[s1]    # the good rows are kept unchanged
bad = bad.groupby(s.loc[bad.index]).agg(
    {'1': 'first', 'Acronym': 'first', 'Meaning': lambda x: ' '.join(x[x.notnull()])})
pd.concat([good, bad]).sort_index()
Out[107]:
1 Acronym Meaning
0 86 ABC Aaaaa Bbbbb Ccccc
1 87 CDE Ccccc Ddddd Eeeee
2 88 FGH Fffff Ggggg Hhhhh
5 91 IJK Iiiii Jjjjj Kkkkk
6 92 LMN Lllll Mmmmm Nnnnn
7 93 OPQ Ooooo Ppppp Qqqqq
8 94 RST Rrrrr Sssss Ttttt
9 95 UVZ Uuuuu Vvvvv Zzzzz
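The second line is the usual diff/cumsum idiom for numbering consecutive runs; a standalone illustration (not tied to the data above):
import pandas as pd

flags = pd.Series([True, True, False, False, False, True])
# Each run of consecutive equal values gets its own group id
print(flags.astype(int).diff().ne(0).cumsum().tolist())  # [1, 1, 2, 2, 2, 3]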
Upvotes: 2