WolVes

Reputation: 1336

How to merge rows up based on nan index value

I got a weird one today. I am scraping several thousand PDFs with tabula-py and, for whatever reason, the same table (from different PDFs) containing wrapped text is sometimes auto-merged along the table's actual row splits, but on other occasions the resulting pandas DataFrame has many NaN-indexed rows to account for the wrapped text. The ratio of merged to unmerged is roughly 50:1, so it makes sense to automate the merging. Here is an example:

Desired DataFrame:

    Column1      | Column2     | Column3
A  Many Many ...  Lots and ...  This keeps..
B  lots of text.. Many Texts..  Johns and jo..
C   ...
D

Scraped DataFrame:

        Column1      | Column2     | Column3
    A  Many Many       Lots         This keeps Just
   Nan Many Many       and lots     Keeps Going!
   Nan Texts           Nan          Nan
    B  lots of        Many Texts    John and
   Nan text           here          Johnson inc.
    C  ...

In this case the text should be merged upward, so that "Many Many Many Many Texts" all ends up in cell A, Column1, and so on.

I have solved this problem with the solution below, but it feels very dirty: there is a lot of index juggling just to avoid managing the columns by hand and to avoid dropping needed values. Is anyone aware of a better solution?

df = df.reset_index()
df['Unnamed: 0'] = df['Unnamed: 0'].ffill()   # fill the NaN row labels downward
df = df.fillna('')                            # so ' '.join doesn't choke on NaN
df = df.set_index('Unnamed: 0')
df = df.groupby(level=0)[df.columns].transform(lambda x: ' '.join(x))
df = df.reset_index()
df = df.drop_duplicates(keep='first')         # transform repeats the joined row
df = df.set_index('Unnamed: 0')
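For anyone who wants to experiment, here is a minimal sketch that reproduces the wrapped-text frame above (the cell values are shortened assumptions based on the example) and applies the ffill-and-groupby idea in one pass:

```python
import numpy as np
import pandas as pd

# Toy frame mimicking the scraped output: rows with a NaN index
# are continuation lines of the labeled row above them.
df = pd.DataFrame(
    {"Column1": ["Many Many", "Many Many", "Texts"],
     "Column2": ["Lots", "and lots", np.nan],
     "Column3": ["This keeps Just", "Keeps Going!", np.nan]},
    index=["A", np.nan, np.nan],
)

merged = (df.fillna("")
            .groupby(df.index.to_series().ffill())   # carry 'A' into the NaN rows
            .agg(lambda x: " ".join(x).strip()))     # join wrapped pieces per cell

print(merged)
```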

Cheers

Upvotes: 1

Views: 711

Answers (3)

Andy L.

Reputation: 25239

Try this:

df.fillna('').groupby(df.index.to_series().ffill()).agg(' '.join)


Out[1390]:
                              Column1          Column2  \
Unnamed: 0
A           Many Many Many Many Texts   Lots and lots
B                        lots of text  Many Texts here

                                  Column3
Unnamed: 0
A           This keeps Just Keeps Going!
B                   John and Johnson inc.

Upvotes: 2

Quang Hoang

Reputation: 150735

Similar to Ben's idea:

# fill the missing index
df.index = df.index.to_series().ffill()


(df.stack()               # stack to kill the other NaN values
    .groupby(level=(0,1)) # group by (index, column)
    .apply(' '.join)      # join those strings
    .unstack(level=1)     # unstack to get columns back
)

Output:

                     Column1          Column2                       Column3
A  Many Many Many Many Texts    Lots and lots  This keeps Just Keeps Going!
B               lots of text  Many Texts here         John and Johnson inc.

Upvotes: 2

Ben.T

Reputation: 29635

I think you can use ffill on the index directly in the groupby, and then use agg instead of transform (which saves the drop_duplicates step):

# dummy input
df = pd.DataFrame( {'a':list('abcdef'), 'b' : list('123456')}, 
                   index=['A', np.nan, np.nan, 'B', 'C', np.nan])
print (df)
     a  b
A    a  1
NaN  b  2
NaN  c  3
B    d  4
C    e  5
NaN  f  6
#then groupby on the filled index and agg
new_df = (df.fillna('')
            .groupby(pd.Series(df.index).ffill().values)[df.columns]
            .agg(lambda x: ' '.join(x)))
print (new_df)
       a      b
A  a b c  1 2 3
B      d      4
C    e f    5 6

Upvotes: 0
