Pearl

Reputation: 452

How to merge dataframe rows based on empty cells in a column

If I have the following dataframe:

Index Col1 Col2 Col3
1     10   x    40
2          y    50
3          z    60
4     20   a    30

I would like to merge rows that have a blank Col1 with the previous row that is not blank in Col1.

Expected output:

Index Col1 Col2  Col3
1     10   x,y,z 40,50,60
4     20   a     30

Is this possible?

Thanks

Upvotes: 2

Views: 1591

Answers (3)

Pearl

Reputation: 452

The posted answers solved my 'dumbed down' example, but I could not get them to work with my real-world dataset. Prior to this one I had posted another question about fixing the problem while pulling the data in, rather than manipulating it after the fact, and the answer given there solved this question as well.

It is here

The answer is:

last_valid = None

check_cols = ['Col1']  # if you only need to check a subset of cols for validity, do it here

df = df.astype(str)         # convert all columns to strings, as I have to combine numbers in the same cell
df = df.replace('nan', '')  # turn the 'nan' strings created above back into blank strings

for i, s in df.iterrows():  # this is slow, but probably necessary in this case
    # If the row is valid in all the check columns, keep it as a reference
    # in case the following rows are not
    if all(s[check_cols] != ''):
        lvi, last_valid = i, s
        # store both the index label and the series so we can go back and replace the row
        continue
    else:  # here is the critical part
        extra_vals = s[s != '']  # find the cells in this row that have actual values
        for col in extra_vals.index:
            # append with a comma separator (use whatever separator you wish;
            # collecting into a list was causing issues)
            last_valid[col] = last_valid[col] + "," + extra_vals[col]
        # replace the reference row in the dataframe
        # (loc, not iloc: lvi is an index label from iterrows, not a position)
        df.loc[lvi, :] = last_valid

# drop the extra rows that were merged upward:
df = df[df['Col1'] != ''].reset_index(drop=True)
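
For anyone trying to reproduce this, a minimal setup on the question's sample data (assuming the blanks come in as empty strings, which is how my real data behaved):

import pandas as pd

# sample frame from the question, with the blanks read in as empty strings
df = pd.DataFrame({'Col1': ['10', '', '', '20'],
                   'Col2': ['x', 'y', 'z', 'a'],
                   'Col3': ['40', '50', '60', '30']})

# ...run the loop above, then:
print(df)
#   Col1   Col2      Col3
# 0   10  x,y,z  40,50,60
# 1   20      a        30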

Upvotes: 0

BENY

Reputation: 323396

We can do

out = (df.drop(labels='Col1', axis=1)
         .astype(str)
         .groupby(df['Col1'].mask(df['Col1'] == '').ffill())
         .agg(','.join)
         .reset_index())
Out[85]: 
   Col1   Col2      Col3
0  10.0  x,y,z  40,50,60
1  20.0      a        30
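
The grouper is doing the work here: mask converts any empty strings in Col1 to NaN, and ffill carries the last non-blank value down so the blank rows fall into the previous group. A quick sketch, assuming Col1 is numeric with NaN for the blanks, as in the output above:

df['Col1'].mask(df['Col1'] == '').ffill()
# 0    10.0
# 1    10.0
# 2    10.0
# 3    20.0
# Name: Col1, dtype: float64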

Upvotes: 2

Umar.H

Reputation: 23099

Quite possible. What you need to do is create a unique group value that increments at every value that is not null.

in one go

df.drop('Col1',axis=1).groupby((df['Col1'].isna()==False).cumsum()).agg(list)
# p.s. if you really want strings, use
# df.drop('Col1',axis=1).astype(str).groupby((df['Col1'].isna()==False
#                        ).cumsum()).agg(','.join)


           Col2          Col3
Col1
1     [x, y, z]  [40, 50, 60]
2           [a]          [30]

the key here is the condition:

df[['Col1']].assign(con=df['Col1'].isna()==False)

   Col1    con #for condition
0  10.0   True <-- first group
1   NaN  False
2   NaN  False
3  20.0   True <-- second group

now, taking the cumulative sum of that condition gives you your grouper object.

df[['Col1']].assign(con=(df['Col1'].isna()==False).cumsum())


   Col1  con
0  10.0    1
1   NaN    1
2   NaN    1
3  20.0    2
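
putting it together with the string version from the comment above gives the joined output from the question (a sketch, assuming Col2/Col3 come in as shown):

df.drop('Col1', axis=1).astype(str).groupby(
    (df['Col1'].isna() == False).cumsum()).agg(','.join)

#        Col2      Col3
# Col1
# 1     x,y,z  40,50,60
# 2         a        30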

Upvotes: 2
