Matter
Matter

Reputation: 1

How to reverse explode dataframe rows

I have been struggling to reverse explode a dataframe that I have initially exploded. I needed to split and exploded it at first so that I can check each value in this dataframe (df1) exist in the othere dataframe (df2). There are columns that contains semicolon separated strings so I needed to split them so I can check their availability on df2 one-by-one. I have seen similar questions that I tried their solutions but the dataframe on these questions do not have empty cells (empty strings).

Example original df1 (Before exploded):

enter image description here

After exploded:

enter image description here

from itertools import product, chain
import pandas as pd

explode_rows = chain.from_iterable(product([Col1], [Col2], Col3.split(;), Col4.split(;), Col5.split(;)) 
            for Col1, Col2, Col3, Col4, Col5 in df1.to_numpy())
df1 = pd.DataFrame(explode_rows, columns = df1.columns)

Now I need to reverse explode this split column values so that I can easily filter df2 with a list ['a','b','c'] at once where never needed.

This what i get when trying to reverse the explode. col3,4,5 will have empty cells something as like in this example. but I always have stuff like [], [''], '','' in this empty cells which i do not want as when I try to filter df2, this things are used and cannot filter with them. So i need to reverse explode and leave empty cell empty as they were. Please not that this empty cells actually have empty strings not Nan.

This is some of the things I tried, but almost give me the same output as below.

df1 = df1.groupby(['Col1','Col2'], as_index=False).agg(list)


df1 = (df1.groupby(['Col1','Col2'])
        .agg(Col3=('Col3', ','.join(),
             Col4=('Col4', ','.join),
             Col5=('Col5', ','.join),)
        .reset_index())
    

df1 = df1.groupby(['Col1','Col2'], as_index=False).agg(list)


df1 = (df1.groupby(['Col1','Col2'])
        .agg(Col3=('Col3', ','.join(),
             Col4=('Col4', ','.join),
             Col5=('Col5', ','.join),)
        .reset_index())
        

df1 = (df1.groupby([['Col1','Col2'])
      .agg({'Col3': lambda x: x.tolist(),
            'Col4': lambda x: x.tolist(),
            'Col3': lambda x: x.tolist(),})
      .reset_index())

I have tried to replace the empty strings with Nan then reverse explode and try to replace them back to empty strings with no luck.

enter image description here Any help will be appreciated, as well as the good practices that I can you for efficiency and clean pythonic way.

Upvotes: 0

Views: 1759

Answers (2)

Arhiliuc Cristina
Arhiliuc Cristina

Reputation: 323

A simple solution would be to continue with:

df1["Col3"] == df1["Col3"].apply(";".join)
df1["Col4"] == df1["Col4"].apply(";".join)
df1["Col5"] == df1["Col5"].apply(";".join)

Upvotes: 0

prof.xed
prof.xed

Reputation: 118

I've created a general purpose solution for this in which it will work on any other dataframe.

I would say you only need the implode part of it.

def explosive_df(df, column: str):
    def implode(data_frame=()):
        inner_df = data_frame if len(data_frame) > 0 else df

        def reduce_columns():
            acc = {}
            for col in inner_df.columns:
                if col == column:
                    acc[col] = lambda x: x.dropna().tolist()
                    continue
                acc[col] = lambda x: x.iloc[0]

            return acc

        return inner_df.groupby(inner_df.index, as_index=False).agg(
            reduce_columns()
        ).reset_index(drop=True)

    return df.explode(column), implode

Usage:

exploded, implode = explosive_df(data_frame, 'column_name')

imploded = implode(exploded)

the imploded part is supposed to look exactly the same as the data_frame if no edits happened before imploding.

Try it out:

print(imploded.sort_index() == data_frame.sort_index())

And it is supposed to explode and implode any dataframe I could think of.

Upvotes: 0

Related Questions