Smbat

Reputation: 71

Pandas: Split dataframe with duplicate values into dataframe with unique values

I have a dataframe in Pandas with duplicate values in Col1:

Col1
a
a
b
a
a
b

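For reference, a minimal setup that reproduces this column (the frame and column names match the example above):

```python
import pandas as pd

# Example dataframe with duplicate values in Col1
df = pd.DataFrame({"Col1": ["a", "a", "b", "a", "a", "b"]})
print(df)
```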
What I want to do is split this df into separate dataframes, each containing only unique Col1 values:

DF1:

Col1
a
b

DF2:

Col1
a
b

DF3:

Col1
a

DF4:

Col1
a

Any suggestions?

Upvotes: 2

Views: 301

Answers (1)

mozway

Reputation: 260490

I don't think you can achieve this in a vectorized way.

One possibility is to use a custom function to iterate over the items and keep track of which ones have already been seen. Then use the result as a grouping key to split with groupby:

import pandas as pd

def cum_uniq(s):
    # Assign a group number that increments each time a value repeats
    i = 0
    seen = set()
    out = []
    for x in s:
        if x in seen:   # duplicate found: start a new group
            i += 1
            seen = set()
        out.append(i)
        seen.add(x)
    return pd.Series(out, index=s.index)

out = [g for _,g in df.groupby(cum_uniq(df['Col1']))]

output:

[  Col1
 0    a,
   Col1
 1    a
 2    b,
   Col1
 3    a,
   Col1
 4    a
 5    b]

intermediate:

cum_uniq(df['Col1'])

0    0
1    1
2    1
3    2
4    3
5    3
dtype: int64
If order doesn't matter

Let's add a Col2 to the example:

  Col1  Col2
0    a     0
1    a     1
2    b     2
3    a     3
4    a     4
5    b     5

the previous code gives:

[  Col1  Col2
 0    a     0,
   Col1  Col2
 1    a     1
 2    b     2,
   Col1  Col2
 3    a     3,
   Col1  Col2
 4    a     4
 5    b     5]

If order does not matter, you can vectorize it:

out = [g for _,g in df.groupby(df.groupby('Col1').cumcount())]

output:

[  Col1  Col2
0    a     0
2    b     2,
   Col1  Col2
1    a     1
5    b     5,
   Col1  Col2
3    a     3,
   Col1  Col2
4    a     4]
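To see why this works, the intermediate grouping key can be inspected directly: cumcount numbers each occurrence within its Col1 group, so rows sharing the same counter (i.e. the first "a" and the first "b", the second "a" and the second "b", …) land in the same output frame:

```python
import pandas as pd

df = pd.DataFrame({"Col1": list("aabaab"), "Col2": range(6)})

# Per-group occurrence counter used as the split key
key = df.groupby("Col1").cumcount()
print(key.tolist())  # [0, 1, 0, 2, 3, 1]
```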

Upvotes: 1
