Nachon
Nachon

Reputation: 15

Drop duplicates based on multiple columns

I want to create a subset where data with 3 or more identical columns are dropped, but the original is kept. I have 4 columns ['col1', 'col2', 'col3', 'col4'], and wrote the next code:

df1 = data_between_all.drop_duplicates(['col1', 'col2', 'col3']).copy()
df2 = data_between_all.drop_duplicates(['col1', 'col3', 'col4']).copy()
df3 = data_between_all.drop_duplicates(['col1', 'col2', 'col4']).copy()
df4 = data_between_all.drop_duplicates(['col2', 'col3', 'col4']).copy()

data_without_duplicates = df1.append([df2, df3, df4]).drop_duplicates()

Is there a most sofysticated way to do this?

Upvotes: 1

Views: 176

Answers (2)

Corralien
Corralien

Reputation: 120439

You can use a simple loop to exclude one column at each iteration:

Setup a Minimal Reproducible Example

df = pd.DataFrame(np.random.randint(0, 2, (10, 4)),
                  columns=['col1', 'col2', 'col3', 'col4'])
print(df)

Sample data:

>>> df
   col1  col2  col3  col4
0     1     1     1     1
1     1     1     1     1
2     0     0     0     1
3     0     1     1     0
4     1     0     0     0
5     1     1     0     1
6     0     0     1     0
7     0     1     0     0
8     0     0     0     0
9     1     0     1     0
data = []
for col in df.columns:
    data.append(df.drop_duplicates(df.columns[df.columns != col]))
out = pd.concat(data).drop_duplicates()

Output result:

>>> out
   col1  col2  col3  col4
0     1     1     1     1
2     0     0     0     1
3     0     1     1     0
4     1     0     0     0
5     1     1     0     1
6     0     0     1     0
7     0     1     0     0
9     1     0     1     0

Upvotes: 1

Code Different
Code Different

Reputation: 93181

You can use a loop instead of spelling out every combination:

# Generate random data
cols = [f'col{i}' for i in range(1,5)]
df = pd.DataFrame(np.random.randint(1, 10, (25, 4)), columns=cols)

# Process the data
from itertools import combinations

result = []
for combo in combinations(cols, 3):
    result.append(df.drop_duplicates(list(combo)))

result = pd.concat(result)

Upvotes: 0

Related Questions