Reputation: 15
I want to create a subset where data with 3 or more identical columns are dropped, but the original is kept. I have 4 columns ['col1', 'col2', 'col3', 'col4'], and wrote the next code:
df1 = data_between_all.drop_duplicates(['col1', 'col2', 'col3']).copy()
df2 = data_between_all.drop_duplicates(['col1', 'col3', 'col4']).copy()
df3 = data_between_all.drop_duplicates(['col1', 'col2', 'col4']).copy()
df4 = data_between_all.drop_duplicates(['col2', 'col3', 'col4']).copy()
data_without_duplicates = df1.append([df2, df3, df4]).drop_duplicates()
Is there a most sofysticated way to do this?
Upvotes: 1
Views: 176
Reputation: 120439
You can use a simple loop to exclude one column at each iteration:
Setup a Minimal Reproducible Example
df = pd.DataFrame(np.random.randint(0, 2, (10, 4)),
columns=['col1', 'col2', 'col3', 'col4'])
print(df)
Sample data:
>>> df
col1 col2 col3 col4
0 1 1 1 1
1 1 1 1 1
2 0 0 0 1
3 0 1 1 0
4 1 0 0 0
5 1 1 0 1
6 0 0 1 0
7 0 1 0 0
8 0 0 0 0
9 1 0 1 0
data = []
for col in df.columns:
data.append(df.drop_duplicates(df.columns[df.columns != col]))
out = pd.concat(data).drop_duplicates()
Output result:
>>> out
col1 col2 col3 col4
0 1 1 1 1
2 0 0 0 1
3 0 1 1 0
4 1 0 0 0
5 1 1 0 1
6 0 0 1 0
7 0 1 0 0
9 1 0 1 0
Upvotes: 1
Reputation: 93181
You can use a loop instead of spelling out every combination:
# Generate random data
cols = [f'col{i}' for i in range(1,5)]
df = pd.DataFrame(np.random.randint(1, 10, (25, 4)), columns=cols)
# Process the data
from itertools import combinations
result = []
for combo in combinations(cols, 3):
result.append(df.drop_duplicates(list(combo)))
result = pd.concat(result)
Upvotes: 0