Reputation: 564
Now there are a lot of similar questions, but most of them answer how to delete the duplicate columns. However, I want to know how I can make a list of tuples where each tuple contains the column names of duplicate columns. I am assuming that each column has a unique name. Just to further illustrate my question:
df = pd.DataFrame({'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
                   'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
                   'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]},
                  index=['a1', 'a2', 'a3', 'a4', 'a5'])
then I want the output:
[('A', 'C'), ('B', 'D')]
And if you are feeling great today, then also extend the same question to rows: how to get a list of tuples where each tuple contains duplicate rows.
Upvotes: 17
Views: 3528
Reputation: 450
Not using pandas, just pure Python:
from collections import defaultdict

data = {'A': [1, 2, 3, 4, 5], 'B': [2, 4, 2, 1, 9],
        'C': [1, 2, 3, 4, 5], 'D': [2, 4, 2, 1, 9],
        'E': [3, 4, 2, 1, 2], 'F': [1, 1, 1, 1, 1]}

deduplicate = defaultdict(list)
for key, items in data.items():
    # cast to tuple because tuples are hashable, lists are not
    deduplicate[tuple(items)].append(key)

duplicates = []
for vector, letters in deduplicate.items():
    if len(letters) > 1:
        duplicates.append(letters)
print(duplicates)
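On the question's data this should print [['A', 'C'], ['B', 'D']] (on Python 3.7+, where dicts keep insertion order). If you want tuples exactly as asked, a one-line conversion does it:
print([tuple(letters) for letters in duplicates])
# [('A', 'C'), ('B', 'D')]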
Using pandas:
import pandas
df = pandas.DataFrame(data)

dedup2 = defaultdict(list)
for key in df.columns:
    dedup2[tuple(df[key])].append(key)

duplicates = []
for vector, letters in dedup2.items():
    if len(letters) > 1:
        duplicates.append(letters)
print(duplicates)
Not really nice, but it may be quicker, since everything is done in a single pass over the data:
dedup2 = defaultdict(list)
duplicates = {}
for key in df.columns:
    astup = tuple(df[key])
    duplic = dedup2[astup]
    duplic.append(key)
    if len(duplic) > 1:
        duplicates[astup] = duplic
duplicates = duplicates.values()
print(duplicates)
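Note that dict.values() returns a view object in Python 3, so this final print shows dict_values([['A', 'C'], ['B', 'D']]); wrap it in list() if a plain list is wanted:
print(list(duplicates))
# [['A', 'C'], ['B', 'D']]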
Upvotes: 4
Reputation: 152657
This is another approach that uses pure Python:
from operator import itemgetter
from itertools import groupby
def myfunc(df):
    # Convert the dataframe to a list of lists, pairing each column name
    # with that column's values
    zipped = zip(df.columns, df.values.T.tolist())
    # Sort the columns (so identical ones become adjacent and can be grouped)
    zipped_sorted = sorted(zipped, key=itemgetter(1))
    # Placeholder for the result
    res = []
    res_append = res.append
    # Find duplicated columns using itertools.groupby
    for k, grp in groupby(zipped_sorted, itemgetter(1)):
        grp = list(grp)
        if len(grp) > 1:
            res_append(tuple(map(itemgetter(0), grp)))
    return res
I included some inline comments that illustrate how it works, but basically this just sorts the input so identical columns are adjacent and then it groups them.
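For the question's sample frame, this should return the pairs as tuples:
myfunc(df)
# [('A', 'C'), ('B', 'D')]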
I did some superficial timings using Divakar's timing setup and got the following:
%timeit group_duplicate_cols(df)
391 ms ± 25.8 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit myfunc(df)
572 ms ± 4.36 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So it seems to be only about 2 times slower than the NumPy approach, which is actually amazing.
Upvotes: 2
Reputation: 221574
Here's one NumPy approach -
import numpy as np

def group_duplicate_cols(df):
    a = df.values
    sidx = np.lexsort(a)
    b = a[:,sidx]
    m = np.concatenate(( [False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2], idx[1::2]+1)]
Sample runs -
In [100]: df
Out[100]:
A B C D E F
a1 1 2 1 2 3 1
a2 2 4 2 4 4 1
a3 3 2 3 2 2 1
a4 4 1 4 1 1 1
a5 5 9 5 9 2 1
In [101]: group_duplicate_cols(df)
Out[101]: [['A', 'C'], ['B', 'D']]
# Let's add one more duplicate into group containing 'A'
In [102]: df.F = df.A
In [103]: group_duplicate_cols(df)
Out[103]: [['A', 'C', 'F'], ['B', 'D']]
To do the same, but for rows (index), we just need to switch the operations along the other axis, like so -
def group_duplicate_rows(df):
    a = df.values
    sidx = np.lexsort(a.T)
    b = a[sidx]
    m = np.concatenate(( [False], (b[1:] == b[:-1]).all(1), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.index[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2], idx[1::2]+1)]
Sample run -
In [260]: df2
Out[260]:
a1 a2 a3 a4 a5
A 3 5 3 4 5
B 1 1 1 1 1
C 3 5 3 4 5
D 2 9 2 1 9
E 2 2 2 1 2
F 1 1 1 1 1
In [261]: group_duplicate_rows(df2)
Out[261]: [['B', 'F'], ['A', 'C']]
Approaches -
# @John Galt's soln-1
from itertools import combinations
def combinations_app(df):
    return [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]

# @Abdou's soln
def pandas_groupby_app(df):
    return [tuple(d.index) for _, d in df.T.groupby(list(df.T.columns)) if len(d) > 1]

# @COLDSPEED's soln
def triu_app(df):
    c = df.columns.tolist()
    i, j = np.triu_indices(len(c), 1)
    x = [(c[_i], c[_j]) for _i, _j in zip(i, j) if (df[c[_i]] == df[c[_j]]).all()]
    return x

# @cmaher's soln
def lambda_set_app(df):
    return list(filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns]))))
Note: @John Galt's soln-2 wasn't included because, with inputs of size (8000,500), the proposed broadcasting would blow up memory for that one.
Timings -
In [179]: # Setup inputs with sizes as mentioned in the question
...: df = pd.DataFrame(np.random.randint(0,10,(8000,500)))
...: df.columns = ['C'+str(i) for i in range(df.shape[1])]
...: idx0 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
...: idx1 = np.random.choice(df.shape[1], df.shape[1]//2,replace=0)
...: df.iloc[:,idx0] = df.iloc[:,idx1].values
...:
# @John Galt's soln-1
In [180]: %timeit combinations_app(df)
1 loops, best of 3: 24.6 s per loop
# @Abdou's soln
In [181]: %timeit pandas_groupby_app(df)
1 loops, best of 3: 3.81 s per loop
# @COLDSPEED's soln
In [182]: %timeit triu_app(df)
1 loops, best of 3: 25.5 s per loop
# @cmaher's soln
In [183]: %timeit lambda_set_app(df)
1 loops, best of 3: 27.1 s per loop
# Proposed in this post
In [184]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 188 ms per loop
Super boost with NumPy's view functionality
Leveraging NumPy's view functionality, which lets us view each group of elements as one dtype, we could gain a further noticeable performance boost, like so -
def view1D(a):  # a is a 2D array
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.dtype.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()

def group_duplicate_cols_v2(df):
    a = df.values
    sidx = view1D(a.T).argsort()
    b = a[:,sidx]
    m = np.concatenate(( [False], (b[:,1:] == b[:,:-1]).all(0), [False] ))
    idx = np.flatnonzero(m[1:] != m[:-1])
    C = df.columns[sidx].tolist()
    return [C[i:j] for i,j in zip(idx[::2], idx[1::2]+1)]
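To see what view1D does, here is a minimal sketch on a toy array: each row gets reinterpreted as a single np.void scalar, so whole rows can be compared (and argsorted) with scalar operations -
a = np.array([[1, 2], [1, 2], [3, 4]])
v = view1D(a)        # one np.void element per row
print(v[0] == v[1])  # True  -> rows 0 and 1 are byte-identical
print(v[0] == v[2])  # False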
Timings -
In [322]: %timeit group_duplicate_cols(df)
10 loops, best of 3: 185 ms per loop
In [323]: %timeit group_duplicate_cols_v2(df)
10 loops, best of 3: 69.3 ms per loop
Just crazy speedups!
Upvotes: 9
Reputation: 76927
Here's a one-liner
In [22]: from itertools import combinations
In [23]: [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
Out[23]: [('A', 'C'), ('B', 'D')]
Alternatively, using NumPy broadcasting (better yet, look at Divakar's solution; as his note there points out, this broadcasting approach gets memory-heavy on large inputs):
In [124]: cols = df.columns
In [125]: dftv = df.T.values
In [126]: cross = pd.DataFrame((dftv == dftv[:, None]).all(-1), cols, cols)
In [127]: cross
Out[127]:
A B C D E F
A True False True False False False
B False True False True False False
C True False True False False False
D False True False True False False
E False False False False True False
F False False False False False True
# Only take values from lower triangle
In [128]: s = cross.where(np.tri(*cross.shape, k=-1)).unstack()
In [129]: s[s == 1].index.tolist()
Out[129]: [('A', 'C'), ('B', 'D')]
Upvotes: 6
Reputation: 5215
Here's one more option using only comprehensions/built-ins:
filter(lambda x: len(x) > 1, list(set([tuple([x for x in df.columns if all(df[x] == df[y])]) for y in df.columns])))
Result:
[('A', 'C'), ('B', 'D')]
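Note that on Python 3, filter returns a lazy iterator rather than a list, so to actually see the result above you would materialize it, e.g.:
result = list(filter(lambda x: len(x) > 1,
                     set(tuple(x for x in df.columns if all(df[x] == df[y]))
                         for y in df.columns)))
print(result)  # [('A', 'C'), ('B', 'D')] (set order may vary)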
Upvotes: 0
Reputation: 13274
This should also do:
[tuple(d.index) for _,d in df.T.groupby(list(df.T.columns)) if len(d) > 1]
Yields:
# [('A', 'C'), ('B', 'D')]
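Spelled out, the same idea looks like this (a sketch using the question's df): transposing turns columns into rows, and grouping the transposed frame by all of its value columns collects identical columns into one group -
dft = df.T
# group the transposed rows (i.e. the original columns) by their full value tuples
groups = dft.groupby(list(dft.columns))
print([tuple(g.index) for _, g in groups if len(g) > 1])
# [('A', 'C'), ('B', 'D')]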
Upvotes: 6
Reputation: 2424
Based on @John Galt's one-liner, which is:
result_col = [x for x in combinations(df.columns, 2) if (df[x[0]] == df[x[-1]]).all()]
you can get result_row as follows, using the transpose (df.T):
result_row = [x for x in combinations(df.T.columns, 2) if (df.T[x[0]] == df.T[x[-1]]).all()]
Upvotes: 1