Reputation: 475
Assuming a sample dataframe:
Chemical Compound Name
0 Alcohol Ethanol Liquor
1 Hooch NaN Liquor
2 Cerveza Ethanol NaN
3 Bauxite Aluminium Gibbsite
4 Feldspar Aluminium NaN
What is the efficient way to replace or identify if two rows are identical? (Assuming two rows are identical if any attribute(column) match and not necessarily all of them)
The resulting could be either:
Chemical Compound Name
0 Alcohol Ethanol Liquor
1 Alcohol NaN Liquor
2 Alcohol Ethanol NaN
3 Bauxite Aluminium Gibbsite
4 Bauxite Aluminium NaN
or:
Chemical Compound Name Identifier
0 Alcohol Ethanol Liquor Alcohol
1 Hooch NaN Liquor Alcohol
2 Cerveza Ethanol NaN Alcohol
3 Bauxite Aluminium Gibbsite Bauxite
4 Feldspar Aluminium NaN Bauxite
Upvotes: 1
Views: 349
Reputation: 353419
This is a set consolidation / connected components / union-find question in disguise.
If we arbitrarily decide to view it as a connected components problem, we can imagine every word in your frame as a node. A a row basically says that the elements there are equivalent, or in other words, reachable: there are edges between the nodes. To determine the set of synonyms, we need to find the connected components of the graph.
import networkx as nx
G = nx.from_pandas_dataframe(df.stack().reset_index(), source='level_0', target=0)
codes = {v: i for i, vv in enumerate(nx.connected_components(G)) for v in vv}
df["Identifier"] = df["Chemical"].groupby(df["Chemical"].replace(codes)).transform("first")
gives me
In [229]: df
Out[229]:
Chemical Compound Name Identifier
0 Alcohol Ethanol Liquor Alcohol
1 Hooch NaN Liquor Alcohol
2 Cerveza Ethanol NaN Alcohol
3 Bauxite Aluminium NaN Bauxite
4 Feldspar Aluminium NaN Bauxite
because once we make the graph with edges (equivalencies)
In [233]: G.edges()
Out[233]:
[(0, 'Alcohol'),
(0, 'Ethanol'),
(0, 'Liquor'),
('Ethanol', 2),
('Liquor', 1),
(1, 'Hooch'),
(2, 'Cerveza'),
(3, 'Bauxite'),
(3, 'Aluminium'),
('Aluminium', 4),
(4, 'Feldspar')]
we can ask networkx to find the groups:
In [234]: list(nx.connected_components(G))
Out[234]:
[{0, 1, 2, 'Alcohol', 'Cerveza', 'Ethanol', 'Hooch', 'Liquor'},
{3, 4, 'Aluminium', 'Bauxite', 'Feldspar'}]
And then the rest is just turning these into numbers and arbitrarily choosing to use the first Chemical entry as the name of each group.
We could do exactly the same thing by using scipy's scipy.sparse.csgraph.connected_components function, with a little bit more setup, or simply use an off-the-shelf set consolidation algorithm to find the groups. For example, using the set consolidation algorithm here, we could do
In [240]: consolidate([set(row.dropna()) for _, row in df.iterrows()])
Out[240]:
[{'Alcohol', 'Cerveza', 'Ethanol', 'Hooch', 'Liquor'},
{'Aluminium', 'Bauxite', 'Feldspar'}]
and once again we have the groups we need.
Upvotes: 3
Reputation: 109686
To identify rows that have at least one matching column:
>>> df.apply(lambda x: x.dropna().duplicated()).any(axis=1)
0 False
1 True
2 True
3 False
4 True
dtype: bool
In the above, rows 1, 2 and 4 are 'duplicates'. Row 1: Liquor, Row 2: Ethanol, and Row 4: Aluminum.
I'm not clear about your fill logic, however.
Upvotes: 1