outflanker
outflanker

Reputation: 475

Replace column value of Pandas with another if any single attribute match (identify if two columns have common attribute)

Assuming a sample dataframe:

   Chemical   Compound     Name
0   Alcohol    Ethanol   Liquor
1     Hooch        NaN   Liquor
2   Cerveza    Ethanol      NaN
3   Bauxite  Aluminium Gibbsite
4  Feldspar  Aluminium      NaN

What is the efficient way to replace or identify if two rows are identical? (Assuming two rows are identical if any attribute(column) match and not necessarily all of them)

The resulting could be either:

   Chemical   Compound     Name
0   Alcohol    Ethanol   Liquor
1   Alcohol        NaN   Liquor
2   Alcohol    Ethanol      NaN
3   Bauxite  Aluminium Gibbsite
4   Bauxite  Aluminium      NaN

or:

   Chemical   Compound     Name Identifier
0   Alcohol    Ethanol   Liquor    Alcohol
1     Hooch        NaN   Liquor    Alcohol
2   Cerveza    Ethanol      NaN    Alcohol
3   Bauxite  Aluminium Gibbsite    Bauxite
4  Feldspar  Aluminium      NaN    Bauxite

Upvotes: 1

Views: 349

Answers (2)

DSM
DSM

Reputation: 353419

This is a set consolidation / connected components / union-find question in disguise.

If we arbitrarily decide to view it as a connected components problem, we can imagine every word in your frame as a node. A a row basically says that the elements there are equivalent, or in other words, reachable: there are edges between the nodes. To determine the set of synonyms, we need to find the connected components of the graph.

import networkx as nx
G = nx.from_pandas_dataframe(df.stack().reset_index(), source='level_0', target=0)
codes = {v: i for i, vv in enumerate(nx.connected_components(G)) for v in vv}
df["Identifier"] = df["Chemical"].groupby(df["Chemical"].replace(codes)).transform("first")

gives me

In [229]: df
Out[229]: 
   Chemical   Compound    Name Identifier
0   Alcohol    Ethanol  Liquor    Alcohol
1     Hooch        NaN  Liquor    Alcohol
2   Cerveza    Ethanol     NaN    Alcohol
3   Bauxite  Aluminium     NaN    Bauxite
4  Feldspar  Aluminium     NaN    Bauxite

because once we make the graph with edges (equivalencies)

In [233]: G.edges()
Out[233]: 
[(0, 'Alcohol'),
 (0, 'Ethanol'),
 (0, 'Liquor'),
 ('Ethanol', 2),
 ('Liquor', 1),
 (1, 'Hooch'),
 (2, 'Cerveza'),
 (3, 'Bauxite'),
 (3, 'Aluminium'),
 ('Aluminium', 4),
 (4, 'Feldspar')]

we can ask networkx to find the groups:

In [234]: list(nx.connected_components(G))
Out[234]: 
[{0, 1, 2, 'Alcohol', 'Cerveza', 'Ethanol', 'Hooch', 'Liquor'},
 {3, 4, 'Aluminium', 'Bauxite', 'Feldspar'}]

And then the rest is just turning these into numbers and arbitrarily choosing to use the first Chemical entry as the name of each group.

We could do exactly the same thing by using scipy's scipy.sparse.csgraph.connected_components function, with a little bit more setup, or simply use an off-the-shelf set consolidation algorithm to find the groups. For example, using the set consolidation algorithm here, we could do

In [240]: consolidate([set(row.dropna()) for _, row in df.iterrows()])
Out[240]: 
[{'Alcohol', 'Cerveza', 'Ethanol', 'Hooch', 'Liquor'},
 {'Aluminium', 'Bauxite', 'Feldspar'}]

and once again we have the groups we need.

Upvotes: 3

Alexander
Alexander

Reputation: 109686

To identify rows that have at least one matching column:

>>> df.apply(lambda x: x.dropna().duplicated()).any(axis=1)
0    False
1     True
2     True
3    False
4     True
dtype: bool

In the above, rows 1, 2 and 4 are 'duplicates'. Row 1: Liquor, Row 2: Ethanol, and Row 4: Aluminum.

I'm not clear about your fill logic, however.

Upvotes: 1

Related Questions