kfp_ny
kfp_ny

Reputation: 39

pandas Grouping based on two variables

+---------+---------+-------+
| g_var_1 | g_var_2 | group |
+---------+---------+-------+
| A       | B       | 1     |
+---------+---------+-------+
| B       | A       | 1     |
+---------+---------+-------+
| C       | D       | 2     |
+---------+---------+-------+
| D       | C       | 2     |
+---------+---------+-------+
| E       | F       | 3     |
+---------+---------+-------+
| F       | E       | 3     |
+---------+---------+-------+
| G       | H       | 4     |
+---------+---------+-------+
| H       | G       | 4     |
+---------+---------+-------+

Using pandas: I am trying to create a "group" variable based on "g_var_1" and "g_var_2". As you can see from the above ASCII table, the logic is that the same combinations of "g_var_1" and g_var_2" are grouped together. So observations with (g_var_1 == "A" and g_var_2 == "B") would be in the same group as observations with (g_var_1 == "B" and g_var_2 == "A").

The dataset that I am working with has more than a thousand rows, so doing this manually is not an optimal solution for me.

Any help would be greatly appreciated. Thanks in advance!

Upvotes: 1

Views: 27

Answers (2)

filbranden
filbranden

Reputation: 8898

Create a column with a combined g_var that stores a tuple including the two values.

In order to make them compare the same, keep the tuple sorted, so "B", "A" will also become ("A", "B").

You can use the apply() method, using a custom function, to achieve that result.

def make_tuple_var(row):
    var1 = row['g_var_1']
    var2 = row['g_var_2']
    if var1 > var2:
        # Swap them.
        var1, var2 = var2, var1
    return (var1, var2)

Then you can use it in:

df['g_var'] = df.apply(make_tuple_var, axis=1)

And finally you can groupby('g_var'), which will group all equivalent pairs together.

Upvotes: 1

BENY
BENY

Reputation: 323326

First sort then use ngroup with groupby

l=['g_var_1','g_var_2']
pd.DataFrame(np.sort(df[l],1),columns=l).groupby(l).ngroup().add(1)
Out[340]: 
0    1
1    1
2    2
3    2
4    3
5    3
6    4
7    4
dtype: int64
df['group']=pd.DataFrame(np.sort(df[l],1),columns=l).groupby(l).ngroup().add(1)

.values

Upvotes: 2

Related Questions