Reputation: 39
+---------+---------+-------+
| g_var_1 | g_var_2 | group |
+---------+---------+-------+
| A | B | 1 |
+---------+---------+-------+
| B | A | 1 |
+---------+---------+-------+
| C | D | 2 |
+---------+---------+-------+
| D | C | 2 |
+---------+---------+-------+
| E | F | 3 |
+---------+---------+-------+
| F | E | 3 |
+---------+---------+-------+
| G | H | 4 |
+---------+---------+-------+
| H | G | 4 |
+---------+---------+-------+
Using pandas: I am trying to create a "group" variable based on "g_var_1" and "g_var_2". As you can see from the above ASCII table, the logic is that the same combinations of "g_var_1" and g_var_2" are grouped together. So observations with (g_var_1 == "A" and g_var_2 == "B") would be in the same group as observations with (g_var_1 == "B" and g_var_2 == "A").
The dataset that I am working with has more than a thousand rows, so doing this manually is not an optimal solution for me.
Any help would be greatly appreciated. Thanks in advance!
Upvotes: 1
Views: 27
Reputation: 8898
Create a column with a combined g_var
that stores a tuple including the two values.
In order to make them compare the same, keep the tuple sorted, so "B", "A"
will also become ("A", "B")
.
You can use the apply()
method, using a custom function, to achieve that result.
def make_tuple_var(row):
var1 = row['g_var_1']
var2 = row['g_var_2']
if var1 > var2:
# Swap them.
var1, var2 = var2, var1
return (var1, var2)
Then you can use it in:
df['g_var'] = df.apply(make_tuple_var, axis=1)
And finally you can groupby('g_var')
, which will group all equivalent pairs together.
Upvotes: 1
Reputation: 323326
First sort
then use ngroup
with groupby
l=['g_var_1','g_var_2']
pd.DataFrame(np.sort(df[l],1),columns=l).groupby(l).ngroup().add(1)
Out[340]:
0 1
1 1
2 2
3 2
4 3
5 3
6 4
7 4
dtype: int64
df['group']=pd.DataFrame(np.sort(df[l],1),columns=l).groupby(l).ngroup().add(1)
.values
Upvotes: 2