sunny
sunny

Reputation: 653

Adding a new column based on frequency counts of groups

Apologies if my title is not very descriptive. But this is what I am trying to achieve:-

I have a dataframe with 2 fields, namely, src port and destination port for a tcp traffic. The csv dump of the dataframe looks something like this:-

srcp    dstp
55647    22
22       55670
2222    56354
55670   22
55670   22
2222    56354
56362   139
22      55670
22      55670
56354   2222
22      55670
56354   2222

The coln 1 is the src port and coln 2 is the destination port.

I want to look at the pairs src port- destination port and get their frequency and store it in a new coln frequency . In other words, for the example above, I want do this:-

  55647 22    1
  22   55670  6
  2222 56354  4
  55670 22    6
  55670 22    6
  2222  56354 4
  56362 139   1
  22    55670 6
  22    55670 6
  56354 2222  4
  22    55670 6
  56354 2222  4

As you can see that ports 2222-5634 has a frequency of 4 ( since traffic can flow in either direction and hence they need to identified as one connection) and therefore a value of 4 is repeated in the frequency coln cell corresponding to the value 2222-56354 in the either of the destination/src port.

Since still climbing the ladder in python, I am wondering how do I achieve this.

Upvotes: 1

Views: 2204

Answers (3)

cs95
cs95

Reputation: 402854

Option 1

Using np.sort, but without df.apply

df['freq'] = pd.DataFrame(np.sort(df.values, 1), columns=df.columns)\
                       .groupby(['srcp', 'dstp'])['srcp'].transform('count')
df

     srcp   dstp  freq
0   55647     22     1
1      22  55670     6
2    2222  56354     4
3   55670     22     6
4   55670     22     6
5    2222  56354     4
6   56362    139     1
7      22  55670     6
8      22  55670     6
9   56354   2222     4
10     22  55670     6
11  56354   2222     4

Option 2

You can also approach this problem using frozensets

df2 = df.apply(frozenset, 1).reset_index()    
s = df2.groupby(df2.columns[-1]).index.transform('count')
print(s)

0     1
1     6
2     4
3     6
4     6
5     4
6     1
7     6
8     6
9     4
10    6
11    4

df['freq'] = s
df

     srcp   dstp  freq
0   55647     22     1
1      22  55670     6
2    2222  56354     4
3   55670     22     6
4   55670     22     6
5    2222  56354     4
6   56362    139     1
7      22  55670     6
8      22  55670     6
9   56354   2222     4
10     22  55670     6
11  56354   2222     4

Performance

Small

1000 loops, best of 3: 1.82 ms per loop  # jezrael #1
1000 loops, best of 3: 1.84 ms per loop  # jezrael #2
1000 loops, best of 3: 1.78 ms per loop  # mine #1
100 loops, best of 3: 2.6 ms per loop    # mine #2    
100 loops, best of 3: 3.96 ms per loop   # John Galt

Large (df * 10000)

100 loops, best of 3: 12.1 ms per loop   # jezrael #1
100 loops, best of 3: 11.9 ms per loop   # jezrael #2
100 loops, best of 3: 11.9 ms per loop   # mine #1
1 loop, best of 3: 3.51 s per loop       # mine #2 
1 loop, best of 3: 14.8 s per loop       # John Galt

Upvotes: 1

jezrael
jezrael

Reputation: 863236

First sort all values per rows by numpy.sort and then groupby with transform function size:

Notice: Difference between size and count is count NOT count NaN values.

df['a'] = pd.DataFrame(np.sort(df.values, 1)).groupby([0,1])[0].transform('size')
print (df)
     srcp   dstp  a
0   55647     22  1
1      22  55670  6
2    2222  56354  4
3   55670     22  6
4   55670     22  6
5    2222  56354  4
6   56362    139  1
7      22  55670  6
8      22  55670  6
9   56354   2222  4
10     22  55670  6
11  56354   2222  4

Similar:

df['a'] =  pd.DataFrame(np.sort(df.values, 1))
             .groupby(list(range(len(df.columns))))[0]
             .transform('size')
print (df)

     srcp   dstp  a
0   55647     22  1
1      22  55670  6
2    2222  56354  4
3   55670     22  6
4   55670     22  6
5    2222  56354  4
6   56362    139  1
7      22  55670  6
8      22  55670  6
9   56354   2222  4
10     22  55670  6
11  56354   2222  4

Upvotes: 1

Zero
Zero

Reputation: 76957

You could use sort and groupby on columns

In [1923]: df['freq'] = (df.apply(np.sort, 1)
                           .groupby(['srcp', 'dstp'])['srcp']
                           .transform('size'))

In [1924]: df
Out[1924]:
     srcp   dstp  freq
0   55647     22     1
1      22  55670     6
2    2222  56354     4
3   55670     22     6
4   55670     22     6
5    2222  56354     4
6   56362    139     1
7      22  55670     6
8      22  55670     6
9   56354   2222     4
10     22  55670     6
11  56354   2222     4

Upvotes: 1

Related Questions