Reputation: 653
Apologies if my title is not very descriptive, but this is what I am trying to achieve:
I have a dataframe with two fields, namely the source port and destination port of TCP traffic. The CSV dump of the dataframe looks something like this:
srcp dstp
55647 22
22 55670
2222 56354
55670 22
55670 22
2222 56354
56362 139
22 55670
22 55670
56354 2222
22 55670
56354 2222
Column 1 is the source port and column 2 is the destination port.
I want to look at the source port/destination port pairs, get their frequency, and store it in a new column, frequency. In other words, for the example above, I want to do this:
55647 22 1
22 55670 6
2222 56354 4
55670 22 6
55670 22 6
2222 56354 4
56362 139 1
22 55670 6
22 55670 6
56354 2222 4
22 55670 6
56354 2222 4
As you can see, the pair 2222-56354 has a frequency of 4. Since traffic can flow in either direction, both directions need to be identified as one connection, and therefore the value 4 is repeated in the frequency column of every row where 2222 and 56354 appear as the source/destination ports, in either order.
Since I am still climbing the Python ladder, I am wondering how to achieve this.
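For reference, the example data and the desired direction-agnostic count can be sketched without any pandas-specific tricks, by canonicalizing each pair as a sorted tuple and counting (a minimal sketch of the idea, not taken from any of the answers below):

```python
from collections import Counter

import pandas as pd

# The sample data from the CSV dump above
df = pd.DataFrame({
    'srcp': [55647, 22, 2222, 55670, 55670, 2222, 56362, 22, 22, 56354, 22, 56354],
    'dstp': [22, 55670, 56354, 22, 22, 56354, 139, 55670, 55670, 2222, 55670, 2222],
})

# Sort each (src, dst) pair so both traffic directions map to the same key
pairs = [tuple(sorted(p)) for p in zip(df['srcp'], df['dstp'])]
counts = Counter(pairs)
df['frequency'] = [counts[p] for p in pairs]
print(df)
```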
Upvotes: 1
Views: 2204
Reputation: 402854
Option 1
Using np.sort, but without df.apply:
df['freq'] = pd.DataFrame(np.sort(df.values, 1), columns=df.columns)\
.groupby(['srcp', 'dstp'])['srcp'].transform('count')
df
srcp dstp freq
0 55647 22 1
1 22 55670 6
2 2222 56354 4
3 55670 22 6
4 55670 22 6
5 2222 56354 4
6 56362 139 1
7 22 55670 6
8 22 55670 6
9 56354 2222 4
10 22 55670 6
11 56354 2222 4
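The key step is that np.sort(df.values, 1) sorts each row independently, so both directions of a connection collapse onto the same (low, high) pair before grouping. A minimal illustration:

```python
import numpy as np

# Two rows representing the same connection in opposite directions
arr = np.array([[55670, 22],
                [22, 55670]])

# axis=1 sorts within each row, making both rows identical
print(np.sort(arr, 1))
```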
Option 2
You can also approach this problem using frozensets:
df2 = df.apply(frozenset, 1).reset_index()
s = df2.groupby(df2.columns[-1]).index.transform('count')
print(s)
0 1
1 6
2 4
3 6
4 6
5 4
6 1
7 6
8 6
9 4
10 6
11 4
df['freq'] = s
df
srcp dstp freq
0 55647 22 1
1 22 55670 6
2 2222 56354 4
3 55670 22 6
4 55670 22 6
5 2222 56354 4
6 56362 139 1
7 22 55670 6
8 22 55670 6
9 56354 2222 4
10 22 55670 6
11 56354 2222 4
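frozensets work here because they compare equal regardless of element order, so both traffic directions fall into the same group. One caveat (my observation, not from the answer): a row whose two ports happen to be equal collapses to a single-element set, unlike the np.sort approach, which keeps both values:

```python
# Order does not matter for frozenset equality
a = frozenset([22, 55670])
b = frozenset([55670, 22])
print(a == b)  # True

# Caveat: duplicate values collapse, e.g. a hypothetical 22 -> 22 row
print(frozenset([22, 22]))  # frozenset({22})
```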
Performance
1000 loops, best of 3: 1.82 ms per loop # jezrael #1
1000 loops, best of 3: 1.84 ms per loop # jezrael #2
1000 loops, best of 3: 1.78 ms per loop # mine #1
100 loops, best of 3: 2.6 ms per loop # mine #2
100 loops, best of 3: 3.96 ms per loop # John Galt
df * 10000
100 loops, best of 3: 12.1 ms per loop # jezrael #1
100 loops, best of 3: 11.9 ms per loop # jezrael #2
100 loops, best of 3: 11.9 ms per loop # mine #1
1 loop, best of 3: 3.51 s per loop # mine #2
1 loop, best of 3: 14.8 s per loop # John Galt
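The timings above come from IPython's %timeit. A plain-Python sketch of how they could be reproduced on a larger frame (using pd.concat to repeat the rows, which is my assumption about what "df * 10000" means here):

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'srcp': [55647, 22, 2222], 'dstp': [22, 55670, 56354]})
big = pd.concat([df] * 10000, ignore_index=True)  # 30,000 rows

def option1(frame):
    # Option 1 from above: sort rows, then count per canonical pair
    return pd.DataFrame(np.sort(frame.values, 1), columns=frame.columns) \
        .groupby(['srcp', 'dstp'])['srcp'].transform('count')

elapsed = timeit.timeit(lambda: option1(big), number=10)
print(f'{elapsed / 10 * 1000:.2f} ms per loop')
```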
Upvotes: 1
Reputation: 863236
First sort all values per row with numpy.sort, then groupby and transform with the function size:
Note: the difference between size and count is that count does NOT count NaN values, while size does.
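The size/count distinction can be seen on a small frame containing NaN (an illustrative example, not part of the port data above):

```python
import numpy as np
import pandas as pd

tmp = pd.DataFrame({'key': ['a', 'a', 'b'], 'val': [1.0, np.nan, 2.0]})
g = tmp.groupby('key')['val']
print(g.size())   # a -> 2, b -> 1: NaN rows are included
print(g.count())  # a -> 1, b -> 1: NaN rows are excluded
```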
df['a'] = pd.DataFrame(np.sort(df.values, 1)).groupby([0,1])[0].transform('size')
print (df)
srcp dstp a
0 55647 22 1
1 22 55670 6
2 2222 56354 4
3 55670 22 6
4 55670 22 6
5 2222 56354 4
6 56362 139 1
7 22 55670 6
8 22 55670 6
9 56354 2222 4
10 22 55670 6
11 56354 2222 4
Similar:
df['a'] = (pd.DataFrame(np.sort(df.values, 1))
             .groupby(list(range(len(df.columns))))[0]
             .transform('size'))
print (df)
srcp dstp a
0 55647 22 1
1 22 55670 6
2 2222 56354 4
3 55670 22 6
4 55670 22 6
5 2222 56354 4
6 56362 139 1
7 22 55670 6
8 22 55670 6
9 56354 2222 4
10 22 55670 6
11 56354 2222 4
Upvotes: 1
Reputation: 76957
You could use sort and groupby on the columns:
In [1923]: df['freq'] = (df.apply(np.sort, 1)
.groupby(['srcp', 'dstp'])['srcp']
.transform('size'))
In [1924]: df
Out[1924]:
srcp dstp freq
0 55647 22 1
1 22 55670 6
2 2222 56354 4
3 55670 22 6
4 55670 22 6
5 2222 56354 4
6 56362 139 1
7 22 55670 6
8 22 55670 6
9 56354 2222 4
10 22 55670 6
11 56354 2222 4
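Note that this relies on df.apply(np.sort, 1) returning a DataFrame with the original column labels; on newer pandas versions, apply with axis=1 may instead return a Series of arrays, which breaks the groupby on named columns. A version-robust sketch of the same idea (my variant, not from the answer):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'srcp': [55670, 22, 2222], 'dstp': [22, 55670, 56354]})

# Rebuild the row-sorted frame with the original labels kept explicitly,
# so the groupby keys below stay valid regardless of pandas version.
both = pd.DataFrame(np.sort(df.values, 1), columns=df.columns, index=df.index)
df['freq'] = both.groupby(['srcp', 'dstp'])['srcp'].transform('size')
print(df)
```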
Upvotes: 1