Reputation: 229
Two dataframe columns:
data['IP'] data['domain']
10.20.30.40 example.org
10.20.30.40 example.org
10.20.30.40 example.org
10.20.30.40 example.org
1.2.3.4 google.com
1.2.3.4 google.com
1.2.3.4 google.com
200.100.200.100 yahoo.com
200.100.200.100 yahoo.com
9.8.7.6 random.com
I want to find an efficient way that counts how many times each domain is being mapped to the same IP address. Then, if the number of occurences is more than two(2) , take the specific domains ( BUT only the unique values) and move them to another dataframe or column.
So the output could be something like:
[Occurences] [To be processed]
4 example.org
4 google.com
4
4
3
3
3
I have tried different things , like Graphs and then taking the degree of the nodes , and pivot tables to express the number , but I want an efficient way that will allow me to continue with the processing of the domain after the if occur>2 statement.
All should be implemented with python panda dataframes!
Upvotes: 2
Views: 176
Reputation: 393953
The following performs a groupby
on 'domain' and then calls value_counts
on 'IP' address, we then filter this and reset the index and rename the columns so they are more meaningful:
In [58]:
gp = df.groupby('domain')['IP'].value_counts()
df1 = gp[gp > 2].reset_index()
df1.rename(columns={'level_1': 'IP', 0:'Occurences'}, inplace=True)
df1
Out[58]:
domain IP Occurences
0 example.org 10.20.30.40 4
1 google.com 1.2.3.4 3
Upvotes: 3