Reputation: 461
I have a single dataframe like the following:
ip domain
0 46.101.214.145 cnwkabrnyld1c0[.]com
1 46.101.214.145 anfj63m[.]com
2 46.101.214.145 anf3xnem[.]com
3 69.195.129.70 cnwkabrnyld1c0[.]com
4 69.195.129.70 anfj63ms[.]com
I am trying to groupby the "ip" column such that it will aggregate the duplicates onto the same line.
Desired output:
ip domain
0 46.101.214.145 cnwkabrnyld1c0[.]com | anfj63m[.]com | anf3xnem[.]com
3 69.195.129.70 cnwkabrnyld1c0[.]com
4 69.195.129.70 anfj63ms[.]com
This seems pretty straight forward based on the countless SOF and Panda documentation I've been reading.
Here's my code:
pDNSPanda = pd.read_csv('/tmp/pDNSCSV.csv', names=['ip', 'domain'])
g = pDNSPanda.groupby("ip").agg('|'.join)
Here's what I actually get:
domain
ip
0.0.0.0 ip|domain
1.1.1.200 ip|domain
1.148.56.163 ip|domain
1.167.81.129 ip|domain
1.193.28.230 ip|domain
Edit:
It was suggested that I try my own code against the above data set.
test.csv
46.101.214.145 cnwkabrnyld1c0[.]com
46.101.214.145 anfj63m[.]com
46.101.214.145 anf3xnem[.]com
69.195.129.70 cnwkabrnyld1c0[.]com
69.195.129.70 anfj63ms[.]com
>>> import pandas as pd
>>> p = pd.read_csv('/tmp/test.csv', names=['ip', 'domain'])
>>> g = p.groupby("ip").agg("|".join)
>>> print g
domain
ip
46.101.214.145 anfj63m[.]com ip|domain
46.101.214.145 anf3xnem[.]com ip|domain
46.101.214.145 cnwkabrnyld1c0[.]com ip|domain
69.195.129.70 anfj63ms[.]comp|domain
69.195.129.70 cnwkabrnyld1c0[.]com ip|domain
Upvotes: 0
Views: 845
Reputation: 109546
You need domain
between the groupby
and agg
commands. Add spacing around you pipe join if desired, e.g. ' | '.join(...)
.
pDNSPanda.groupby("ip").domain.agg('|'.join)
ip
46.101.214.145 cnwkabrnyld1c0[.]com|anfj63m[.]com|anf3xnem[.]com
69.195.129.70 cnwkabrnyld1c0[.]com|anfj63ms[.]com
Name: domain, dtype: object
If you have floats in your domains, it is best to convert them to strings.
df['domain'] = df.domain.astype(str)
Upvotes: 2
Reputation: 862641
I think this first values in column ip
are unique like my first and second row - then you get only one value in column domain
after groupby
:
print pDNSPanda
ip domain
0 0.0.0.0 ip|domain
1 1.1.1.200 ip|domain
2 46.101.214.145 cnwkabrnyld1c0[.]com
3 46.101.214.145 anfj63m[.]com
4 46.101.214.145 anf3xnem[.]com
5 69.195.129.70 cnwkabrnyld1c0[.]com
6 69.195.129.70 anfj63ms[.]com
g = pDNSPanda.groupby("ip").agg('|'.join)
print g
domain
ip
0.0.0.0 ip|domain
1.1.1.200 ip|domain
46.101.214.145 cnwkabrnyld1c0[.]com|anfj63m[.]com|anf3xnem[.]com
69.195.129.70 cnwkabrnyld1c0[.]com|anfj63ms[.]com
But I think is better use groupby
with aggregate column, as Alexander
mentioned in his answer.
Upvotes: 1