Reputation: 303
.value_counts() gets rid of the rest of my data. How can I analyze my data without losing the rest of the information? Or is there another way to count values that doesn't remove the rest of my columns of data?
Here's my code:
from pandas import DataFrame, read_csv
import pandas as pd
f1 = pd.read_csv('lastlogonuser.txt', sep='\t', encoding='latin1')
f2 = pd.read_csv('UserAccounts.csv', sep=',', encoding='latin1')
f2 = f2.rename(columns={'Shortname':'User Name'})
f = pd.concat([f1, f2])
counts = f['User Name'].value_counts()
f = counts[counts == 1]
f
I get something like this when I run my code:
sample534 1
sample987 1
sample342 1
sample321 1
sample123 1
I would like something like:
User Name Description CN Account
1 sample534 Journal Mailbox managed by
1 sample987 Journal Mailbox managed by
1 sample342 Journal Mailbox managed by
1 sample321 Journal Mailbox managed by
1 sample123 Journal Mailbox managed by
Sample of data I am using:
Account User Name User CN Description
ENABLED MBJ29 CN=MBJ29,CN=Users Journal Mailbox managed by
ENABLED MBJ14 CN=MBJ14,CN=Users Journal Mailbox managed by
ENABLED MBJ08 CN=MBJ30,CN=Users Journal Mailbox managed by
ENABLED MBJ07 CN=MBJ07,CN=Users Journal Mailbox managed by
Upvotes: 0
Views: 93
Reputation: 33793
You can use DataFrame.duplicated to determine which rows are duplicates, and then filter with loc:
f = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
The subset parameter restricts the duplicate check to the 'User Name' column, and keep=False flags every occurrence of a duplicate rather than keeping the first or last one. Since duplicated returns True for duplicates, I negated it with ~, so only rows whose 'User Name' appears exactly once remain, with all of their columns intact.
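If it helps to see what the mask does, here is a minimal sketch on a small made-up frame (the column names and values are placeholders modelled on the sample data above, not your real files):

import pandas as pd

f = pd.DataFrame({
    'Account': ['ENABLED'] * 4,
    'User Name': ['sample534', 'MBJ29', 'MBJ29', 'sample987'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

mask = f.duplicated(subset=['User Name'], keep=False)  # True for both 'MBJ29' rows
print(f.loc[~mask, :])  # keeps only 'sample534' and 'sample987', with every column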
This appears to be much more efficient than groupby when testing on a fairly large DataFrame with a fair amount of duplicates:
%timeit f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
100 loops, best of 3: 17.4 ms per loop
%timeit f.groupby('User Name').filter(lambda x: len(x) == 1)
1 loop, best of 3: 6.78 s per loop
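If you also want the count itself as a column next to the rest of the data, as in the desired output in the question, one option (a sketch using plain pandas, not tested against your actual files) is to map value_counts back onto the frame:

counts = f['User Name'].value_counts()       # Series: user name -> number of occurrences
f['Count'] = f['User Name'].map(counts)      # attach that count to every row
result = f[f['Count'] == 1]                  # all original columns kept, plus Count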
Upvotes: 1