JetCorey

Reputation: 303

Python: How do you keep all of your data when using .value_counts()?

.value_counts() gets rid of the rest of my data. How can I analyze my data without losing the rest of the information? Or is there another counting approach I can use that doesn't remove the rest of my columns of data?

Here's my code:

import pandas as pd

f1 = pd.read_csv('lastlogonuser.txt', sep='\t', encoding='latin1')
f2 = pd.read_csv('UserAccounts.csv', sep=',', encoding='latin1')
f2 = f2.rename(columns={'Shortname': 'User Name'})
f = pd.concat([f1, f2])
counts = f['User Name'].value_counts()
f = counts[counts == 1]
f

I get something like this when I run my code:

sample534         1
sample987         1
sample342         1
sample321         1
sample123         1

I would like something like:

   User Name    Description                    CN Account
1  sample534    Journal Mailbox managed by         
1  sample987    Journal Mailbox managed by    
1  sample342    Journal Mailbox managed by   
1  sample321    Journal Mailbox managed by 
1  sample123    Journal Mailbox managed by 

Sample of data I am using:

Account User Name User CN                       Description
ENABLED MBJ29     CN=MBJ29,CN=Users             Journal Mailbox managed by  
ENABLED MBJ14     CN=MBJ14,CN=Users             Journal Mailbox managed by
ENABLED MBJ08     CN=MBJ30,CN=Users             Journal Mailbox managed by   
ENABLED MBJ07     CN=MBJ07,CN=Users             Journal Mailbox managed by 

Upvotes: 0

Views: 93

Answers (1)

root

Reputation: 33793

You can use DataFrame.duplicated to determine which rows are duplicates, and then filter using loc:

f = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]

The subset parameter restricts the duplicate check to the 'User Name' column, and keep=False flags every occurrence of a duplicated value rather than keeping the first or last. Since duplicated returns True for duplicates, I negate it with ~ so only the rows that appear exactly once survive, with all of their other columns intact.
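For example, here is a minimal sketch on a toy frame (the values are made up to mirror the question's data):

import pandas as pd

# Toy stand-in for the concatenated frame from the question
f = pd.DataFrame({
    'User Name': ['sample534', 'MBJ29', 'MBJ29', 'sample987'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

# keep=False marks every row whose 'User Name' occurs more than once;
# negating with ~ keeps only the unique rows, all columns included
unique_rows = f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
print(unique_rows)

This keeps the sample534 and sample987 rows along with their Description values, which is the output shape the question asks for.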

This appears to be much more efficient than groupby when tested on a fairly large DataFrame with a fair number of duplicates:

%timeit f.loc[~f.duplicated(subset=['User Name'], keep=False), :]
100 loops, best of 3: 17.4 ms per loop

%timeit f.groupby('User Name').filter(lambda x: len(x) == 1)
1 loop, best of 3: 6.78 s per loop
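If you would rather keep value_counts in the picture, one alternative sketch is to map the counts back onto the column and filter on them, which also preserves every column (the toy frame below is a hypothetical stand-in for the question's concatenated data, and I haven't timed this variant against duplicated):

import pandas as pd

# Hypothetical frame standing in for the question's concatenated data
f = pd.DataFrame({
    'User Name': ['sample534', 'MBJ29', 'MBJ29', 'sample987'],
    'Description': ['Journal Mailbox managed by'] * 4,
})

# Map each row's user name to its count, then keep rows seen exactly once
counts = f['User Name'].value_counts()
unique_rows = f[f['User Name'].map(counts) == 1]
print(unique_rows)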

Upvotes: 1
