Reputation: 3041
I have a dataframe of the form below. In my final dataframe, I'd like to keep only the rows that have unique values per year.
                  Name         Org  Year
4  New York University  doclist[1]  2004
5       Babson College  doclist[2]  2008
6       Babson College  doclist[5]  2008
So ideally, my dataframe would look like this instead:
4  New York University  doclist[1]  2004
5       Babson College  doclist[2]  2008
Here is what I've done so far: I grouped by year with groupby, and I seem to be able to get the unique names per year. However, I am stuck because I lose all the other information, such as the "Org" column. Advice appreciated!
#how to get unique rows per year?
q = z.groupby(['Year'])
#print q.head()
#q.reset_index(level=0, drop=True)
q.Name.apply(lambda x: np.unique(x))
This gives the following output. How do I include the other columns and also remove the secondary index (e.g. 6, 68, 66, 72)?
Year
2008  6                             Babson College
      68    European Economic And Social Committee
      66                            European Union
      72          Ewing Marion Kauffman Foundation
Upvotes: 0
Views: 106
Reputation: 52256
If all you want to do is keep the first entry for each name, you can use drop_duplicates.
Note that this keeps the first entry according to how your data is currently sorted, so you may want to sort first if you want to keep a specific entry.
In [98]: q.drop_duplicates(subset='Name')
Out[98]:
                  Name         Org  Year
0  New York University  doclist[1]  2004
1       Babson College  doclist[2]  2008
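Since the question asks for uniqueness per year, a name that appears in two different years probably should be kept once for each year. A minimal sketch of that variant follows, assuming z is the original DataFrame from the question (not the groupby object). The sample data is rebuilt by hand from the three rows shown there, with the Org values stored as plain placeholder strings, and the name unique_per_year is only illustrative.

```python
import pandas as pd

# Sample data modeled on the question; the Org values are placeholder strings.
z = pd.DataFrame(
    {
        "Name": ["New York University", "Babson College", "Babson College"],
        "Org": ["doclist[1]", "doclist[2]", "doclist[5]"],
        "Year": [2004, 2008, 2008],
    },
    index=[4, 5, 6],
)

unique_per_year = (
    # Stable sort by year, so rows within a year keep their original order.
    z.sort_values("Year", kind="mergesort")
    # Keep the first row for each (Name, Year) pair; other columns are retained.
    .drop_duplicates(subset=["Name", "Year"], keep="first")
    # Replace the leftover original index labels with 0, 1, 2, ...
    .reset_index(drop=True)
)

print(unique_per_year)
```

Passing both columns in subset keeps the Org column intact, and reset_index(drop=True) discards the old row labels. To keep a different row per pair, sort on whatever column decides the preference before calling drop_duplicates, or pass keep="last".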
Upvotes: 1