Pandas: Select balanced sample

Question

I have a data frame with 3000 companies covering five years.

Id     Company          Year       Value
0      1111111          2016         NaN
1      1111111          2015      3871.0
2      3333333          2016      3989.0
3      3333333          2015      3648.0
4      4444444          2016      5456.0
5      4444444          2015         NaN
6      2222222          2016         NaN
7      2222222          2015        10.0
8      5555555          2016      1515.0
9      5555555          2015      2654.0

I would like to select only the companies that have no NaN value in any year, so that the selection has data for every period and therefore the same number of companies per period.

What is the easiest way to do this?

The result should be:

Id     Company          Year       Value
2      3333333          2016      3989.0
3      3333333          2015      3648.0
8      5555555          2016      1515.0
9      5555555          2015      2654.0

Thanks

user2285236 · Accepted Answer

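groupby.count() returns the number of non-null values, so if you group by Company, the count for a company with complete data equals the number of years. Assuming there are no duplicate (Company, Year) rows, you can keep the rows whose company count matches the number of distinct years:

df.loc[df.groupby('Company')['Value'].transform('count') == df['Year'].nunique(), :]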
Out[259]: 
   Id  Company  Year   Value
2   2  3333333  2016  3989.0
3   3  3333333  2015  3648.0
8   8  5555555  2016  1515.0
9   9  5555555  2015  2654.0
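
For anyone who wants to reproduce this, here is a minimal self-contained sketch of the approach above, built from the sample rows in the question. It assumes every year of interest appears at least once in the Year column and that there are no duplicate (Company, Year) rows. The names n_years, counts, balanced and balanced_alt are only illustrative, and the groupby.filter variant is an alternative I am adding, not part of the original answer.

```python
import numpy as np
import pandas as pd

# Sample rows from the question (only two of the five years are shown there).
df = pd.DataFrame({
    "Id": list(range(10)),
    "Company": [1111111, 1111111, 3333333, 3333333, 4444444,
                4444444, 2222222, 2222222, 5555555, 5555555],
    "Year": [2016, 2015] * 5,
    "Value": [np.nan, 3871.0, 3989.0, 3648.0, 5456.0,
              np.nan, np.nan, 10.0, 1515.0, 2654.0],
})

# Number of periods each company must cover (assumption: every period
# of interest occurs at least once in the data).
n_years = df["Year"].nunique()

# Non-null Value count per company, broadcast back onto every row.
counts = df.groupby("Company")["Value"].transform("count")

# Keep only the rows of companies with a value in every year.
balanced = df.loc[counts == n_years]

# Alternative sketch: drop whole groups with groupby.filter.
balanced_alt = df.groupby("Company").filter(
    lambda g: g["Value"].notna().sum() == n_years
)

print(balanced)
```

The transform version is usually faster on large frames because it avoids calling a Python function per group; the filter version may read more naturally if the condition gets more complex.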
