Reputation: 653
I have a dataframe with the following column:
file['DirViento']
Fecha
2011-01-01 ENE
2011-01-02 ENE
2011-01-03 ENE
2011-01-04 NNE
2011-01-05 ENE
2011-01-06 ENE
2011-01-07 ENE
2011-01-08 ENE
2011-01-09 NNE
2011-01-10 ENE
2011-01-11 ENE
2011-01-12 ENE
2011-01-13 ESE
2011-01-14 ENE
2011-01-15 ENE
...
2011-12-17 ENE
2011-12-18 ENE
2011-12-19 ENE
2011-12-20 ENE
2011-12-21 ENE
2011-12-22 ENE
2011-12-23 ENE
2011-12-24 ENE
2011-12-25 ENE
2011-12-26 ESE
2011-12-27 ENE
2011-12-28 NE
2011-12-29 ENE
2011-12-30 NNE
2011-12-31 ENE
Name: DirViento, Length: 290, dtype: object
The column has daily records of wind direction for each month of the year. I'm trying to get the dominant direction for each month. To accomplish this, select the data most often repeated during the month:
file['DirViento'].groupby(lambda x: x.month).value_counts()
1 ENE 23
NNE 6
E 1
ESE 1
2 ENE 21
NNO 3
NNE 2
NE 1
3 ENE 21
OSO 1
ESE 1
SSE 1
4 ENE 21
NNE 2
ESE 1
NNO 1
6 ENE 15
ESE 2
SSE 2
ONO 1
E 1
7 ENE 22
ONO 1
OSO 1
NE 1
NNE 1
NNO 1
8 ENE 23
NNE 5
NE 1
ONO 1
ESE 1
9 ENE 17
NNE 7
ONO 2
NE 1
E 1
ESE 1
NNO 1
10 ENE 16
NNE 2
ESE 2
NNO 2
ONO 1
NE 1
E 1
11 ENE 13
NNE 2
ESE 2
ONO 1
12 ENE 26
NNE 3
NE 1
ESE 1
Length: 54, dtype: int64
When running the following line of code
wind_moda=file['DirViento'].groupby(lambda x: x.month).agg(lambda x: stats.mode(x)[0][0])
Should get something like this
1 ENE
2 ENE
3 ENE
4 ENE
6 ENE
7 ENE
8 ENE
9 ENE
10 ENE
11 ENE
12 ENE
But I get the following:
1 E
2 ENE
3 ENE
4 ENE
6 E
7 ENE
8 ENE
9 E
10 E
11 ENE
12 ENE
Why in 4 of the 12 months is not taking into account the most frequent data?
Am I doing something wrong ?
Any idea to get the most common data each month?
Upvotes: 8
Views: 59493
Reputation: 1
For most repeated value:
dataframe.mode()['ColumnName'][0]
For listing value of every data:
dataframe.groupby(['ColumnName']).size()
Upvotes: 0
Reputation: 131
For whole dataframe, you can use:
dataframe.mode()
For specific column:
dataframe.mode()['Column'][0]
Second case is more useful in imputing the values.
Upvotes: 5
Reputation: 1052
Pandas 0.15.2 has a DataFrame.mode()
method. It might be of use to someone looking for this as I was.
Here are the docs.
Edit: For the Value:
DataFrame.mode()[0]
Upvotes: 14
Reputation: 35235
This is not as straightforward as it could be (should be).
As you probably know, the statistics jargon for the most common value is the "mode." Numpy does not have a built-in function for this, but scipy does. Import it like so:
from scipy.stats.mstats import mode
It does more than simply return the most common value, as you can read about in the docs, so it's convenient to define a function that uses mode
to just get the most common value.
f = lambda x: mode(x, axis=None)[0]
And now, instead of value_counts()
, use apply(f)
. Here is an example:
In [20]: DataFrame([1,1,2,2,2,3], index=[1,1,1,2,2,2]).groupby(level=0).apply(f)
Out[20]:
1 1.0
2 2.0
dtype: object
Update: Scipy's mode
does not work with strings. For your string data, you'll need to define a more general mode function. This answer should do the trick.
Upvotes: 6