Reputation: 976
I have a DataFrame with two columns, time and pressure, and around 3000 rows, like this:
time value
0 393
1 389
2 402
3 408
4 413
5 463
6 471
7 488
8 422
9 404
10 370
I want to find 1) the most frequent value of pressure and 2) after how many time-steps we see this value. My code is this so far:
import numpy as np
import pandas as pd
from matplotlib.pylab import *
import re
from pylab import *
import datetime
from scipy import stats
pd.set_option('display.max_rows', 5000)
df = pd.read_csv('copy.csv')
row = next(df.iterrows())[0]
dataset = np.loadtxt(df, delimiter=";")
df.columns = ["LTimestamp", "LPressure"]
list(df.columns.values)
## Timestep
df = pd.DataFrame({'timestep': df.LTimestamp, 'value': df.LPressure})
df['timestep'] = pd.to_datetime(df['timestep'], unit='ms').dt.time
# print(df)
## Find most seen value in pressure
count = df['value'].value_counts().sort_values(ascending=[False]).nlargest(1).values[0]
print (count)
## Mask the df by comparing the column against the most seen value.
print(df[df['value'] == count])
## Find interval differences
x = df.loc[df['value'] == count, 'timestep'].diff()
print(x)
The output is this, where 101 is the number of times the most frequent value (400) occurs.
>>> 101
>>> Empty DataFrame
>>> Columns: [timestep, value]
>>> Index: []
>>> Series([], Name: timestep, dtype: object)
>>> [Finished in 1.7s]
I don't understand why it returns an empty DataFrame. If instead of
print(df[df['value'] == count])
I use
print(df[df['value'] == 400])
I can see the masked df with the interval differences, as here:
50 1.0
112 62.0
215 103.0
265 50.0
276 11.0
277 1.0
278 1.0
318 40.0
366 48.0
367 1.0
Later on, though, I will want to calculate this for the minimum values, the second largest, and so on. This is why I want to use count and not a specific number. Can someone help with this?
Upvotes: 2
Views: 2418
Reputation: 11602
A more general solution is to assign a rank of the frequency to each value in df.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'time': np.arange(20)
})
df['value'] = df.time ** 2 % 7
# value_counts() sorts by descending count, so rank 0 is the most frequent value
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}
df['freq_rank'] = df.value.apply(vcs.get)
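Once every row carries a frequency rank, picking the most frequent, second most frequent, or least frequent value is just a filter on that column. The following is a minimal sketch on the same toy data; the helper name rows_with_rank is made up for illustration, and tied values simply get distinct ranks in whatever order value_counts() returns them.

```python
import numpy as np
import pandas as pd

# Same toy data as above: 'value' is time squared modulo 7.
df = pd.DataFrame({'time': np.arange(20)})
df['value'] = df.time ** 2 % 7

# Rank 0 is the most frequent value, the highest rank is the least frequent.
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}
df['freq_rank'] = df.value.map(vcs)


def rows_with_rank(frame, rank):
    """Hypothetical helper: rows whose value has the given frequency rank."""
    return frame[frame['freq_rank'] == rank]


# Rows holding the least frequent value, and the time gaps between them.
least_common = rows_with_rank(df, df['freq_rank'].max())
gaps = least_common['time'].diff()
print(least_common)
print(gaps)
```

The first entry of gaps is NaN because the first occurrence has no predecessor; drop it with gaps.dropna() if it gets in the way.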
Upvotes: 2
Reputation: 117345
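value_counts() returns a Series whose index holds the distinct values and whose values hold the counts, so .values[0] gives you the count (101), not the pressure value (400). I'd suggest using .index[0] instead: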
>>> val = df['value'].value_counts().nlargest(1).index[0]
>>> df[df['value'] == val]
time value
2 2 402
3 3 402
7 7 402
8 8 402
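The same idea extends to the second most frequent or the least frequent value by indexing further into the value_counts() index. Here is a minimal sketch on made-up data (the eight sample rows and the helper name nth_most_frequent are assumptions for illustration, not the asker's file):

```python
import pandas as pd

# Small made-up sample: 400 occurs 4 times, 402 three times, 389 once.
df = pd.DataFrame({
    'time': range(8),
    'value': [400, 402, 400, 389, 400, 402, 400, 402],
})

# Index = distinct values, values = how often each occurs (descending).
counts = df['value'].value_counts()


def nth_most_frequent(counts, n=0):
    """Hypothetical helper: the value at position n of the frequency ordering."""
    return counts.index[n]


val = nth_most_frequent(counts, 0)      # most frequent value
second = nth_most_frequent(counts, 1)   # second most frequent value
rarest = counts.index[-1]               # least frequent value

# Time steps between consecutive occurrences of the most frequent value.
gaps = df.loc[df['value'] == val, 'time'].diff()
print(val, counts.iloc[0])
print(gaps)
```

If several values share the same count, their relative order in counts is not something to rely on, so treat the result for tied positions with care.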
Upvotes: 3