joasa

Reputation: 976

Pandas: Find frequency of occurrences in a DF

I have a dataframe with two columns, Time and Pressure, and around 3000 rows, like this:

time  value
0    393
1    389
2    402
3    408
4    413
5    463
6    471
7    488
8    422
9    404
10   370

I want to find 1) the most frequent pressure value and 2) how many time steps pass between its occurrences. This is my code so far:

import numpy as np
import pandas as pd
from matplotlib.pylab import *
import re
from pylab import *
import datetime
from scipy import stats

pd.set_option('display.max_rows', 5000)

df = pd.read_csv('copy.csv')
row = next(df.iterrows())[0]
dataset = np.loadtxt(df, delimiter=";")

df.columns = ["LTimestamp", "LPressure"]
list(df.columns.values)

## Timestep
df = pd.DataFrame({'timestep': df.LTimestamp, 'value': df.LPressure})
df['timestep'] = pd.to_datetime(df['timestep'], unit='ms').dt.time
# print(df)

## Find most seen value in pressure
count = df['value'].value_counts().sort_values(ascending=[False]).nlargest(1).values[0]
print (count)

## Mask the df by comparing the column against the most seen value.
print(df[df['value'] == count])

## Find interval differences
x = df.loc[df['value'] == count, 'timestep'].diff() 
print(x)

The output is this, where 101 is the number of times the most frequent value (400) occurs.

>>> 101
>>> Empty DataFrame
>>> Columns: [timestep, value]
>>> Index: []
>>> Series([], Name: timestep, dtype: object)
>>> [Finished in 1.7s]

I don't understand why the mask returns an empty DataFrame. If instead of

print(df[df['value'] == count])

I use

print(df[df['value'] == 400])

I get the masked df and the interval differences, like this:

50        1.0
112      62.0
215     103.0
265      50.0
276      11.0
277       1.0
278       1.0
318      40.0
366      48.0
367       1.0

But later on I will want to calculate this for the least frequent value, the second most frequent, and so on. This is why I want to use count and not a specific number. Can someone help with this?

Upvotes: 2

Views: 2418

Answers (2)

hilberts_drinking_problem

Reputation: 11602

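A more general solution is to give each value in df its frequency rank: 0 for the most frequent value, 1 for the second most frequent, and so on.

import numpy as np
import pandas as pd

# Toy data: 20 time steps with a repeating pattern of values
df = pd.DataFrame({
    'time': np.arange(20)
    })

df['value'] = df.time ** 2 % 7

# value_counts() sorts by count (descending), so a value's position
# in its index is its frequency rank: 0 = most frequent
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}

df['freq_rank'] = df.value.apply(vcs.get)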
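
Once a frequency-rank column exists, picking the rows for any rank and measuring the gaps between them could look roughly like the sketch below. It rebuilds the same toy data so it runs on its own; `time_gaps_for_rank` is a hypothetical helper name, not part of pandas. Note that when two values occur equally often, their relative rank is whatever order `value_counts()` happens to return.

```python
import numpy as np
import pandas as pd

# Same toy data as above
df = pd.DataFrame({'time': np.arange(20)})
df['value'] = df.time ** 2 % 7

# Position in the value_counts() index = frequency rank (0 = most frequent)
vcs = {v: i for i, v in enumerate(df.value.value_counts().index)}
df['freq_rank'] = df.value.map(vcs)


def time_gaps_for_rank(frame, rank):
    """Hypothetical helper: gaps in 'time' between successive rows
    whose value has the given frequency rank."""
    times = frame.loc[frame['freq_rank'] == rank, 'time']
    # diff() leaves NaN on the first row, which has no predecessor
    return times.diff().dropna()


# In this toy data the value 0 is the least frequent (rank 3)
gaps_least_frequent = time_gaps_for_rank(df, 3)
print(gaps_least_frequent)
```

Changing the `rank` argument gives the same calculation for the most frequent value (`0`), the second most frequent (`1`), and so on.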

Upvotes: 2

roman

Reputation: 117345

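value_counts() returns a Series whose index holds the distinct values and whose values hold their counts, so .values[0] gives you the count (101), not the pressure value itself. I'd suggest using .index[0] instead: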

>>> val = df['value'].value_counts().nlargest(1).index[0]
>>> df[df['value'] == val]
   time  value
2     2    402
3     3    402
7     7    402
8     8    402
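
To extend this to the second most frequent or the least frequent value, one option is to index into the `value_counts()` result by position. The following is a minimal sketch on made-up data (not the asker's file); `nth_most_frequent` is a hypothetical helper name introduced here for illustration.

```python
import pandas as pd

# Illustrative data only
df = pd.DataFrame({
    'time':  [0, 1, 2, 3, 4, 5, 6, 7, 8, 9],
    'value': [400, 402, 400, 389, 400, 402, 400, 389, 402, 410],
})

# index -> distinct values, values -> how often each occurs (descending)
counts = df['value'].value_counts()


def nth_most_frequent(series, n=0):
    """Hypothetical helper: n=0 is the most frequent value,
    n=1 the second most frequent, n=-1 the least frequent."""
    return series.value_counts().index[n]


most_common = nth_most_frequent(df['value'], 0)
second_most_common = nth_most_frequent(df['value'], 1)
least_common = nth_most_frequent(df['value'], -1)

# Time steps between successive occurrences of the most frequent value
gaps = df.loc[df['value'] == most_common, 'time'].diff().dropna()
print(most_common, counts.iloc[0])
print(gaps)
```

Note that `.diff()` needs a numeric or datetime-like column; here `time` is a plain integer step counter.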

Upvotes: 3
