Rive11

Reputation: 85

Find peaks and bottoms of graph and label them

      df1
      Date        Topic    Return
      1/1/2010    A,B      -0.308648967
      1/2/2010    C,D      -0.465862046
      1/3/2010    E         0.374052392
      1/4/2010    F         0.520312204
      1/5/2010    G         0.503889198
      1/6/2010    H        -1.730646788
      1/7/2010    L,M,N     1.756295613
      1/8/2010    K        -0.598990239
      ......
      1/30/2010   z         2,124355

      Plot = df1.plot(x='Date', y='Return')

How can I find the highest peaks and lowest troughs in this graph and label those points with their corresponding Topics?

Upvotes: 1

Views: 5592

Answers (2)

Hans Musgrave

Reputation: 7131

This depends a little on how you define "peak" and "trough". Often one cares about smoothed peaks and troughs to identify broad trends, especially in noisy data. If you want every fine-grained dip or rise in the data, though (and your rows are sorted by date), you can cheat a little with vectorized routines from NumPy.

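import numpy as np

# Differences between consecutive returns
d = np.diff(df['Return'])
# Positions where two neighbouring differences have opposite signs (or one is zero)
i = np.argwhere((d[:-1]*d[1:])<=0).flatten()
# Shift by 1 to get the row of the turning point itself
special_points = df['Topic'][i+1]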

The first line, np.diff(), subtracts each return value from the one that follows it, so d[k] = Return[k+1] - Return[k]. Depending a little on your definition of a local peak/trough, you only have a feature you're looking for where two consecutive differences have opposite signs. Consider the following peak.

[1, 5, 1]

If you compute the pairwise differences, you get a slightly shorter vector

[4, -4]

Note that these alternate in sign, so their product, -16, is negative. This is exactly the insight the code uses to identify peaks and troughs. Because np.diff() returns a vector one element shorter than its input, position i in the product corresponds to row i+1 of the data, so we shift the indices we find by 1 (in the df['Topic'][i+1] expression).
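To make the index shift concrete, here is a minimal sketch that applies the same idea to the eight rows shown in the question (the elided rows are left out). It uses np.flatnonzero and .iloc instead of np.argwhere and label indexing, which is a small deviation from the snippet above, and the `kind` column is an addition for illustration only.

```python
import numpy as np
import pandas as pd

# Only the rows visible in the question; the elided rows are omitted.
df = pd.DataFrame({
    "Date": ["1/1/2010", "1/2/2010", "1/3/2010", "1/4/2010",
             "1/5/2010", "1/6/2010", "1/7/2010", "1/8/2010"],
    "Topic": ["A,B", "C,D", "E", "F", "G", "H", "L,M,N", "K"],
    "Return": [-0.308648967, -0.465862046, 0.374052392, 0.520312204,
               0.503889198, -1.730646788, 1.756295613, -0.598990239],
})

d = np.diff(df["Return"].to_numpy())   # d[k] = Return[k+1] - Return[k]
products = d[:-1] * d[1:]              # negative where the slope changes sign
i = np.flatnonzero(products <= 0)      # positions in the shortened vector

# Rising before the turning point means a peak, falling means a trough.
# (Assumes no zero differences, which holds for this sample.)
kind = np.where(d[i] > 0, "peak", "trough")

# Shift by 1 to address the original rows; .iloc works for any index.
turning = df.iloc[i + 1].assign(kind=kind)
print(turning[["Date", "Topic", "Return", "kind"]])
```

On this sample the turning points are the rows with Topics C,D and H (troughs) and F and L,M,N (peaks).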

Caveats: Note that we use <= instead of a strict inequality. This handles peaks that span several equal values. Consider [1, 2, 2, 2, 2, 2, 1]. Arguably, the run of 2's represents a peak and should be captured. If that isn't desirable, make the inequality strict.
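A short sketch of how the two inequalities behave on that plateau; the variable names `loose` and `strict` are only for illustration.

```python
import numpy as np

x = np.array([1, 2, 2, 2, 2, 2, 1])
d = np.diff(x)             # one rise, four zeros, one fall
prod = d[:-1] * d[1:]      # every product involves a zero difference

loose = np.flatnonzero(prod <= 0) + 1    # non-strict test flags the whole plateau
strict = np.flatnonzero(prod < 0) + 1    # strict test flags nothing

print(loose.tolist())   # [1, 2, 3, 4, 5]
print(strict.tolist())  # []
```

With the strict test the plateau disappears entirely, because no pair of neighbouring differences has a strictly negative product.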

Additionally, if you are interested in wider peaks like that, this algorithm still isn't fully correct. It's plenty fast, but in general it only computes a superset of the peaks/troughs. Consider the following:

[1, 2, 2, 3, 2, 1]

Arguably, the number 3 is the only peak in that dataset (depends a bit on your definitions of course), but our algorithm will also pick up the first and second instances of the number 2 due to their being on a shelf (being identical to a neighbor).
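The sketch below reproduces the shelf problem and shows one possible way around it: collapse each run of equal values to its first sample, then apply the strict test. The helper `turning_points_collapsed` is hypothetical and not part of NumPy or of the answer above; it reports the first index of a plateau rather than the whole run, which may or may not match your definition.

```python
import numpy as np

def turning_points_collapsed(x):
    """Hypothetical helper: turning-point indices after collapsing equal runs."""
    x = np.asarray(x)
    # Index of the first sample of every run of equal values.
    keep = np.flatnonzero(np.r_[True, x[1:] != x[:-1]])
    d = np.diff(x[keep])                        # no zero differences remain
    j = np.flatnonzero(d[:-1] * d[1:] < 0) + 1  # strict sign change
    return keep[j]                              # map back to original positions

x = np.array([1, 2, 2, 3, 2, 1])
d = np.diff(x)
naive = np.flatnonzero(d[:-1] * d[1:] <= 0) + 1

print(naive.tolist())                        # [1, 2, 3]  (shelf included)
print(turning_points_collapsed(x).tolist())  # [3]
```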

Extras: The scipy.signal module has a variety of peak-finding algorithms which may be better suited depending on any extra requirements you have for your peaks. Modifying this solution is unlikely to be as fast or clean as using an appropriate built-in signal-processing routine. A call to scipy.signal.find_peaks() can basically replicate everything we've done here (it returns local maxima, so pass it the negated data to get troughs), and it has more options if you need them. Other algorithms like scipy.signal.find_peaks_cwt() might be more appropriate if you need any kind of smoothing or more complicated operations.
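As a rough sketch of the scipy.signal.find_peaks() route, assuming SciPy is installed: the sample array and the prominence threshold of 3 are arbitrary choices for illustration, not values from the question.

```python
import numpy as np
from scipy.signal import find_peaks

returns = np.array([1, 2, 3, 2, 1, 2, 4, 7, 1, 3])

# find_peaks returns (indices, properties); it only looks for local maxima
# and never reports the first or last sample.
peaks, _ = find_peaks(returns)
troughs, _ = find_peaks(-returns)   # negate the data to find minima

# Keep only peaks that stand out from their surroundings by at least 3.
big_peaks, _ = find_peaks(returns, prominence=3)

# A shelf is not reported as a peak.
shelf_peaks, _ = find_peaks(np.array([1, 2, 2, 3, 2, 1]))

print(peaks.tolist(), troughs.tolist(), big_peaks.tolist(), shelf_peaks.tolist())
```

The returned index arrays can be passed to df.iloc[...] to look up the matching Topic values.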

Upvotes: 4

Deepak Saini

Reputation: 2910

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Build some example data
data = {
    "Date": ["date{i}".format(i=i) for i in range(10)],
    "Topic": ["topic{i}".format(i=i) for i in range(10)],
    "Return": [1, 2, 3, 2, 1, 2, 4, 7, 1, 3],
}
df = pd.DataFrame.from_dict(data)

dates = np.array(df["Date"].tolist())
returns = np.array(df["Return"].tolist())

# The sign of the slope flips from - to + at a minimum and from + to - at a maximum
slope_change = np.diff(np.sign(np.diff(returns)))
minimas = (slope_change > 0).nonzero()[0] + 1
maximas = (slope_change < 0).nonzero()[0] + 1

# Plot the entire data first
plt.plot(dates, returns)
# Then mark the minima and maxima, using each point's Topic as its legend label
for minima in minimas:
    plt.plot(df.iloc[minima]["Date"], df.iloc[minima]["Return"], marker="o", label=str(df.iloc[minima]["Topic"]))
for maxima in maximas:
    plt.plot(df.iloc[maxima]["Date"], df.iloc[maxima]["Return"], marker="o", label=str(df.iloc[maxima]["Topic"]))

plt.legend()
plt.show()

Example dataframe:

    Date   Topic  Return
0  date0  topic0       1
1  date1  topic1       2
2  date2  topic2       3
3  date3  topic3       2
4  date4  topic4       1
5  date5  topic5       2
6  date6  topic6       4
7  date7  topic7       7
8  date8  topic8       1
9  date9  topic9       3

Plot it produces: [image not included]
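The code above puts the Topics in the legend. If you would rather write each Topic next to its point, as the question asks, something like the following sketch may work. It plots against integer positions and sets the dates as tick labels (an assumption made here to keep the annotation coordinates simple), and the 8-point text offset is an arbitrary choice.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({
    "Date": ["date{}".format(i) for i in range(10)],
    "Topic": ["topic{}".format(i) for i in range(10)],
    "Return": [1, 2, 3, 2, 1, 2, 4, 7, 1, 3],
})

returns = df["Return"].to_numpy()
positions = np.arange(len(df))

# Nonzero wherever the sign of the slope changes: minima and maxima together.
slope_change = np.diff(np.sign(np.diff(returns)))
extrema = np.flatnonzero(slope_change != 0) + 1

fig, ax = plt.subplots()
ax.plot(positions, returns)
ax.scatter(positions[extrema], returns[extrema], color="red", zorder=3)
ax.set_xticks(positions)
ax.set_xticklabels(df["Date"])

# Write each Topic slightly above its turning point.
for pos in extrema:
    ax.annotate(df["Topic"].iloc[pos],
                xy=(pos, returns[pos]),
                xytext=(0, 8), textcoords="offset points", ha="center")

# Call plt.show() to display the figure.
```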

Upvotes: 3
