Reputation: 85
df1
Date Topic Return
1/1/2010 A,B -0.308648967
1/2/2010 C,D -0.465862046
1/3/2010 E 0.374052392
1/4/2010 F 0.520312204
1/5/2010 G 0.503889198
1/6/2010 H -1.730646788
1/7/2010 L,M,N 1.756295613
1/8/2010 K -0.598990239
......
1/30/2010 z 2.124355
df1.plot(x='Date', y='Return')
How can I find the highest peaks and lowest troughs on this graph and label these special points with their corresponding Topics?
Upvotes: 1
Views: 5592
Reputation: 7131
This depends a little on your definitions of "peak" and "trough". Often, a person cares about smoothed peaks and troughs that identify broad trends, especially in the presence of noise. If you instead want every fine-grained dip or rise in the data (and your rows are sorted by date), you can cheat a little with vectorized routines from NumPy.
import numpy as np

# Pairwise differences between consecutive returns
d = np.diff(df['Return'])
# A sign change between adjacent differences marks a turning point
i = np.argwhere((d[:-1] * d[1:]) <= 0).flatten()
# Shift by 1 because np.diff() shortens the array by one element
special_points = df['Topic'][i + 1]
The first line, np.diff(), compares each return value to the next one; in particular, it subtracts them. Depending a little on your definition of a local peak/trough, you only have a feature you're looking for where these pairwise differences alternate in sign. Consider the following peak.
[1, 5, 1]
If you compute the pairwise differences, you get a slightly shorter vector
[4, -4]
Note that these alternate in sign. Hence, if you multiply them you get -16, which must be negative. This is exactly the insight our code uses to identify the peaks and troughs. The dimension reduction from np.diff() offsets things by one, so we shift the indices we find by 1 (the df['Topic'][i + 1] expression).
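As a quick sanity check, here are those three lines run on a hypothetical five-row frame (column names as in the question; the values are made up):
```python
import numpy as np
import pandas as pd

# Hypothetical toy frame mirroring the question's columns
df = pd.DataFrame({
    "Topic": ["A", "B", "C", "D", "E"],
    "Return": [1, 5, 1, 3, 2],
})

d = np.diff(df["Return"])                         # [ 4, -4,  2, -1]
i = np.argwhere((d[:-1] * d[1:]) <= 0).flatten()  # sign change at every interior point
special_points = df["Topic"][i + 1]
print(list(special_points))  # ['B', 'C', 'D'] -- peak 5, trough 1, peak 3
```
Every interior row is a turning point here, so all three interior Topics come back.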
Caveats: Note that we use <= rather than a strict inequality. This handles peaks wider than a single point. Consider [1, 2, 2, 2, 2, 2, 1]. Arguably, the run of 2's represents a peak and should be captured. If that isn't desirable, make the inequality strict.
Additionally, if you're interested in wider peaks like that, this algorithm still isn't quite correct. It's plenty fast, but in general it only computes a superset of the peaks/troughs. Consider the following: [1, 2, 2, 3, 2, 1]. Arguably, the 3 is the only peak in that dataset (depending a bit on your definitions, of course), but our algorithm will also pick up the first and second instances of the 2 because they sit on a shelf (each is identical to a neighbor).
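To see the superset behaviour concretely, here are the same three lines applied directly to that array (a sketch on a plain NumPy array rather than the dataframe):
```python
import numpy as np

x = np.array([1, 2, 2, 3, 2, 1])
d = np.diff(x)                                    # [ 1,  0,  1, -1, -1]
i = np.argwhere((d[:-1] * d[1:]) <= 0).flatten()  # zero products also satisfy <=
print(x[i + 1].tolist())  # [2, 2, 3] -- both shelf 2's plus the true peak
```
The zero differences along the shelf make the products zero, so the <= test admits them alongside the genuine peak.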
Extras: The scipy.signal module has a variety of peak-finding algorithms which may be better suited, depending on any extra requirements you have for your peaks. Modifying this solution is unlikely to be as fast or as clean as using an appropriate built-in signal processor. A call to scipy.signal.find_peaks() can essentially replicate everything we've done here, and it has more options if you need them. Other algorithms like scipy.signal.find_peaks_cwt() might be more appropriate if you need smoothing or more complicated operations.
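For instance, a minimal sketch of find_peaks() on the shelf example above (assuming SciPy is installed; troughs come from negating the series):
```python
import numpy as np
from scipy.signal import find_peaks

returns = np.array([1, 2, 2, 3, 2, 1])
peaks, _ = find_peaks(returns)     # indices of local maxima; plateaus are handled
troughs, _ = find_peaks(-returns)  # negate the series to find local minima
print(peaks.tolist())    # [3] -- only the 3; the shelf 2's are not reported
print(troughs.tolist())  # []  -- endpoints are never counted as troughs
```
Unlike the diff-based trick, find_peaks() resolves plateaus for you, and keyword arguments such as height and prominence let you filter out minor wiggles.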
Upvotes: 4
Reputation: 2910
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Take an example dataframe in the same shape as the question's
data = {
    "Date": ["date{i}".format(i=i) for i in range(10)],
    "Topic": ["topic{i}".format(i=i) for i in range(10)],
    "Return": [1, 2, 3, 2, 1, 2, 4, 7, 1, 3],
}
df = pd.DataFrame.from_dict(data)

dates = np.array(df["Date"].tolist())
returns = np.array(df["Return"].tolist())

# Calculate the minimas and the maximas: a sign change in the
# first difference marks a turning point
minimas = (np.diff(np.sign(np.diff(returns))) > 0).nonzero()[0] + 1
maximas = (np.diff(np.sign(np.diff(returns))) < 0).nonzero()[0] + 1
# Plot the entire data first
plt.plot(dates, returns)

# Then mark the minimas and the maximas, labelled with their Topics
for minima in minimas:
    plt.plot(df.iloc[minima]["Date"], df.iloc[minima]["Return"],
             marker="o", label=str(df.iloc[minima]["Topic"]))
for maxima in maximas:
    plt.plot(df.iloc[maxima]["Date"], df.iloc[maxima]["Return"],
             marker="o", label=str(df.iloc[maxima]["Topic"]))
plt.legend()
plt.show()
Example dataframe:
Date Topic Return
0 date0 topic0 1
1 date1 topic1 2
2 date2 topic2 3
3 date3 topic3 2
4 date4 topic4 1
5 date5 topic5 2
6 date6 topic6 4
7 date7 topic7 7
8 date8 topic8 1
9 date9 topic9 3
Upvotes: 3