Reputation: 129
This problem is about using scipy.signal.find_peaks for efficiently extracting the mean peak height from data files. I am a beginner with Python (3.7), so I am not sure whether I have written my code in the most optimal way with regard to speed and code quality.
I have a set of measurement files containing one million data points (30 MB) each. The graph of this data is a signal with peaks at regular intervals, plus noise. The baseline and the amplitude of the signal also vary across parts of the signal. I attached an image of an example; the signal can be much less clean than this.
My goal is to calculate the mean height of the peaks in each file. To do this, I first use find_peaks to locate all the peaks. Then I loop over each peak location and re-detect the peak in a small interval around it, to make sure I get the local height of the peak. I then put all these heights in NumPy arrays and calculate their mean and standard deviation afterwards.
Here is a bare-bones version of my code. It is a bit long, but I think that might also be because I am doing something wrong.
import numpy as np
from scipy.signal import find_peaks

# file_list (paths of the measurement files) and prom (the prominence
# threshold) are defined elsewhere in my application

# Allocate empty lists for values
mean_heights = []
std_heights = []
mean_baselines = []
std_baselines = []
temperatures = []

# Loop over several files, read them in and process the data
for file in file_list:
    temperatures.append(file)
    # Load in data from a file of 30 MB
    t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
    # Find all peaks in this file
    peaks, peak_properties = find_peaks(x_dat, prominence=prom, width=0)
    # Calculate the window size, make sure it is even
    if round(len(t_dat) / len(peaks)) % 2 == 0:
        n_points = int(len(t_dat) / len(peaks))
    else:
        n_points = int(len(t_dat) / len(peaks) + 1)
    t_slice = t_dat[-1] / len(t_dat)
    # Allocate np arrays for storing the heights
    baseline_list = np.zeros(len(peaks) - 2)
    height_list = np.zeros(len(peaks) - 2)
    # Loop over all found peaks, and re-detect each peak in a window around
    # it, to detect its local height without triggering on a baseline far away
    for i in range(len(peaks) - 2):
        # Make a window around a peak
        sub_t = t_dat[peaks[i + 1] - n_points // 2: peaks[i + 1] + n_points // 2]
        sub_x = x_dat[peaks[i + 1] - n_points // 2: peaks[i + 1] + n_points // 2]
        # Detect the peaks (2 versions, specific to the application I have)
        h_min = max(sub_x) - np.mean(sub_x)
        _, baseline_props = find_peaks(
            sub_x, prominence=h_min, distance=n_points - 1, width=0)
        _, height_props = find_peaks(np.append(
            min(sub_x) - 1, sub_x), prominence=h_min, distance=n_points - 1, width=0)
        # Add the heights to the np arrays storing the heights
        baseline_list[i] = baseline_props["prominences"]
        height_list[i] = height_props["prominences"]
    # Fill the lists, taking the mean and stdev of the np arrays with the heights
    mean_heights.append(np.mean(height_list))
    std_heights.append(np.std(height_list))
    mean_baselines.append(np.mean(baseline_list))
    std_baselines.append(np.std(baseline_list))
It takes ~30 s to execute. Is this normal or too slow? If so, can it be optimised?
Upvotes: 2
Views: 2300
Reputation: 129
In the meantime I have improved the speed by getting rid of various inefficiencies that I found using the Python profiler. I will list the optimisations here, ordered by their impact on speed:
Using pandas' pd.read_csv() for I/O instead of np.loadtxt() cut off about 90% of the runtime. As also mentioned here, this saves a lot of time. This means changing this:
t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
to this:
data = pd.read_csv(file, delimiter="\t", names=["t_dat", "x_dat"])  # requires: import pandas as pd
t_dat = data.values[:, 0]
x_dat = data.values[:, 1]
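A small follow-up that I have not timed separately: since the columns get names in read_csv(), they can also be selected by name and converted with .to_numpy(), which reads more clearly than positional indexing into .values:

import pandas as pd

# Same read as above; select the columns by name instead of by position
data = pd.read_csv(file, delimiter="\t", names=["t_dat", "x_dat"])
t_dat = data["t_dat"].to_numpy()
x_dat = data["x_dat"].to_numpy()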
Removing redundant len() calls. I noticed that len() was called many times, and then noticed that this happened unnecessarily. Changing this:
if round(len(t_dat) / len(peaks)) % 2 == 0:
    n_points = int(len(t_dat) / len(peaks))
else:
    n_points = int(len(t_dat) / len(peaks) + 1)
to this:
n_points = round(len(t_dat) / len(peaks))
if n_points % 2 != 0:
    n_points += 1
also proved to be a significant improvement.
Changing the built-in max(sub_x) to the NumPy method sub_x.max() gave a further small speed-up. These are all unrelated optimisations that I still think might be useful for a Python beginner like myself, and they do help a lot.
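One more idea that I have not benchmarked, so treat it as a sketch rather than a verified optimisation: scipy.signal.peak_prominences accepts a wlen argument that limits the prominence calculation to a window around each peak. If that notion of local prominence is close enough to the heights from my windowed loop, the whole inner loop could collapse into a single vectorised call:

import numpy as np
from scipy.signal import find_peaks, peak_prominences

# x_dat, prom and n_points as defined in the question's code
peaks, _ = find_peaks(x_dat, prominence=prom, width=0)

# Local prominence of every peak, computed within a window of roughly
# one period (n_points samples) around each peak
local_prom, _, _ = peak_prominences(x_dat, peaks, wlen=n_points)

mean_height = np.mean(local_prom)
std_height = np.std(local_prom)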
Upvotes: 1