mand

Reputation: 129

Using find_peaks in a loop to find peak heights from data

This question is about using scipy.signal.find_peaks to efficiently extract the mean peak height from data files. I am a beginner with Python (3.7), so I am not sure whether I have written my code in the most optimal way with regard to speed and code quality.

I have a set of measurement files, each containing one million data points (30 MB). The data is a signal with peaks at regular intervals, plus noise, and both the baseline and the amplitude of the signal vary across parts of the signal. I attached an image of an example; the actual signal can be much less clean.

[Image: this is what the peaks might look like]

My goal is to calculate the mean height of the peaks in each file. To do this, I first use find_peaks to locate all the peaks. Then I loop over each peak location and re-detect the peak in a small window around it, to make sure I get the local height of the peak.

I then put all these heights in numpy arrays and calculate their mean and standard deviation afterwards.

Here is a barebones version of my code; it is a bit long, but I think that might also be because I am doing something wrong.

import numpy as np
from scipy.signal import find_peaks
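# Note: file_list (paths to the measurement files) and prom (the prominence
# threshold used in find_peaks below) are defined elsewhere in the full script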

# Allocate empty lists for values
mean_heights = []
std_heights = []
mean_baselines = []
std_baselines = []
temperatures = []

# Loop over several files, read them in and process data
for file in file_list:
    temperatures.append(file)
    # Load in data from a file of 30 MB
    t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
    # Find all peaks in this file
    peaks, peak_properties = find_peaks(x_dat, prominence=prom, width=0)

    # Calculate window size, make sure it is even
    if round(len(t_dat)/len(peaks)) % 2 == 0:
        n_points = len(t_dat) // len(peaks)
    else:
        n_points = len(t_dat) // len(peaks) + 1

    t_slice = t_dat[-1] / len(t_dat)

    # Allocate np arrays for storing heights
    baseline_list = np.zeros(len(peaks) - 2)
    height_list = np.zeros(len(peaks) - 2)

    # Loop over all found peaks, and re-detect the peak in a window around the peak to be able
    # to detect its local height without triggering to a baseline far away
    for i in range(len(peaks) - 2):
        # Make a window around the peak
        sub_t = t_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
        sub_x = x_dat[peaks[i+1] - n_points // 2: peaks[i+1] + n_points // 2]
        # Detect the peak (2 versions, specific to the application I have)
        h_min = max(sub_x) - np.mean(sub_x)
        _, baseline_props = find_peaks(
            sub_x, prominence=h_min, distance=n_points - 1, width=0)
        _, height_props = find_peaks(np.append(
            min(sub_x) - 1, sub_x), prominence=h_min, distance=n_points - 1, width=0)
        # Add the heights to the np arrays storing the heights
        baseline_list[i] = baseline_props["prominences"]
        height_list[i] = height_props["prominences"]

    # Fill lists with values, taking the stdev and mean of the np arrays with the heights
    mean_heights.append(np.mean(height_list))
    std_heights.append(np.std(height_list))
    mean_baselines.append(np.mean(baseline_list))
    std_baselines.append(np.std(baseline_list))

It takes ~30 s to execute. Is this normal or too slow? If so, can it be optimised?

Upvotes: 2

Views: 2300

Answers (1)

mand

Reputation: 129

In the meantime I have improved the speed by getting rid of various inefficiencies that I found using the Python profiler (a minimal profiling sketch is included at the end of this answer). I will list the optimisations here, ordered by their impact on speed:

  • Using pandas pd.read_csv() for I/O instead of np.loadtxt() cut off about 90% of the runtime. As also mentioned here, this saves a lot of time. This means changing this:

    t_dat, x_dat = np.loadtxt(file, delimiter='\t', unpack=True)
    

    to this:

    data = pd.read_csv(file, delimiter="\t", names=["t_dat", "x_dat"])
    t_dat = data.values[:, 0]
    x_dat = data.values[:, 1]
    
  • Removing redundant len() calls. I noticed that len() was called many times, and that this was happening unnecessarily. Changing this:

    if round(len(t_dat) / len(peaks)) % 2 == 0:
        n_points = int(len(t_dat) / len(peaks))
    else:
        n_points = int(len(t_dat) / len(peaks) + 1)
    

    to this:

    n_points = round(len(t_dat) / len(peaks))
    if n_points % 2 != 0:
        n_points += 1
    

also proved to be a significant improvement.

  • Lastly, a disproportionately large share of the computational time (about 20%) was spent in the built-in Python functions min(), max() and sum(). Since I was using numpy arrays already, switching to the numpy equivalents of these functions resulted in an 84% improvement on this part. This means, for example, changing max(sub_x) to sub_x.max(), as shown in the sketch after this list.
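
As a rough illustration of that last point, here is a minimal sketch of the substitutions (the random array is just a stand-in for one window of data; the actual speed-up depends on the window size):

import numpy as np

# Stand-in for one window of data around a peak
sub_x = np.random.default_rng(0).normal(size=1000)

# The built-in functions iterate over the array element by element in Python:
h_min = max(sub_x) - sum(sub_x) / len(sub_x)

# The numpy equivalents run in compiled code and avoid that overhead:
h_min = sub_x.max() - sub_x.mean()   # instead of max(sub_x) - np.mean(sub_x)
baseline = sub_x.min()               # instead of min(sub_x)
total = sub_x.sum()                  # instead of sum(sub_x)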

These are all unrelated optimisations, but I think they might still be useful for a Python beginner like myself, and they do help a lot.
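
For completeness, this is a minimal sketch of how a script like this can be profiled with the standard-library cProfile module; process_file and the file name are hypothetical placeholders for the per-file work shown above:

import cProfile
import pstats

def process_file(file):
    """Hypothetical stand-in for the per-file work in the loop above."""
    ...

# Collect timing statistics for one representative file and print the
# ten calls with the largest cumulative time
profiler = cProfile.Profile()
profiler.enable()
process_file("example_measurement.txt")   # placeholder file name
profiler.disable()
pstats.Stats(profiler).sort_stats("cumulative").print_stats(10)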

Upvotes: 1
