enea19
enea19

Reputation: 123

Finding the longest linear section of non-linear plot in MATLAB

Apologies for the long post but this takes a bit to explain. I'm trying to make a script that finds the longest linear portion of a plot. Sample data is in a csv file here, it is stress and strain data for calculating the shear modulus of 3D printed samples. The code I have so far is the following:

x_data = [];
y_data = [];
x_data = Data(:,1);
y_data = Data(:,2);
plot(x_data,y_data);
grid on;

answer1 = questdlg('Would you like to load last attempt''s numbers?');
switch answer1
    case 'Yes'
        [sim_slopes,reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
    case 'No'
        disp('Take a look at the plot, find a range estimate, and press any button to continue');
        pause;

        prompt = {'Eliminate values ABOVE this x-value:','Eliminate values BELOW this x-value:','Size of divisions on x-axis:','Factor for similarity of slopes:'};
        dlg_title = 'Point elimination';
        num_lines = 1;
        defaultans = {'0','0','0','0.1'};
        if isempty(answer2) < 1
            defaultans = {answer2{1},answer2{2},answer2{3},answer2{4}};
        end
        answer2 = inputdlg(prompt,dlg_title,num_lines,defaultans);
        uv_of_x_range = str2num(answer2{1});
        lv_of_x_range = str2num(answer2{2});
        x_div_size = str2num(answer2{3});
        K = str2num(answer2{4});
        close all;

        iB = find(x_data > str2num(answer2{1}),1,'first');
        iS = find(x_data > str2num(answer2{2}),1,'first');

        new_x_data = x_data(iS:iB);
        new_y_data = y_data(iS:iB);

        [sim_slopes, reg_data] = regr_and_longest_part(new_x_data,new_y_data,str2num(answer2{3}),str2num(answer2{2}),K);
end

[longest_section0, Midx]= max(sim_slopes(:,4)-sim_slopes(:,3));
longest_section=1+longest_section0;

long_sec_x_data_start = x_div_size*(sim_slopes(Midx,3)-1)+lv_of_x_range;
long_sec_x_data_end = x_div_size*(sim_slopes(Midx,4)-1)+lv_of_x_range;

long_sec_x_data_start_idx=find(new_x_data >= long_sec_x_data_start,1,'first');
long_sec_x_data_end_idx=find(new_x_data >= long_sec_x_data_end,1,'first');

long_sec_x_data = new_x_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);
long_sec_y_data = new_y_data(long_sec_x_data_start_idx:long_sec_x_data_end_idx);

[b_long_sec, longes_section_reg_data] = robustfit(long_sec_x_data,long_sec_y_data);
plot(long_sec_x_data,b_long_sec(1)+b_long_sec(2)*long_sec_x_data,'LineWidth',3,'LineStyle',':','Color','k');

function [sim_slopes,reg_data] = regr_and_longest_part(x_points,y_points,x_div,lv,K)
reg_data = cell(1,3);

scatter(x_points,y_points,'.');
grid on;
hold on;

uv = lv+x_div;

ii=0;
while lv <= x_points(end)
    if uv > x_points(end)
        uv = x_points(end);
    end
    ii=ii+1;
    indices = find(x_points>lv & x_points<uv);
    temp_x_points = x_points((indices));
    temp_y_points = y_points((indices));
    if length(temp_x_points) <= 2
        break;
    end
    [b,stats] = robustfit(temp_x_points,temp_y_points);
    reg_data{ii,1} = b(1);
    reg_data{ii,2} = b(2);
    reg_data{ii,3} = length(indices);
    plot(temp_x_points,b(1)+b(2)*temp_x_points,'LineWidth',2);
    lv = lv+x_div;
    uv = lv+x_div;
end

sim_slopes = NaN(length(reg_data),4);
sim_slopes(1,:) = [reg_data{1,1},0,1,1];
idx=1;

for ii=2:length(reg_data)
    coff =sim_slopes(idx,1);
    if abs(reg_data{ii,1}-coff) <= K*coff
        C=zeros(ii-sim_slopes(idx,3)+1,1);
        for kk=sim_slopes(idx,3):ii
            C(kk)=reg_data{kk,1};
        end
        sim_slopes(idx,1)=mean(C);
        sim_slopes(idx,2)=std(C);
        sim_slopes(idx,4)=ii;
    else
        idx = idx + 1;
        sim_slopes(idx,1)=reg_data{ii,1};
        sim_slopes(idx,2)=0;
        sim_slopes(idx,3)=ii;
        sim_slopes(idx,4)=ii;
    end
end
end

Apologies for the code not being well optimized, I'm still relatively new to MATLAB. I did not use derivatives because my data is relatively noisy and derivation might have made it worse.

I've managed to get the get the code to find the longest straight part of the plot by splitting the data up into sections called x_div_size then performing a robustfit on each section, the results of which are written into reg_data. The code then runs through reg_data and finds which lines have the most similar slopes, determined by the K factor, by calculating the average of the slopes in a section of the plot and makes a note of it in sim_slopes. It then finds the longest interval with max(sim_slopes(:,4)-sim_slopes(:,3)) and performs a regression on it to give the final answer.

The problem is that it will only consider the first straight portion that it comes across. When the data is plotted, it has a few parts where it seems straightest:

Plot of raw data

As an example, when I run the script with answer2 = {'0.2','0','0.0038','0.3'} I get the following, where the black line is the straightest part found by the code:

Longest straight part

I have the following questions:

  1. It's clear that from about x = 0.04 to x = 0.2 there is a long straight part and I'm not sure why the script is not finding it. Playing around with different values the script always seems to pick the first longest straight part, ignoring subsequent ones.
  2. MATLAB complains that Warning: Iteration limit reached. because there are more than 50 regressions to perform. Is there a way to bypass this limit on robustfit?
  3. When generating sim_slopes there might be section of the plot whose slope is too different from the average of the previous slopes so it gets marked as the end of a long section. But that section sometimes is sandwiched between several other sections on either side which instead have similar slopes. How would it be possible to tell the script to ignore one wayward section and to continue as if it falls within the tolerance allowed by the K value?

Upvotes: 3

Views: 1272

Answers (2)

CKT
CKT

Reputation: 781

You can also try using the ischange function to detect changes in the intercept and slope of the data, and then extract the longest portion from that.

Using the sample data you provided, here is what I see from a basic attempt:

 >> T = readtable('Data.csv');
 >> T = rmmissing(T);  % Remove rows with NaN
 >> T = groupsummary(T,'Var1','mean'); % Average duplicate timestamps
 >> [tf,slopes,intercepts] = ischange(T.mean_Var2, 'linear', 'SamplePoints', T.Var1); % find changes
 >> plot(T.Var1, T.mean_Var2, T.Var1, slopes.*T.Var1 + intercepts)

which generates the plot

Example plot for ischange usage.

You should be able to extract the longest segment based on the indices given by find(tf).

You can also tune the parameters of ischange to get fewer or more segments. Adding the name-value pair 'MaxNumChanges' with a value of 4 or 5 produces more linear segments with a tighter fit to the curve, for example, which effectively removes the kink in the plot that you see.

Upvotes: 2

Cris Luengo
Cris Luengo

Reputation: 60494

Take a look at the Douglas-Peucker algorithm. If you think of your (x,y) values as the vertices of an (open) polygon, this algorithm will simplify it for you, such that the largest distance from the simplified polygon to the original is smaller than some threshold you can choose. The simplified polygon will be the set of straight lines. Find the two vertices that are furthest apart, and you're done.

MATLAB has an implementation in the Mapping Toolbox called reducem. You might also find an implementation on the File Exchange (but be careful, there is also really bad code on there). Or, you can roll your own, it's quite a simple algorithm.

Upvotes: 3

Related Questions