nck

Reputation: 2221

How can I select n samples out of m, spread as evenly as possible within a range?

Let's say I want to keep n backups out of a set of m backups made in a month of d days.

For example: I have 30 days (d=30), I made a backup every day, so m=30, and I want to keep 3 of them, spread as evenly as possible (n=3).

So my input with these parameters would be, for January 2022:

20220101
20220102
20220103
20220104
20220105
20220106
20220107
20220108
20220109
20220110
20220111
20220112
20220113
20220114
20220115
20220116
20220117
20220118
20220119
20220120
20220121
20220122
20220123
20220124
20220125
20220126
20220127
20220128
20220129
20220130

And the output for this scenario:

20220101
20220115
20220130

But I want this to be set dynamically through n, m and d, because I may have fewer than 30 backups per month, and they may not be evenly spread (for example only from the 1st to the 15th, or one every 3 days). I still want to keep 3 (or n) of them, spread as evenly as possible.

I have been thinking about this for a while and I think the algorithm should be something simple, but I still haven't been able to work it out. I would like to do it in bash or perl, but just getting the algorithm would be more than enough help.

Upvotes: 0

Views: 70

Answers (1)

James Brown

Reputation: 37424

Using awk. Try setting -v n=3 to other values:

$ awk -v n=3 '
{
    a[NR]=$0                           # store the dates in an array, in input order
}                                      # NR ends up as the number of values, m
END {                                  # after all values are stored
    print a[1]                         # always print the first value
    if(n<2) exit                       # n=1: avoid dividing by zero below
    for(i=1;i<=(n-1);i++) {            # pick the remaining n-1 values
        x=NR*i/(n-1)                   # ideal, possibly fractional, position
        y=int(x)
        print a[(y<x)?y+1:y]           # round up: ceil(x)
        # print a[int(x)]              # old output, rounded down
    }
}' file                                # use <(sort file) if the list is unordered

Output:

20220101
20220115
20220130

Updated: The bare int() in print a[int()] was not enough, so I replaced it with a ceil()-like implementation (an improvisation of this solution). It now gives a better result for @GerardH.Pille's sample in the comments (thanks for pointing it out):

20220101
20220103  # this was 20220102 previously
20220131

However, this small solution looks only at the positions of the lines, not at the dates themselves. It therefore relies on the values being evenly spread, and it can't always produce the optimal output when they are not.
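
To work from the dates rather than from the line positions, one option is to place n evenly spaced target days between the first and the last backup and keep the backup nearest to each target. The following is only a rough sketch of that idea, not a tested replacement for the script above: the names pick_spread and jdn are made up for this example, and it assumes the input is sorted, contains no duplicates, and uses the YYYYMMDD format from the question. It does not guarantee an optimal selection either, since each pick is made greedily.

```shell
# Hypothetical helper: pick_spread N < sorted_dates
pick_spread() {
    awk -v n="$1" '
    # Julian day number, so that dates can be subtracted across months
    function jdn(y, m, d,    a) {
        a = int((14 - m) / 12); y = y + 4800 - a; m = m + 12 * a - 3
        return d + int((153 * m + 2) / 5) + 365 * y \
               + int(y / 4) - int(y / 100) + int(y / 400) - 32045
    }
    function abs(v) { return v < 0 ? -v : v }
    {
        date[NR] = $0
        t[NR] = jdn(substr($0, 1, 4) + 0, substr($0, 5, 2) + 0, substr($0, 7, 2) + 0)
    }
    END {
        m = NR
        if (n >= m) { for (i = 1; i <= m; i++) print date[i]; exit }  # keep everything
        if (n < 2)  { if (n == 1) print date[1]; exit }               # no range to divide
        j = 0
        for (k = 1; k <= n; k++) {
            target = t[1] + (t[m] - t[1]) * (k - 1) / (n - 1)  # k-th evenly spaced day
            j++                                # never pick the same backup twice
            last = m - (n - k)                 # leave enough backups for later picks
            while (j < last && abs(t[j + 1] - target) < abs(t[j] - target))
                j++                            # move on while the next date is closer
            print date[j]
        }
    }'
}
```

Called as pick_spread 3 < file on the 30 dates from the question, this should print 20220101, 20220115 and 20220130. On a tie between two equally close dates the earlier one is kept.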

Upvotes: 1
