MarArauyo

Reputation: 11

How to use AWK/GAWK to select/print records/lines between two dates of the format YYYY-MM-DD?

I have a question about the range pattern in gawk, BEGPAT, ENDPAT {ACTION}. It seems unsuited to my case, or more likely the problem is my misunderstanding of how ranges work.

I want to print/select the records/lines that fall within a range of dates of the form YYYY-MM-DD. The dates are in a specific field/column, they are in ascending order, and they are not unique, i.e.:

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

How can I select, let's say, from 2021-08-02 to 2021-08-05 (the actual data goes back two years) to get:

2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

I tried the following: '/2021-08-03/, /2021-08-05/{print}'

Resulting in this:

2021-08-03
2021-08-04
2021-08-05

Any help within the scope of gawk/awk is appreciated. The documentation about ranges is here, but since I'm just learning to code it can be difficult to understand sometimes. Perhaps there are other approaches within awk to solve this?

Upvotes: 1

Views: 326

Answers (2)

Ed Morton

Reputation: 203532

awk -v beg='2021-08-02' -v end='2021-08-05' '
    $1 >= beg { inRange=1 }
    $1 > end { exit }
    inRange { print }
' file

Unless you're coding strictly for brevity, range expressions are never the best approach and you should always use a flag variable instead (I named it inRange above, but f or found or whatever other name you like is fine too). See Is a /start/,/end/ range expression ever useful in awk?.

If you prefer a briefer solution you can do the above with hard-coded values and a shorter variable name as:

awk '$1=="2021-08-02"{f=1} $1>"2021-08-05"{exit} f' file

Note that, among other things, the above is more efficient than using a range expression as it'll exit after the range is printed rather than continuing reading the rest of the input.
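Since the question mentions that the dates live in a specific field, here is a minimal sketch of the same flag approach with the column number passed in as a variable. The sample data, the two-column layout, and the name col are assumptions made up for illustration; beg, end and inRange are as above.

```shell
# Hypothetical sample: the date is in column 2 of a space-separated file.
sample='a 2021-08-01
b 2021-08-02
c 2021-08-02
d 2021-08-03
e 2021-08-05
f 2021-08-05
g 2021-08-06'

# col is an illustrative variable naming the field that holds the date.
result=$(printf '%s\n' "$sample" |
    awk -v col=2 -v beg='2021-08-02' -v end='2021-08-05' '
        $col >= beg { inRange=1 }   # string comparison; fine for zero-padded YYYY-MM-DD
        $col > end  { exit }        # input is sorted, so nothing later can be in range
        inRange     { print }
    ')

printf '%s\n' "$result"
```

The comparisons are plain string comparisons, which only order correctly because the dates are zero-padded and written year first.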

Upvotes: 1

Daweo

Reputation: 36450

I would say that it is indeed unsuited to your case, as you have repeats:

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

so ENDPAT will trigger at the first occurrence of 2021-08-05. If you must use a range AT ANY PRICE then you might use GNU AWK as follows. Let file.txt content be

2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

then

awk '/2021-08-0[25]/,/2021-08-05/{print}' file.txt

output

2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05

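Explanation: there are 2 kinds of range in 1. The begin pattern matches both 2021-08-02 and 2021-08-05, so the first range runs from the first 2021-08-02 to the first 2021-08-05, and each later 2021-08-05 line matches both the begin and the end pattern, forming a one-line range of its own. EDIT: If composing the regular expression this way is not possible, you might use | instead, i.e. awk '/2021-08-02|2021-08-05/,/2021-08-05/' file.txt, as suggested in a comment.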

(tested in GNU Awk 5.0.1)
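To see the difference side by side, here is a small sketch that runs the plain range and the alternation variant against the same sample dates. The shell variable names dates, plain and fixed are only illustrative.

```shell
# Same sample dates as in file.txt above, held in a shell variable.
dates='2021-08-01
2021-08-02
2021-08-02
2021-08-02
2021-08-03
2021-08-04
2021-08-05
2021-08-05
2021-08-05'

# Plain range: ends at the FIRST line matching the end pattern.
plain=$(printf '%s\n' "$dates" | awk '/2021-08-02/,/2021-08-05/')

# Begin pattern also matches the end date, so each repeated
# 2021-08-05 line opens and closes a one-line range of its own.
fixed=$(printf '%s\n' "$dates" | awk '/2021-08-02|2021-08-05/,/2021-08-05/')

echo "plain range:";      printf '%s\n' "$plain"
echo "with alternation:"; printf '%s\n' "$fixed"
```

The plain range prints six lines and stops after the first 2021-08-05, while the alternation variant prints all eight wanted lines.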

Upvotes: 0
