How do I match lines consisting *only* of four digits and remove lines in between two regexp matches?

Question

I need to process a file consisting of records like the following:

5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--
5147
Xixón: balneario e praia de San Lorenzo
ca.1920-1930
Tipos de unidades de instalación: FOT:FOT
-----FOT-3493--

I need to remove the 1- to 4-digit record number (e.g. 5145) and any notes such as "Nota a data: extraída do listado de compra", which always come at the end of the record, after the signature (-----FOT-xxxx--) and before the next record's number.

I've been trying to write an awk program to do this but I don't seem to be able to grasp awk's syntax or regular expressions at all.

Here's my attempt to match record numbers, those lines consisting of 1 to 4 digits only. (I think I'm missing the "only" part).

$ gawk '!/[[:digit:]]{1,4}/ { print $0 }' myUTF8file.txt

Also, I can match these (record signatures):

$ gawk '/-----FOT-[[:digit:]]{4}--/ { print $0 }' myUTF8file.txt
-----FOT-3411--
-----FOT-3406--
-----FOT-3397--
-----FOT-3412--
...

but I don't know how to remove the lines in between those and the record numbers.

Excuse my English and my repeated use of the word record, which I know might be confusing given the topic.

glenn jackman · Accepted Answer

A little state machine:

awk '
    p {print} 
    /^[[:digit:]]{4}$/ {p=1} 
    /^-----FOT-[[:digit:]]{4}--$/ {p=0}
' file

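Each line is printed while the p variable is true. Because the print rule comes first, the 4-digit line that sets p=1 is not itself printed, whereas the "FOT" line is printed before p is reset to 0, so any notes that follow it are dropped until the next 4-digit line.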
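
The question body mentions record numbers of 1 to 4 digits, while the pattern above matches exactly four. Below is a minimal sketch of a variant that accepts 1 to 4 digits. It assumes the same record layout as the sample, and it deliberately avoids interval expressions like {1,4} because some awk implementations (for example older mawk builds) do not support them. The function name strip_records is hypothetical, introduced only for this example.

```shell
# strip_records: hypothetical helper name used only in this sketch.
# Reads the named files (or standard input) and prints each record
# without its record number and without any trailing notes.
strip_records() {
    awk '
        p { print }                                   # inside a record: print the line
        length($0) <= 4 && /^[0-9]+$/ { p = 1 }       # line of 1 to 4 digits only: start printing at the next line
        /^-----FOT-[0-9]+--$/ { p = 0 }               # signature line (already printed above): stop printing
    ' "$@"
}

# Example run on part of the sample data from the question.
strip_records <<'EOF'
5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--
EOF
```

The anchors ^ and $ are what supply the "only" part: without them, a line such as "ca.1920-1930" would also match a digit pattern because it merely contains digits. With gawk, /^[[:digit:]]{1,4}$/ should work as a drop-in replacement for the length test.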
