Reputation: 769
I need to process a file consisting of records like the following:
5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--
5147
Xixón: balneario e praia de San Lorenzo
ca.1920-1930
Tipos de unidades de instalación: FOT:FOT
-----FOT-3493--
I need to remove the record number of 1 to 4 digits (e.g. 5145) and any notes, such as "Nota a data: extraída do listado de compra", which always come at the end of the record, after the signature (-----FOT-xxxx--) and before the next record's number.
I've been trying to write an awk program to do this but I don't seem to be able to grasp awk's syntax or regular expressions at all.
Here's my attempt to match record numbers, those lines consisting of 1 to 4 digits only. (I think I'm missing the "only" part).
$ gawk '!/[[:digit:]]{1,4}/ { print $0 }' myUTF8file.txt
Also, I can match these (record signatures):
$ gawk '/-----FOT-[[:digit:]]{4}--/ { print $0 }' myUTF8file.txt
-----FOT-3411--
-----FOT-3406--
-----FOT-3397--
-----FOT-3412--
...
but I don't know how to remove the lines in between those and the record numbers.
Excuse my English and my repeated use of the word record, which I know might be confusing given the topic.
Upvotes: 0
Views: 56
Reputation: 15501
If the note lines always start with the string "Nota " (and no other lines start that way) then this will work.
awk '
/^[0-9]{1,4}$/ {next}
/^Nota / {next}
1
' file
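Your regular expression was wrong in two ways: it used {1-4} instead of {1,4} for the interval, and it was not anchored with ^ and $, so it matched any line containing a digit rather than only lines made up entirely of digits.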
The 1 in the awk script above is a pattern that is always true, so it causes the default action (printing the record) to be executed.
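To see what the anchors change, here is a small sketch run on four lines taken from the question. The variable names are made up for illustration, and the anchored pattern is spelled out with ? instead of {1,4} only because not every awk build supports interval expressions; with gawk the {1,4} form should behave the same way.

```shell
#!/bin/sh
# Sample lines taken from the question: a record number, a title,
# a date line and a signature.
sample='5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--'

# Unanchored: a line is dropped if it contains a digit anywhere,
# which is what an unanchored [[:digit:]]{1,4} amounts to.
unanchored=$(printf '%s\n' "$sample" | awk '!/[0-9]/')

# Anchored: a line is dropped only if it is nothing but 1 to 4 digits.
# Written without an interval expression for portability.
anchored=$(printf '%s\n' "$sample" | awk '!/^[0-9][0-9]?[0-9]?[0-9]?$/')

printf 'unanchored keeps:\n%s\n\nanchored keeps:\n%s\n' "$unanchored" "$anchored"
```

The unanchored variant keeps only the title line, because the date and the signature also contain digits; the anchored variant drops just the record number.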
Upvotes: 1
Reputation: 247022
A little state machine:
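awk '
p {print}                             # while p is set, print the current line
/^[[:digit:]]{1,4}$/ {p=1}            # record number (1 to 4 digits): print from the next line on
/^-----FOT-[[:digit:]]{4}--$/ {p=0}   # signature, already printed above: stop printing
' file
Print a line when the p variable is true: printing turns on after a line of 1 to 4 digits and turns off after the "FOT" signature line, so the record number and any trailing notes are skipped.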
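As a worked run of the state-machine idea, the sketch below feeds it two records from the question, one of them followed by a note. The function name clean_records and the variable names are hypothetical, and the digit patterns are written without {n,m} intervals on the assumption that the script may be run with an awk that lacks them.

```shell
#!/bin/sh
# clean_records is a hypothetical helper name; it filters records read on stdin.
clean_records() {
    awk '
        p { print }                                    # inside a record: print the line
        /^[0-9][0-9]?[0-9]?[0-9]?$/ { p = 1 }          # record number: print from the next line on
        /^-----FOT-[0-9][0-9][0-9][0-9]--$/ { p = 0 }  # signature, already printed: stop
    '
}

# Two records from the question; the first one ends with a note.
records='5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--'

cleaned=$(printf '%s\n' "$records" | clean_records)
printf '%s\n' "$cleaned"
```

Because printing is switched by position rather than by content, anything between a signature and the next record number is dropped whatever it starts with, and a purely numeric line inside a record would still be printed since p is already set when it is reached.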
Upvotes: 2