Nirro

Reputation: 769

How do I match lines consisting *only* of four digits and remove the lines between two regexp matches?

I need to process a file consisting of records like the following:

5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--
5147
Xixón: balneario e praia de San Lorenzo
ca.1920-1930
Tipos de unidades de instalación: FOT:FOT
-----FOT-3493--

I need to remove the record number, a line of 1 to 4 digits (e.g. 5145), and any notes such as "Nota a data: extraída do listado de compra", which always come at the end of the record, after the signature (-----FOT-xxxx--) and before the next record's number.

I've been trying to write an awk program to do this but I don't seem to be able to grasp awk's syntax or regular expressions at all.

Here's my attempt to match record numbers, those lines consisting of 1 to 4 digits only. (I think I'm missing the "only" part).

$ gawk '!/[[:digit:]]{1,4}/ { print $0 }' myUTF8file.txt

Also, I can match these (record signatures):

$ gawk '/-----FOT-[[:digit:]]{4}--/ { print $0 }' myUTF8file.txt
-----FOT-3411--
-----FOT-3406--
-----FOT-3397--
-----FOT-3412--
...

but I don't know how to remove the lines in between those and the record numbers.

Excuse my English and my repeated use of the word record, which I know might be confusing given the topic.

Upvotes: 0

Views: 56

Answers (2)

ooga

Reputation: 15501

If the note lines always start with the string "Nota " (and no other lines start that way) then this will work.

awk '
  /^[0-9]{1,4}$/ {next}
  /^Nota /       {next}
  1
' file

Two things to watch in the regular expression:

  1. The interval quantifier is written {1,4}, with a comma; {1-4} does not mean "one to four".
  2. Without begin (^) and end ($) anchors, the pattern matches any line that contains a digit anywhere, not only lines that consist of a 1 to 4 digit number and nothing else.

The 1 in the awk script above is a pattern that is always true, so it causes the default action (printing the record) to be executed.
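
To see what the anchors change, here is a small sketch run on a few lines taken from the question. It spells the 1-to-4-digit pattern without an interval expression, as [0-9][0-9]?[0-9]?[0-9]?, on the assumption that your awk might not support {1,4} (older mawk builds, or gawk before 4.0 without --re-interval); with a current gawk the {1,4} form is equivalent. The variable names are only for illustration.

```shell
# A few lines from the question's data.
sample='5145
Xixón: balneario e praia de San Lorenzo
ca.1920-1930
-----FOT-3493--'

# Unanchored: selects every line that contains a digit somewhere.
unanchored=$(printf '%s\n' "$sample" | awk '/[0-9]/')

# Anchored: selects only lines made up of 1 to 4 digits and nothing else.
anchored=$(printf '%s\n' "$sample" | awk '/^[0-9][0-9]?[0-9]?[0-9]?$/')

printf 'unanchored:\n%s\n\nanchored:\n%s\n' "$unanchored" "$anchored"
```

The unanchored pattern also catches the date and signature lines, which is why negating it throws away far more than the record numbers.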

Upvotes: 1

glenn jackman

Reputation: 247022

A little state machine:

awk '
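    p {print}
    /^[[:digit:]]{1,4}$/ {p=1}
    /^-----FOT-[[:digit:]]{4}--$/ {p=0}
' file

Print a line when the p variable is true: turn on printing after seeing the record-number line (1 to 4 digits), and turn it off after the "FOT" line. The rules run in order, so the record number itself is never printed (p is still false when it is read), while the "FOT" line is printed before p is cleared. Any notes that follow the signature are skipped.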
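
A quick way to check this against the sample in the question is the sketch below. It wraps the same logic in a hypothetical shell function, strip_records, accepts 1 to 4 digits for the record number as the question describes, and spells the patterns without interval expressions or POSIX classes in case the local awk lacks them (an assumption about your setup, not a requirement).

```shell
# strip_records is an illustrative name, not an existing tool.
# Reads the named files (or stdin) and drops record numbers and trailing notes.
strip_records() {
    awk '
        p { print }                                     # inside a record: print the line
        /^[0-9][0-9]?[0-9]?[0-9]?$/ { p = 1 }           # record number: start printing from the next line
        /^-----FOT-[0-9][0-9][0-9][0-9]--$/ { p = 0 }   # signature (printed by the first rule): stop afterwards
    ' "$@"
}

# Sample records copied from the question.
strip_records <<'EOF'
5145
Xibraltar: vista xeral do Peñón
1934, xaneiro, 1 a 1934, decembro, 31
-----FOT-5011--
Nota a data: extraída do listado de compra.
5146
Xixón: a praia de San Lorenzo desde o balneario
ca.1920-1930
-----FOT-3496--
EOF
```

Because the notes are dropped by position (after the signature, before the next record number), this does not depend on how a note is worded.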

Upvotes: 2
