RFVoltolini

Reputation: 390

SED - Deleting occurrences of first line through the rest of the file

I'm stuck on something that looks like it should be simple to do with SED.

I have some (kind of) CSV files that I get from another application, so I cannot control their output. Some preprocessing is already done with SED, but I am stuck on the last step. I would like to do it with SED too, if possible, to avoid bringing in a third application.

The problem is that the heading line of the file (the first line) is repeated throughout the file, but unfortunately with the following characteristics:

  1. The heading of each CSV file is not known in advance. Each file has its own heading, which might differ from the others;
  2. The repetition does not always occur every N lines (N being a fixed, known number);
  3. Other (non-heading) data lines might be repeated, and those should be kept.

So, suppose I have the following 2 files:

Cash.csv

Name; Amount
John; 3.55
Erick; 4.76
John; 8.99
Name; Amount
Erick; 4.76
Mark; 1.00
Name; Amount
John; 3.55

Check.csv

Name; Account; Amount
Erick; 345344; 123.00
Mark; 88849; 323.50
Name; Account; Amount
John; 474473; 99.00
Mark; 88849; 323.50
Mark; 88849; 323.50
John; 474473; 99.00

What I want is a single SED script that, applied to each file, turns them into:

Cash.processed.csv

Name; Amount
John; 3.55
Erick; 4.76
John; 8.99
Erick; 4.76
Mark; 1.00
John; 3.55

Check.processed.csv

Name; Account; Amount
Erick; 345344; 123.00
Mark; 88849; 323.50
John; 474473; 99.00
Mark; 88849; 323.50
Mark; 88849; 323.50
John; 474473; 99.00

I was wondering if it's possible to use the SED "hold buffer" as a pattern in the delete command:

1h     #Hold the first line (headings)
/\h/d  #Use hold buffer as a pattern to delete

Supposing "\h" would return the hold buffer to the delete command.

Thanks for any replies.

PS: Please don't answer with the following over-specific command:

1p;/Name; Amount\|Name; Account; Amount/d

Upvotes: 1

Views: 200

Answers (3)

potong

Reputation: 58371

This might work for you (GNU sed):

sed '1h;1!{G;/^\(.*\)\n\1$/d;s/\n.*//}' file

Explanation:

  • 1h stores the heading line in the hold space (HS); the line is then printed as usual.
  • 1!{G;/^\(.*\)\n\1$/d;s/\n.*//} for every line but the first, append a newline followed by the contents of the HS (i.e. the heading line). If the line is identical to the heading, delete it; the trailing $ stops a line that is merely a prefix of the heading (an empty line, for instance) from matching. Otherwise remove the appended newline and heading line and print as normal.
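As a rough illustration of the steps above, here is the command run against the Cash.csv sample from the question. The function name drop_repeated_heading and the temporary directory are only there for the demonstration, GNU sed is assumed, and the pattern ends in $ so that only lines identical to the heading are removed.

```shell
# Hypothetical wrapper name; assumes GNU sed.
# G appends "\n<heading>" to each line, the back-reference \1 then
# checks whether the line and the heading are the same text.
drop_repeated_heading() {
    sed '1h;1!{G;/^\(.*\)\n\1$/d;s/\n.*//}' "$1"
}

# Recreate the Cash.csv sample from the question in a scratch directory.
workdir=$(mktemp -d)
printf '%s\n' 'Name; Amount' 'John; 3.55' 'Erick; 4.76' 'John; 8.99' \
    'Name; Amount' 'Erick; 4.76' 'Mark; 1.00' 'Name; Amount' \
    'John; 3.55' > "$workdir/Cash.csv"

# The heading survives on line 1; its two repeats are dropped, while the
# repeated data lines (John; 3.55 and Erick; 4.76) are kept.
drop_repeated_heading "$workdir/Cash.csv" > "$workdir/Cash.processed.csv"
cat "$workdir/Cash.processed.csv"
```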

EDIT:

This is indeed very slow on large files; a quicker and perhaps easier to understand solution is:

sed 's|.*|1!{/^&$/d}|;q' file | sed -f - file

This makes a sed script from the first line of the input file.
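To make the two stages visible, the sketch below captures the generated script in a variable instead of piping it, then feeds it back with -e. The variable names and the tiny sample file are made up for the illustration, and GNU sed is assumed.

```shell
# Small made-up sample with one repeated heading.
sample=$(mktemp)
printf '%s\n' 'Name; Amount' 'John; 3.55' 'Name; Amount' 'Mark; 1.00' > "$sample"

# Stage 1: turn the first line into a one-line sed script, then quit.
generated=$(sed 's|.*|1!{/^&$/d}|;q' "$sample")
printf '%s\n' "$generated"
# prints: 1!{/^Name; Amount$/d}

# Stage 2: run that script over the same file.
result=$(sed -e "$generated" "$sample")
printf '%s\n' "$result"
```

Because the heading is pasted into the script as a regular expression, a heading containing / or characters such as . or * would need escaping before this works reliably.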

Upvotes: 2

Vijay

Reputation: 67211

In case you are interested in awk:

awk '{if(NR==1){p=$0;print}if(NR>1 && p!=$0)print}' your_file

Upvotes: 1

Jonathan Leffler

Reputation: 753525

I think you'll need to capture the first line from one sed command and then use that in the main operational command:

line1=$(sed 1q "$datafile")

sed -e "2,$ {/$line1/d;}" \
    -e '...rest of sed script...' "$datafile"

Because the sed 1q quits after reading the first line, it is quick regardless of how big the data file is. If there's a chance that the first line might contain a slash (heading "Name/Number", perhaps), then think of using something like this, which replaces all slashes with . (a regex wildcard that still matches the slash). Other regex metacharacters in the heading would need similar treatment:

line1=$(sed '1{s%/%.%g;q;}' "$datafile")

I did some futzing with the Mac OS X (10.8.1) version of sed, which is fussier than GNU sed. In the second (main) sed command, the match had to be in {...}, the dollar had to be separate (or the shell gets antsy about invalid parameter substitution), and the semi-colon was needed. Some of those restrictions probably aren't needed with GNU sed, but the code shown is likely to work anywhere.
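A possible extension of the same two-step idea, offered only as a sketch: escape the basic-regex metacharacters and the slash in the captured heading, and anchor the match so that only whole-line repeats are deleted. The function name strip_repeated_header and the sample data are hypothetical, and the escaping covers only the characters listed in the bracket expression.

```shell
# Hypothetical helper: print the file without repeats of its first line.
strip_repeated_header() {
    file=$1
    # Put a backslash before ] [ \ / . ^ $ * in the first line, then quit.
    header=$(sed -e 's:[][\\/.^$*]:\\&:g' -e 1q "$file")
    # From line 2 onwards, delete lines that match the heading exactly.
    sed -e "2,\$ {/^${header}\$/d;}" "$file"
}

# Made-up sample whose heading contains a slash and regex metacharacters.
demo=$(mktemp)
printf '%s\n' 'Name/Id; Amount.*' 'John; 1.00' 'Name/Id; Amount.*' \
    'Name/Id; AmountXY' 'Mark; 2.00' > "$demo"

# Only the exact repeat on line 3 is dropped; "Name/Id; AmountXY" stays
# because the . and * in the heading are matched literally.
strip_repeated_header "$demo"
```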

Upvotes: 4
