Dinedal

Reputation: 2696

How do I delete the current line based on a match in both the previous and current lines in sed?

Given a sorted file like so:

AAA 1 2 3
AAA 2 3 4
AAA 3 4 2
BBB 1 1 1
BBB 1 2 1

and a desired output of

AAA 1 2 3
BBB 1 1 1

what's the best way to achieve this with sed?

Basically, if a line starts with the same first field as the previous line, how do I delete it? The rest of the data must be kept in the output.

I imagine there must be some way to do this either using the hold buffer, branching, or the test command.

Upvotes: 0

Views: 114

Answers (6)

Using sed:

#!/bin/sed -nf

P

: loop
s/\s.*//
N
/\([^\n][^\n]*\)\n\1/ b loop

D

First, we must pass the -n flag to sed so that it prints only what we tell it to.

We start by printing the line with the "P" command. The first line of the input is always printed, and the rest of the script is arranged so that "P" only runs again once a line with a new first field has been found.

Next comes a loop. A loop starts at a label defined with the ":" command (here the label is named "loop"), and we jump back to that label with the "b" command (or the "t" test command). This loop is quite simple:

  1. Remove everything but the first field (replace the first whitespace character and everything that follows it with nothing). On later passes through the loop this also discards the line that was just appended.
  2. Append the next line (a newline character will be included)
  3. Check whether the new line starts with the field we isolated. We do this with a capture: a sub-match whose matched text is stored in a numbered "variable", numbered in the order the captures appear. Captures are delimited by parentheses escaped with backslashes (starting with \( and ending with \)). Here the capture matches one non-newline character (i.e. [^\n]) followed by any number of further non-newline characters, which prevents it from matching an empty string before the newline. After the capture, we match a newline character followed by \1, the back-reference that holds the text matched by the first capture. If this succeeds, the new line repeats the first field, so we jump back to the start of the loop with the "b" branch command.
  4. When we exit the loop, we have found a line with a different first field, so we drop the old field and restart the script with the new line. The "D" command does this: it deletes the pattern space up to the first newline and restarts the script without reading new input.
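
The steps above can be seen in action with a commented copy of the script run on the sample input from the question. This is only a sketch: the function name dedupe_first_field is made up for illustration, and GNU sed is assumed (for \s and for \n inside a bracket expression).

```shell
# dedupe_first_field is a hypothetical wrapper name used only in this sketch.
dedupe_first_field() {
  sed -n '
    # Print the first line of the pattern space (the line we keep)
    P
    :loop
    # Step 1: reduce the pattern space to the first field
    s/\s.*//
    # Step 2: append the next input line (with -n, sed exits quietly at end of input)
    N
    # Step 3: same first field again, so go round the loop
    /\([^\n][^\n]*\)\n\1/ b loop
    # Step 4: new first field, so drop the old one and restart the script
    D
  '
}

# Sample input taken from the question
printf '%s\n' 'AAA 1 2 3' 'AAA 2 3 4' 'AAA 3 4 2' 'BBB 1 1 1' 'BBB 1 2 1' |
  dedupe_first_field
```

On the sample input this prints the two lines "AAA 1 2 3" and "BBB 1 1 1".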

This can be shortened to a single line (notice that the "loop" label has been renamed to "a"):

sed -n -e 'P;:a;s/\s.*//;N;/\([^\n][^\n]*\)\n\1/ba;D'

Upvotes: 0

Steve

Reputation: 54402

One way using GNU awk:

awk '!array[$1]++' file.txt

Results:

AAA 1 2 3
BBB 1 1 1
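
The one-liner works because array[$1]++ yields the count from before the increment, so the negation is true only the first time a given first field appears. A longer sketch of the same idea may make that easier to see; the names first_seen and seen are illustrative and not part of the answer. Note that this drops repeats anywhere in the file, not only on adjacent lines, which gives the same result for sorted input.

```shell
# first_seen is a hypothetical wrapper name used only in this sketch.
first_seen() {
  awk '
    {
      # seen[$1] is unset (treated as 0) the first time a first field appears
      if (seen[$1] == 0) print
      # Count every occurrence so that later lines with this field are skipped
      seen[$1]++
    }
  '
}

# Sample input taken from the question
printf '%s\n' 'AAA 1 2 3' 'AAA 2 3 4' 'AAA 3 4 2' 'BBB 1 1 1' 'BBB 1 2 1' |
  first_seen
```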

Upvotes: 0

potong

Reputation: 58430

This might work for you (GNU sed):

sed -r ':a;$!N;s/^((\S+\s).*)\n\2.*/\1/;ta;P;D' file

or maybe just:

sort -uk1,1 file
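
For readers who find the sed one-liner in this answer dense, here is the same script spread over several lines with comments. It is a sketch that assumes GNU sed (for -r, \S and \s); the function name keep_first_per_key is made up for illustration.

```shell
# keep_first_per_key is a hypothetical wrapper name used only in this sketch.
keep_first_per_key() {
  sed -r '
    :a
    # Append the next line unless the current line is the last one
    $!N
    # If the second line starts with the same field-plus-whitespace prefix
    # as the first, drop the second line and keep the first
    s/^((\S+\s).*)\n\2.*/\1/
    # A substitution happened: go back and test the following line too
    ta
    # Otherwise print the first line and restart with what remains
    P
    D
  '
}

# Sample input taken from the question
printf '%s\n' 'AAA 1 2 3' 'AAA 2 3 4' 'AAA 3 4 2' 'BBB 1 1 1' 'BBB 1 2 1' |
  keep_first_per_key
```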

Upvotes: 0

user647772

Reputation:

This could be done with AWK:

$ gawk '{if (last != $1) print; last = $1}' in.txt
AAA 1 2 3
BBB 1 1 1

Upvotes: 1

Lev Levitsky

Reputation: 65791

Maybe there's a simpler way with sed, but:

sed ':a;N;/^\([[:alnum:]]*[[:space:]]\).*\n\1/{s/\n.*//;ta};P;D'

This produces the output

AAA 1 2 3
BBB 1 1 1

which matches the description in the question:

if the col starts with the same field as the previous line, how do I delete it?

Upvotes: 0

Kent

Reputation: 195079

Another way with awk:

awk '!($1 in a){print;a[$1]}' file

Upvotes: 1
