user1889034

Reputation: 353

sed: how to determine if line 1 is contained in line 2

My text file is sorted alphabetically. I want to determine if each line is contained within the following line, and if so, delete the first of the two. So, for example, if I had...

car 
car and trailer
train

... I want to end up with...

car and trailer
train

I found the "sed one-liners" page, which has this code to delete duplicate consecutive lines:

sed '$!N; /^\(.*\)\n\1$/!P; D'

... and I figured deleting the ^ would do the trick, but it didn't.

(It would also be nice to do this with non-consecutive lines, but my files run to thousands of lines, and it would probably take a script hours, or days, to run.)

Upvotes: 2

Views: 279

Answers (3)

Ed Morton

Reputation: 204628

sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk:

awk '$0 !~ prev{print prev} {prev=$0} END{print}' file
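
One caveat: `!~` treats the previous line as a regular expression, so a line containing characters such as `.` or `*` may match more than intended. A minimal sketch of a literal-substring variant using `index()` follows; the function name `drop_contained` is hypothetical and not part of the original answer:

```shell
# Hypothetical helper: drop each line that is a literal substring of the next line.
drop_contained() {
    awk 'NR > 1 && index($0, prev) == 0 { print prev }  # prev is not inside the current line, so keep it
         { prev = $0 }                                  # remember the current line for the next comparison
         END { if (NR) print prev }                     # the last line has no successor, so always keep it
    ' "$@"
}

printf '%s\n' 'car' 'car and trailer' 'train' | drop_contained
```

On the sample from the question this should print `car and trailer` and `train`. Pass a file name instead of piping if that is more convenient, for example `drop_contained file`.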

Upvotes: 2

Joseph Quinsey

Reputation: 9972

You said:

It would also be nice to do this with non-consecutive lines.

Here is a bash script that removes every line contained within another, longer line (not necessarily consecutive), matching case-insensitively:

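#!/bin/bash
# The I flag and the Q command are GNU sed extensions.
# Each line is used as a regex, so this assumes lines contain no regex metacharacters or '/'.
while IFS= read -r line; do
   echo "Searching for: $line"
   # Quit with status 99 if some line holds $line plus at least one more character.
   sed -n "/.$line/IQ99;/$line./IQ99" test.txt # or grep -i
   if [ $? -eq 99 ]; then
      echo "Removing: $line"
      sed -i "/^$line$/d" test.txt
   fi
done < test.txt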

Test:

$ cat test.txt
Boat
Car
Train and boat
car and cat

$ my_script
Searching for: Boat
Removing: Boat
Searching for: Car
Removing: Car
Searching for: Train and boat
Searching for: car and cat

$ cat test.txt
Train and boat
car and cat
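
For larger files, the same non-consecutive, case-insensitive filtering can be sketched as a single awk pass that holds all lines in memory and compares them as literal strings, which avoids running sed once per line. The function name `remove_contained_lines` is hypothetical, and keeping lines that differ only in case (rather than treating one as contained in the other) is an assumption that mirrors the script above:

```shell
# Hypothetical helper: print only lines that are not a case-insensitive
# literal substring of some other, longer line.
remove_contained_lines() {
    awk '
        { line[NR] = $0; lower[NR] = tolower($0) }   # store original and lowercased copies
        END {
            for (i = 1; i <= NR; i++) {
                keep = 1
                for (j = 1; j <= NR; j++) {
                    # Skip identical lines so exact duplicates do not delete each other.
                    if (i != j && lower[i] != lower[j] && index(lower[j], lower[i]) > 0) {
                        keep = 0
                        break
                    }
                }
                if (keep) print line[i]
            }
        }' "$@"
}

printf '%s\n' 'Boat' 'Car' 'Train and boat' 'car and cat' | remove_contained_lines
```

On the test data above this should leave `Train and boat` and `car and cat`. The comparison is still quadratic in the number of lines, but it runs inside one process and does not modify the input file.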

Upvotes: 0

TheRuss

Reputation: 318

The original command

sed '$!N; /^\(.*\)\n\1$/!P; D'

looks for an exact match between two consecutive lines. Because you want to check whether the first line is contained in the second, you need to add wildcards around the backreference:

sed '$!N; /^\(.*\)\n.*\1.*$/!P; D'

That should do it.
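
A quick check against the sample from the question, assuming a sed that accepts `\n` in the pattern (GNU sed does); the variable name `result` is only for illustration:

```shell
# Feed the three sample lines through the containment-aware one-liner.
result=$(printf '%s\n' 'car' 'car and trailer' 'train' |
    sed '$!N; /^\(.*\)\n.*\1.*$/!P; D')

printf '%s\n' "$result"
```

This should print `car and trailer` followed by `train`. Because `\1` is a backreference, the first line is compared literally, so regex metacharacters in the data are not interpreted.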

Upvotes: 2
