Reputation: 353
My text file is sorted alphabetically. I want to determine if each line is contained within the following line, and if so, delete the first of the two. So, for example, if I had...
car
car and trailer
train
... I want to end up with...
car and trailer
train
I found the "sed one-liners" page(s), which has the code to search out duplicate lines:
sed '$!N; /^\(.*\)\n\1$/!P; D'
... and I figured deleting the ^ would do the trick, but it didn't.
(It would also be nice to do this with non-consecutive lines, but my files run to thousands of lines, and it would probably take a script hours, or days, to run.)
Upvotes: 2
Views: 279
Reputation: 204628
sed is an excellent tool for simple substitutions on a single line; for anything else, just use awk:
awk '$0 !~ prev{print prev} {prev=$0} END{print prev}' file
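One caveat: `!~` treats the previous line as a regular expression, so a line such as `a.b` would count as "contained" in `axb`. If the lines should be compared as plain text, a minimal sketch along the same lines could use `index()` instead. The function name `drop_if_in_next` is hypothetical and only there to make the example easy to call.

```shell
#!/bin/sh
# Hypothetical helper: print each input line unless it occurs, as literal
# text, inside the line that immediately follows it.
drop_if_in_next() {
    awk '
        # index() returns 0 when prev is not a substring of the current line
        NR > 1 && index($0, prev) == 0 { print prev }
        { prev = $0 }
        # the last line has no successor, so it is always kept
        END { if (NR) print prev }
    '
}

printf '%s\n' 'car' 'car and trailer' 'train' | drop_if_in_next
# prints:
# car and trailer
# train
```

With this variant, `a.b` followed by `axb` keeps both lines, because the dot is no longer a wildcard.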
Upvotes: 2
Reputation: 9972
You said:
It would also be nice to do this with non-consecutive lines.
Here is a bash script to remove all shorter lines contained within another line, not necessarily consecutive, case-insensitive:
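#!/bin/bash
# The I flag and the Q command (quit with an exit status) are GNU sed extensions.
# Note: $line is interpolated as a regex, so regex metacharacters in the data are not escaped.
while IFS= read -r line; do
    echo "Searching for: $line"
    # Exit with status 99 if $line occurs inside a longer line (case-insensitive).
    sed -n "/.$line/IQ99;/$line./IQ99" test.txt # or grep -i
    if [ $? -eq 99 ]; then
        echo "Removing: $line"
        sed -i "/^$line$/d" test.txt
    fi
done < test.txt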
Test:
$ cat test.txt
Boat
Car
Train and boat
car and cat
$ my_script
Searching for: Boat
Removing: Boat
Searching for: Car
Removing: Car
Searching for: Train and boat
Searching for: car and cat
$ cat test.txt
Train and boat
car and cat
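If running sed once per line turns out to be slow on larger files, the same idea can be sketched as a single awk pass that holds the whole file in memory. This is only an illustration under a few assumptions: the function name `drop_contained_anywhere` is made up, lines are compared as literal text rather than as regexes, matching is case-insensitive via `tolower()`, and the input is left unmodified (the result goes to standard output). It still compares every line with every other line, so the work grows quadratically with the number of lines.

```shell
#!/bin/sh
# Hypothetical helper: print only the lines that are not contained in some
# strictly longer line anywhere in the input (literal, case-insensitive).
drop_contained_anywhere() {
    awk '
        { orig[NR] = $0; low[NR] = tolower($0) }
        END {
            for (i = 1; i <= NR; i++) {
                keep = 1
                for (j = 1; j <= NR; j++) {
                    # a strictly longer line containing line i makes line i redundant
                    if (length(low[j]) > length(low[i]) && index(low[j], low[i]) > 0) {
                        keep = 0
                        break
                    }
                }
                if (keep) print orig[i]
            }
        }
    ' "$@"
}

printf '%s\n' 'Boat' 'Car' 'Train and boat' 'car and cat' | drop_contained_anywhere
# prints:
# Train and boat
# car and cat
```

Exact duplicates are both kept here, since only strictly longer lines count as containers.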
Upvotes: 0
Reputation: 318
The original command
sed '$!N; /^\(.*\)\n\1$/!P; D'
looks for an exact line match. Since you want to check whether the first line is contained in the second, you need to add some wildcards:
sed '$!N; /^\(.*\)\n.*\1.*$/!P; D'
That should do it.
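A quick way to check this is to run it on the sample from the question. The sketch below wraps the command in a function whose name, `drop_contained`, is made up for illustration. It assumes a sed that accepts `\n` in the pattern and semicolon-separated commands, as GNU sed does.

```shell
#!/bin/sh
# Hypothetical wrapper around the modified one-liner.
drop_contained() {
    # $!N  : append the next line to the pattern space (except on the last line)
    # !P   : print the first line unless it reappears somewhere in the second
    # D    : delete the first line and restart with what remains
    sed '$!N; /^\(.*\)\n.*\1.*$/!P; D'
}

printf '%s\n' 'car' 'car and trailer' 'train' | drop_contained
# prints:
# car and trailer
# train
```

Like the original one-liner, this only compares each line with the one directly after it.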
Upvotes: 2