Reputation: 680

How to remove items that show up as substring later in file?

Given some file that has,

foo/bar
foo/bar/gaz
foo/bar/urk
hello/world
hello/world/congress
hello/world/united/states
hello/world

How can I remove lines which have previous lines as substrings?

For example, foo/bar/gaz has foo/bar - a previous line - as substring, and should be removed.

The above list should be reduced to,

foo/bar
hello/world

(This is kind of like finding a common denominator for the lines in a file.)

Upvotes: 1

Answers (5)

potong

Reputation: 58578

This might work for you (GNU sed):

sed -E 'G;/^([^\n]+).*\n\1(\n.*)*$/d;h;P;d' file

Append each kept line to the hold space, and delete any later line that begins with (or exactly equals) one of those kept lines.

Upvotes: 2

dawg

Reputation: 104092

Here is an awk solution that may be faster if your file is large:

awk 'BEGIN { FS = OFS = "/" }
     $0 in arr { next }
     {
         s = $1
         for (i = 2; i <= NF; i++) {
             if (s in arr || (s OFS $i) in arr) next
             s = s OFS $i
         }
         arr[$0]
     } 1' file

Instead of looping over the entire array contents for each line of input, this loops over the leading path prefixes of each line (foo, foo/bar, foo/bar/gaz, ...) and tests each one for presence in the array of previously kept lines.
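The prefix lookup described above can be sketched as a small shell function. This is only an illustration, assuming paths are separated by "/" and that a line should be dropped when an earlier kept line is one of its leading path prefixes; the name drop_nested_paths is hypothetical and not part of the original answer.

```shell
# drop_nested_paths: hypothetical helper, reads one path per line on stdin.
drop_nested_paths() {
  awk -F/ '
    {
      prefix = ""
      # Build foo, foo/bar, foo/bar/gaz, ... and look each one up.
      for (i = 1; i <= NF; i++) {
        prefix = (i == 1 ? $i : prefix "/" $i)
        if (prefix in seen) next      # an earlier line is a prefix (or a duplicate)
      }
      seen[$0]                        # remember this line as kept
      print
    }'
}

printf '%s\n' foo/bar foo/bar/gaz foo/bar/urk hello/world \
  hello/world/congress hello/world/united/states hello/world | drop_nested_paths
```

Because the lookup works on whole path components, a line such as foo/barbaz is not treated as a child of foo/bar, which differs from the plain substring approaches in the other answers.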

Upvotes: 1

Walter A

Reputation: 20032

When the file contains the line foo/bar, you want to delete every line that matches foo/bar. (that is, foo/bar followed by at least one more character).
Just append a dot to every line and use the result as the exclusion list.

grep -vf <(sed 's/$/./' file) file
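To see what grep is actually given, here is a rough equivalent that writes the pattern list to a temporary file instead of using process substitution, so it also runs in plain sh. The function name filter_with_dots is hypothetical, and this is a sketch rather than a drop-in replacement.

```shell
# filter_with_dots: hypothetical helper; $1 is the input file.
filter_with_dots() {
  patterns=$(mktemp) || return 1
  # Each line becomes a regex: "foo/bar" -> "foo/bar." (one more character required).
  sed 's/$/./' "$1" > "$patterns"
  # Print only the lines that match none of the patterns.
  grep -vf "$patterns" "$1"
  rm -f "$patterns"
}
```

A few things to be aware of with this approach: the patterns are regular expressions, so characters such as . in a path match loosely; a line is removed if it contains another entry anywhere in the file, not only an earlier one; and exact duplicates are not removed, so for the sample input in the question hello/world is printed twice.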

Upvotes: 1

John1024

Reputation: 113994

Try:

$ awk '{for (s in a) if (s == substr($0,1,length(s))) next; print; a[$0]}' file
foo/bar
hello/world

The lines printed so far are the keys of array a. for (s in a) if (s == substr($0,1,length(s))) next checks whether any of those lines, s, is a prefix of the current line, $0. If so, we skip the current line and jump to the next one.

If no previous line is a prefix of the current line, then we print it and add it as a key of a.

Another example

$ cat file2
/etc
/foo/bar/etc
$ awk '{for (s in a) if (s == substr($0,1,length(s))) next; print; a[$0]}' file2
/etc
/foo/bar/etc

The code in this answer treats the "common denominator" as starting from the beginning of the string. Thus /etc is not a "common denominator" for /foo/bar/etc even though both have the common substring /etc.

Upvotes: 2

thanasisp

Reputation: 5975

You can use awk.

awk '{for (i in a) if ($0 ~ i) next} {a[$0]}1' file

Output:

foo/bar
hello/world
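Note that $0 ~ i treats every stored line as a regular expression, so characters like . or + in a line act as metacharacters. If a literal substring test is wanted, one possible variant uses awk's index() function instead. This is a sketch under that assumption; the name drop_containing_earlier is hypothetical.

```shell
# drop_containing_earlier: hypothetical helper, reads lines on stdin.
drop_containing_earlier() {
  awk '
    {
      # index() does a literal search, so "." and "+" are not special.
      for (s in kept)
        if (index($0, s) > 0) next    # an earlier kept line occurs inside this one
      kept[$0]
      print
    }'
}

printf '%s\n' a.c abc | drop_containing_earlier
```

With the regex version, abc would be dropped here because the pattern a.c matches it; with index() both lines are kept.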

Upvotes: 3
