Reputation: 453
I have a file that has 2 columns like the following:
apple pear
banana pizza
spoon fork
pizza plate
sausage egg
If a word appears on multiple lines, I want to delete every line on which the repeating word appears. As you can see, 'pizza' appears twice, so two lines should be deleted. The following is the required output:
apple pear
spoon fork
sausage egg
I am aware of using:
awk '!seen[$1]++'
However, this only removes lines when the repeated string appears in one column; I require a command that will check both columns. How can I achieve this?
Upvotes: 1
Views: 948
Reputation: 58488
This might work for you (GNU grep, sort, uniq, sed):
sed 's/ /\n/g' file | sort | uniq -d | grep -vFf - file
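The first three stages reduce the file to the list of words that occur more than once; the final grep -vFf - then drops every line containing one of them. On the sample file:
$ sed 's/ /\n/g' file | sort | uniq -d
pizza
Note that -F without -w matches fixed substrings, so a duplicate such as pizza would also remove a line that merely contained pizzas.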
Or a toy GNU sed solution:
cat <<\! | sed -Ef - file
H # copy file into hold space
$!d # delete each line of the original file
g # at EOF replace pattern space with entire file
y/ /\n/; # put each word on a separate line
# make a list of duplicate words, space separated
:a;s/^(.*\n)(\S+)(\n.*\b\2\b)/\2 \1\3/;ta
s/\n.*// # remove adulterated file leaving list of duplicates
G # append original file to list
# remove lines with duplicate words
:b;s/^((\S+) .*)\n[^\n]*\2[^\n]*/\1/;tb
s/^\S+ //;tb # reduce duplicate word list
s/..// # remove newline artefacts
!
Upvotes: 0
Reputation: 19625
This works with your sample:
#!/usr/bin/env sh
filename='x.txt'
for dupe in $(xargs -n1 -a "${filename}" | sort | uniq -d); do
sed -i.bak -e "/\\<${dupe}\\>/d" "${filename}"
done
It builds a list of words that appear more than once in the file: xargs -n1 -a "${filename}" outputs the list of all words, sort sorts that list, and uniq -d outputs only the words that appear more than once (uniq needs them on consecutive lines, hence the sort). It then uses sed to select and delete all lines containing the duped word, as shown below.
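On the sample file the duplicate list contains only pizza, so the loop runs a single iteration, equivalent to (the \< and \> word-boundary anchors are GNU sed):
$ xargs -n1 -a x.txt | sort | uniq -d
pizza
$ sed -i.bak -e '/\<pizza\>/d' x.txt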
Upvotes: 0
Reputation: 26521
Using awk, you can keep track of many things: not only whether you have seen a word, but also which line the word was first seen on. We keep track of a couple of arrays:
record: keeps track of every line we parsed
seen: keeps track of the various words as well as the first record number each has been seen on
This gives us:
awk '{ record[NR]=$0 }
{ for(i=1;i<=NF;++i) {
if ($i in seen) { delete record[NR]; delete record[seen[$i]] }
else { seen[$i]=NR }
}
}
END { for(i=1;i<=NR;++i) if (i in record) print record[i] }' file
How does this work?
record[NR]=$0: store the record $0 in an array record indexed by the record number NR.
For every field in the record, check whether the word has been seen before. If it has, delete from record both the record the word was first seen in as well as the current record. If it has not been seen, store the word and the current record number in the array seen.
In the END block, loop over all record numbers; for each one still present in record
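Tracing this on the sample input makes the bookkeeping concrete:
line 1: apple pear    -> seen[apple]=1, seen[pear]=1
line 2: banana pizza  -> seen[banana]=2, seen[pizza]=2
line 3: spoon fork    -> seen[spoon]=3, seen[fork]=3
line 4: pizza plate   -> pizza is in seen, so delete record[4] and record[2]; seen[plate]=4
line 5: sausage egg   -> seen[sausage]=5, seen[egg]=5
END                   -> records 1, 3 and 5 remain and are printed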
, print that record.
Upvotes: 2
Reputation: 204258
$ awk '
NR==FNR {                        # first pass over the file
    for (i=1; i<=NF; i++) {
        if ( firstNr[$i] ) {     # word seen before: mark both of its lines
            multi[NR]
            multi[firstNr[$i]]
        }
        else {
            firstNr[$i] = NR     # remember the line the word first appeared on
        }
    }
    next
}
!(FNR in multi)                  # second pass: print lines that were never marked
' file file
apple pear
spoon fork
sausage egg
or if you prefer:
$ awk '
NR==FNR {                        # first pass: count every word in the file
    for (i=1; i<=NF; i++) {
        cnt[$i]++
    }
    next
}
{                                # second pass: skip any line containing a repeated word
    for (i=1; i<=NF; i++) {
        if ( cnt[$i] > 1 ) {
            next
        }
    }
    print
}
' file file
apple pear
spoon fork
sausage egg
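Both variants rely on the doubled file argument: awk reads the input twice, and NR==FNR (total record number equals current-file record number) is true only while the first copy is being read. A minimal sketch of the idiom:
$ awk 'NR==FNR { print "pass 1:", $0; next } { print "pass 2:", $0 }' file file
This prints every line of the sample twice, first prefixed with pass 1: and then with pass 2:.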
Upvotes: 2
Reputation: 27295
You could solve the problem in multiple steps by using grep and uniq -d.
First, generate a list of all words using something like grep -Eo '[^ ]+'. Then filter that list so that only duplicated words remain; that filtering can be done with … | sort | uniq -d. Finally, print all the lines that do not contain any word from the previously generated list, using grep -Fwvf listFile inputFile.
In bash, all these steps can run in one single command. Here we use the variable $in to make the command easily adaptable.
in="path/to/your/input/file"
grep -Fwvf <(grep -Eo '[^ ]+' "$in" | sort | uniq -d) "$in"
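On the sample file, the process substitution yields just pizza, and the outer grep prints:
$ grep -Fwvf <(grep -Eo '[^ ]+' "$in" | sort | uniq -d) "$in"
apple pear
spoon fork
sausage egg
The -w flag makes grep match whole words only, so pizza will not knock out a line that merely contains pizzas.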
Upvotes: 5