Reputation: 453
I have a file that has 2 columns like the following:
apple pear
banana pizza
spoon fork
pizza plate
sausage egg
If a word appears on multiple lines, I want to delete every line on which the repeating word appears. As you can see, 'pizza' appears twice, so two lines should be deleted. The following is the required output:
apple pear
spoon fork
sausage egg
I am aware of using:
awk '!seen[$1]++'
However, this only removes lines when the repeated string appears in one column; I require a command that will check both columns. How can I achieve this?
Upvotes: 1
Views: 948
Reputation: 58488
This might work for you (GNU grep, sort, uniq, sed):
sed 's/ /\n/g' file | sort | uniq -d | grep -vFf - file
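The first three stages reduce the file to the list of words that occur more than once; the final grep -vFf - then drops every line containing one of them. On the sample file:
$ sed 's/ /\n/g' file | sort | uniq -d
pizza
Note that -F without -w matches fixed substrings, so a duplicate such as pizza would also remove a line that merely contained pizzas.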
Or a toy GNU sed solution:
cat <<\! | sed -Ef - file
H # copy file into hold space
$!d # delete each line of the original file
g # at EOF replace pattern space with entire file
y/ /\n/; # put each word on a separate line
# make a list of duplicate words, space separated
:a;s/^(.*\n)(\S+)(\n.*\b\2\b)/\2 \1\3/;ta
s/\n.*// # remove adulterated file leaving list of duplicates
G # append original file to list
# remove lines with duplicate words
:b;s/^((\S+) .*)\n[^\n]*\2[^\n]*/\1/;tb
s/^\S+ //;tb # reduce duplicate word list
s/..// # remove newline artefacts
!
Upvotes: 0
Reputation: 19625
This works with your sample:
#!/usr/bin/env sh
filename='x.txt'
for dupe in $(xargs -n1 -a "${filename}" | sort | uniq -d); do
sed -i.bak -e "/\\<${dupe}\\>/d" "${filename}"
done
It builds a list of words that appear more than once in the file: xargs -n1 -a "${filename}" outputs the list of all words, sort sorts that list, and uniq -d outputs only the words that appear more than once (uniq needs them on consecutive lines, hence the sort). It then uses sed to select and delete all lines containing the duped word, as shown below.
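On the sample file the duplicate list contains only pizza, so the loop runs a single iteration, equivalent to (the \< and \> word-boundary anchors are GNU sed):
$ xargs -n1 -a x.txt | sort | uniq -d
pizza
$ sed -i.bak -e '/\<pizza\>/d' x.txt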
Upvotes: 0
Reputation: 26521
Using awk, you can keep track of many things: not only whether you have seen a word, but also which line the word was first seen on. We keep track of a couple of arrays:
record: keeps track of every line we parsed
seen: keeps track of the various words as well as the first record number each has been seen on
This gives us:
awk '{ record[NR]=$0 }
{ for(i=1;i<=NF;++i) {
if ($i in seen) { delete record[NR]; delete record[seen[$i]] }
else { seen[$i]=NR }
}
}
END { for(i=1;i<=NR;++i) if (i in record) print record[i] }' file
How does this work?
record[NR]=$0: store the record $0 in an array record indexed by the record number NR.
For every field in the record, check whether the word has been seen before. If it has, delete from record both the record the word was first seen in as well as the current record. If it has not been seen, store the word and the current record number in the array seen.
In the END block, loop over all record numbers; for each one still present in record
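Tracing this on the sample input makes the bookkeeping concrete:
line 1: apple pear    -> seen[apple]=1, seen[pear]=1
line 2: banana pizza  -> seen[banana]=2, seen[pizza]=2
line 3: spoon fork    -> seen[spoon]=3, seen[fork]=3
line 4: pizza plate   -> pizza is in seen, so delete record[4] and record[2]; seen[plate]=4
line 5: sausage egg   -> seen[sausage]=5, seen[egg]=5
END                   -> records 1, 3 and 5 remain and are printed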
, print that record.
Upvotes: 2
Reputation: 204258
$ awk '
NR==FNR {                        # first pass over the file
    for (i=1; i<=NF; i++) {
        if ( firstNr[$i] ) {     # word seen before: mark both of its lines
            multi[NR]
            multi[firstNr[$i]]
        }
        else {
            firstNr[$i] = NR     # remember the line the word first appeared on
        }
    }
    next
}
!(FNR in multi)                  # second pass: print lines that were never marked
' file file
apple pear
spoon fork
sausage egg
or if you prefer:
$ awk '
NR==FNR {                        # first pass: count every word in the file
    for (i=1; i<=NF; i++) {
        cnt[$i]++
    }
    next
}
{                                # second pass: skip any line containing a repeated word
    for (i=1; i<=NF; i++) {
        if ( cnt[$i] > 1 ) {
            next
        }
    }
    print
}
' file file
apple pear
spoon fork
sausage egg
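Both variants rely on the doubled file argument: awk reads the input twice, and NR==FNR (total record number equals current-file record number) is true only while the first copy is being read. A minimal sketch of the idiom:
$ awk 'NR==FNR { print "pass 1:", $0; next } { print "pass 2:", $0 }' file file
This prints every line of the sample twice, first prefixed with pass 1: and then with pass 2:.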
Upvotes: 2
Reputation: 27295
You could solve the problem in multiple steps by using grep and uniq -d.
First, generate a list of all words using something like grep -Eo '[^ ]+'. Then filter that list so that only duplicated words remain; that filtering can be done with … | sort | uniq -d. Finally, print all the lines that do not contain any word from the previously generated list, using grep -Fwvf listFile inputFile.
In bash, all these steps can run in one single command. Here we use the variable $in to make the command easily adaptable.
in="path/to/your/input/file"
grep -Fwvf <(grep -Eo '[^ ]+' "$in" | sort | uniq -d) "$in"
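On the sample file, the process substitution yields just pizza, and the outer grep prints:
$ grep -Fwvf <(grep -Eo '[^ ]+' "$in" | sort | uniq -d) "$in"
apple pear
spoon fork
sausage egg
The -w flag makes grep match whole words only, so pizza will not knock out a line that merely contains pizzas.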
Upvotes: 5