Vaxin

Reputation: 105

Bash: Remove unique and keep duplicate

I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content of column 15 appears only once. As far as I understand, it's the reverse of

sort -u file.txt

After the lines that are unique in column 15 are removed, I would like to shuffle all lines again, so nothing is sorted. For this I would use

shuf file.txt

The resulting file should include only lines that have at least one duplicate (in column 15) but are in a random order.

I have tried to work around sort -u, but it only sorts out the unique lines and discards the actual duplicates I need. Not only do I need the unique lines removed, I also want to keep every line of a duplicate, not just one representative per duplicate.

Thank you.

Upvotes: 3

Views: 4912

Answers (3)

RARE Kpop Manifesto

Reputation: 2821

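Values that occur only once (3, 5, and 6 in the sample below) are properly excluded from the output, and no pre-sorting is required. The output order is not guaranteed to mirror the input order: the first occurrence of a value is held back until its second occurrence is seen.

To use this on column 15, change $(_ = 1) to $(_ = 15).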


echo '4
4
2
17
2
4
12
6
3
7
13
11
7
10
10
13
5
11
2
11' | gtee >( gsort -n | uniq -c >&2; ) | gcat - 

   3 2
   1 3
   3 4
   1 5
   1 6
   2 7
   2 10
   3 11

mawk '(__ = ___[$(_ = 1)])=="" ? \
      (NF =(___[$_] = $!_)<__) : __==(____ = "\6") || 
               ($!_ = __ ORS $!_)^(___[$_] = ____)'

 1  4
 2  4
 3  2
 4  2
 5  4
 6  7
 7  7
 8  10
 9  10
10  13
11  13
12  11
13  11
14  2
15  11
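The mawk one-liner above is terse, so here is a more readable sketch of what appears to be the same single-pass idea: hold the first line seen for each key, release it when the key shows up a second time, and print every later occurrence directly. The function name dups_single_pass and the arrays count and held are illustrative names, not part of the original answer, and the default whitespace field separator is assumed.

```shell
# dups_single_pass COL: read lines on stdin and print every line whose
# field COL occurs more than once. Hypothetical helper name.
dups_single_pass() {
    awk -v col="$1" '
    {
        key = $col
        n = ++count[key]                 # how often this key has been seen so far
        if (n == 1) {                    # first sighting: hold the line back
            held[key] = $0
            next
        }
        if (n == 2) {                    # second sighting: release the held line
            print held[key]
            delete held[key]
        }
        print                            # second and later sightings print as they arrive
    }'
}
```

For the question's file this would be called as dups_single_pass 15 < file.txt. Lines whose key never repeats stay in held and are never printed, which is why no sorting is needed, at the cost of keeping one line per distinct key in memory.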

Upvotes: 0

islander

Reputation: 11

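Short version. Reading the file twice lets awk count every column 15 value first, so the first occurrence of a duplicated value is kept as well (seen[$15]++ on its own would drop it):

awk 'NR==FNR {count[$15]++; next} count[$15] > 1' file.txt file.txt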

Upvotes: 1

Barmar

Reputation: 781096

Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.

awk -F'\t' 'NR==FNR { dup[$0]; next; } 
     $15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt

awk -F'\t' '{print $15}' file.txt | sort | uniq -d returns a list of all the duplicate values in column 15.

The NR==FNR line in the outer awk script is true only while the first input (that list) is being read, and it stores each value as a key in the associative array dup.

The second line processes file.txt and prints any lines where column 15 is in the array.
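The question also asks for the surviving lines in random order, which the command above does not do on its own. A minimal sketch that wraps the same uniq -d approach and pipes the result through shuf follows. The function name dup_lines_shuffled is hypothetical, tab-separated input is assumed as in the command above, and shuf is assumed to be installed (it ships with GNU coreutils). The list of duplicate values is passed on stdin via "-" instead of process substitution, so the sketch does not depend on bash.

```shell
# dup_lines_shuffled FILE COL: print, in random order, every line of FILE
# whose tab-separated field COL occurs more than once. Hypothetical helper name.
dup_lines_shuffled() {
    # Stage 1: extract the column, sort it, keep one copy of each repeated value.
    awk -F'\t' -v c="$2" '{ print $c }' "$1" | sort | uniq -d |
    # Stage 2: load those values from stdin ("-"), then filter FILE against them.
    awk -F'\t' -v c="$2" 'NR==FNR { dup[$0]; next } $c in dup' - "$1" |
    # Stage 3: shuffle the surviving lines.
    shuf
}
```

For the question's file this would be dup_lines_shuffled file.txt 15 > newfile.txt. Writing to a new file rather than back to file.txt matters here, because the function reads file.txt twice.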

Upvotes: 3
