Reputation: 105
I have a large file with 100k lines and about 22 columns. I would like to remove all lines in which the content in column 15 appears only once. As far as I understand, it's the reverse of
sort -u file.txt
After the lines that are unique in column 15 are removed, I would like to shuffle all lines again, so nothing is sorted. For this I would use
shuf file.txt
The resulting file should include only lines that have at least one duplicate (in column 15), with the lines in random order.
I have tried to work around sort -u, but it discards the duplicate lines I actually need to keep. Not only do I want the unique lines removed, I also want to keep every line of a duplicated value, not just one representative per duplicate.
Thank you.
Upvotes: 3
Views: 4912
Reputation: 2821
Values that appear only once, such as 3, 5, and 6, are properly excluded from the output; no pre-sorting is required, but output ordering is not guaranteed to mirror input ordering:
-- to key on column 15 instead of column 1, change $(_ = 1) to $(_ = 15)
echo '4
4
2
17
2
4
12
6
3
7
13
11
7
10
10
13
5
11
2
11' | gtee >( gsort -n | uniq -c >&2; ) | gcat -
3 2
1 3
3 4
1 5
1 6
2 7
2 10
3 11
1 12
2 13
1 17
mawk '(__ = ___[$(_ = 1)])=="" ? \
(NF =(___[$_] = $!_)<__) : __==(____ = "\6") ||
($!_ = __ ORS $!_)^(___[$_] = ____)'
1 4
2 4
3 2
4 2
5 4
6 7
7 7
8 10
9 10
10 13
11 13
12 11
13 11
14 2
15 11
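If the golfed one-liner above is hard to follow, here is an un-golfed sketch of what I take to be the same single-pass idea (my reading of it, not the author's exact logic), keyed on column 15 as in the question and assuming whitespace-separated columns: buffer the first line seen for each key, flush it when a second line with that key shows up, and print later duplicates right away. As noted above, output order will not match input order.
awk '{
    k = $15                   # key on column 15
    if (!(k in cnt)) {        # first occurrence: buffer the line, print nothing yet
        first[k] = $0
    } else if (k in first) {  # second occurrence: flush the buffered line, then this one
        print first[k]
        delete first[k]
        print
    } else {                  # third and later occurrences: print immediately
        print
    }
    cnt[k]++
}' file.txt
Piping that output through shuf then gives the random order the question asks for.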
Upvotes: 0
Reputation: 781096
Use uniq -d to get a list of all the duplicate values, then filter the file so only those lines are included.
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt > newfile.txt
awk -F'\t' '{print $15}' file.txt | sort | uniq -d
returns a list of all the duplicate values in column 15.
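As a toy illustration (made-up values, not from the question's file), sort | uniq -d keeps one copy of each value that occurs more than once in its input:
printf '%s\n' 4 2 4 7 2 9 | sort | uniq -d
which prints:
2
4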
The NR==FNR line in the main awk script reads that list (the first input, supplied via process substitution) into an associative array. The second line processes file.txt and prints any line whose column 15 is in the array.
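If you also want the random order the question asks for, one option (assuming GNU shuf, which the question already plans to use) is to pipe the filtered output through shuf before writing it out:
awk -F'\t' 'NR==FNR { dup[$0]; next; }
$15 in dup' <(awk -F'\t' '{print $15}' file.txt | sort | uniq -d) file.txt | shuf > newfile.txt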
Upvotes: 3