Reputation: 822

Using specific columns, output rows that are present 3 times in a text file

I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.

chr1    1   A   T   sample1
chr1    3   G   C   sample1
chr2    1   G   C   sample1
chr2    2   T   A   sample1
chr3    4   T   A   sample1
chr1    1   A   T   sample2
chr2    3   T   A   sample2
chr3    4   T   A   sample2
chr1    1   A   T   sample3
chr2    1   G   C   sample3
chr3    4   T   A   sample3
chr1    1   A   T   sample4
chr2    1   G   C   sample4
chr5    1   A   T   sample4
chr5    2   G   C   sample4

If a row appears three times I want to add two columns for the other two samples that it appears in so the output from above would look like this:

chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

I would do this in R but the file is too large to read in so I am looking for a solution that would work in linux. I have been looking into awk but cannot find anything for this exact situation.

The file is not currently sorted.

Thanks in advance!

edit: Thanks for all these informative answers. I selected the one that was most familiar to how I am used to working but the other answers look great too and I will learn from them.

Upvotes: 0

Answers (3)

glenn jackman

Reputation: 246992

For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)

#!/bin/sh

file="$1"

# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF

Then:

$ ./demo.sh input.tsv
chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

Upvotes: 3

Ed Morton

Reputation: 203807

This may be what you're looking for:

$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
    prt()
    cnt = samples = ""
    prev = curr
}
{ samples = (cnt++ ? samples " " : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev samples }

$ sort -k1,4 file | awk -f tst.awk
chr2    1   G   C   sample1 sample3 sample4
chr3    4   T   A   sample1 sample2 sample3

sort uses paging etc. to handle input that's too large to fit in memory so it will successfully handle larger input than other tools can handle and the awk script is storing almost nothing in memory.

Upvotes: 2

Freddy

Reputation: 4698

Using GNU datamash, tr and awk assuming that input and output are tab-separated:

$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr3    4       T       A       sample1 sample2 sample3

First, use datamash to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field. The output would look like this:

$ datamash -s -g1,2,3,4  collapse 5 < file
chr1    1       A       T       sample1,sample2,sample3,sample4
chr1    3       G       C       sample1
chr2    1       G       C       sample1
chr2    2       G       C       sample3,sample4
chr2    2       T       A       sample1
chr2    3       T       A       sample2
chr3    4       T       A       sample1,sample2,sample3
chr5    1       A       T       sample4
chr5    2       G       C       sample4

Then pipe the output to tr to convert the commas to tabs and finally use awk to print the rows with seven fields.

Using awk:

awk '
  BEGIN{ FS=OFS="\t" }
  {
    idx=$1 FS $2 FS $3 FS $4
    cnt[idx]++
    data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
  }
  END{
    for (i in cnt)
      if (cnt[i]==3) print i, data[i]
  }
' file

Maintain two arrays using the first four fields as index.
The first increments a counter whenever a record with the same index is encountered and the second appends the 5th field using a tab as separator.

In the end block, loop over the cnt array and print the index and the value of the data array if the count is three.

Upvotes: 3

Using specific columns, output rows that are present 3 times in a text file

Answers (3)

Related Questions