Reputation: 822
I have a text file and want to output rows where the first 4 columns appear exactly three times in the file.
chr1 1 A T sample1
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 T A sample1
chr3 4 T A sample1
chr1 1 A T sample2
chr2 3 T A sample2
chr3 4 T A sample2
chr1 1 A T sample3
chr2 1 G C sample3
chr3 4 T A sample3
chr1 1 A T sample4
chr2 1 G C sample4
chr5 1 A T sample4
chr5 2 G C sample4
If a row appears three times I want to add two columns for the other two samples that it appears in so the output from above would look like this:
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
I would do this in R but the file is too large to read in so I am looking for a solution that would work in linux. I have been looking into awk but cannot find anything for this exact situation.
The file is not currently sorted.
Thanks in advance!
edit: Thanks for all these informative answers. I selected the one that was most familiar to how I am used to working but the other answers look great too and I will learn from them.
Upvotes: 0
Views: 134
Reputation: 246992
For fun, a solution using sqlite (Wrapped in a shell script that takes the data file as its only argument)
#!/bin/sh
file="$1"
# Consider loading your data into a persistent db if doing a lot of work
# on it, instead of a temporary one like this.
sqlite3 -batch -noheader <<EOF
.mode tabs
CREATE TEMP TABLE data(c1, c2 INTEGER, c3, c4, c5);
.import "$file" data
-- Not worth making an index for a one-off run, but for
-- repeated use would come in handy.
-- CREATE INDEX data_idx ON data(c1, c2, c3, c4);
SELECT c1, c2, c3, c4, group_concat(c5, char(9)/*tab*/)
FROM data
GROUP BY c1, c2, c3, c4
HAVING count(*) = 3
ORDER BY c1, c2, c3, c4;
EOF
Then:
$ ./demo.sh input.tsv
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
Upvotes: 3
Reputation: 203807
This may be what you're looking for:
$ cat tst.awk
BEGIN { FS=OFS="\t" }
{ curr = $1 FS $2 FS $3 FS $4 }
curr != prev {
prt()
cnt = samples = ""
prev = curr
}
{ samples = (cnt++ ? samples " " : "") $5 }
END { prt() }
function prt() { if ( cnt == 3 ) print prev samples }
.
$ sort -k1,4 file | awk -f tst.awk
chr2 1 G C sample1 sample3 sample4
chr3 4 T A sample1 sample2 sample3
sort
uses paging etc. to handle input that's too large to fit in memory so it will successfully handle larger input than other tools can handle and the awk
script is storing almost nothing in memory.
Upvotes: 2
Reputation: 4698
Using GNU datamash
, tr
and awk
assuming that input and output are tab-separated:
$ datamash -s -g1,2,3,4 collapse 5 < file | tr ',' '\t' | awk 'NF==7'
chr3 4 T A sample1 sample2 sample3
First, use datamash
to sort the input file, group on the first four fields and collapse the values (comma-separated) on the 5th field.
The output would look like this:
$ datamash -s -g1,2,3,4 collapse 5 < file
chr1 1 A T sample1,sample2,sample3,sample4
chr1 3 G C sample1
chr2 1 G C sample1
chr2 2 G C sample3,sample4
chr2 2 T A sample1
chr2 3 T A sample2
chr3 4 T A sample1,sample2,sample3
chr5 1 A T sample4
chr5 2 G C sample4
Then pipe the output to tr
to convert the commas to tabs and finally use awk
to print the rows with seven fields.
Using awk
:
awk '
BEGIN{ FS=OFS="\t" }
{
idx=$1 FS $2 FS $3 FS $4
cnt[idx]++
data[idx]=(cnt[idx]==1 ? "" : data[idx] OFS) $5
}
END{
for (i in cnt)
if (cnt[i]==3) print i, data[i]
}
' file
Maintain two arrays using the first four fields as index.
The first increments a counter whenever a record with the same index is encountered and the second appends the 5th field using a tab as separator.
In the end block, loop over the cnt
array and print the index and the value of the data
array if the count is three.
Upvotes: 3