Reputation: 161
Ive a text file with data in 7 columns in this format:
18030 AAJ51 FTO rs9939609 C__30090620_10 A T
18030 AAJ51 CAT rs1001179 C__11468118_10 C C
18030 AAJ51 CCL2 rs1024611 C___2590362_10 G G
18030 AAJ51 TAS2R38 rs10246939 C___9506826_10 C C
20287 AAJ51 FTO rs9939609 C__30090620_10 A T
20287 AAJ51 CAT rs1001179 C__11468118_10 C C
20287 AAJ51 CCL2 rs1024611 C___2590362_10 A G
20287 AAJ51 TAS2R38 rs10246939 C___9506826_10 T T
The 2nd, 3rd 4th and 5th columns are constant and repeat.
The variables are the 1st, 6th and 7th columns.
I would like to transpose the data in this way:
FTO CAT CCL2 TAS2R38
rs9939609 rs1001179 rs1024611 rs10246939
18030 AT CC GG AT
20287 AT CC AG TT
Whilst the example shows 4 rows per ID (the 5-digit number in first column is the ID) the actual file has 128 rows per ID so performing a match or regex would not be practical and prefer a method that iterates over a number of rows.
I saw this example on converting n number of rows but am unsure how to modify for this application.
UPDATE: CRLF endings can cause formatting problems which can be resolved using a tool like dos2unix
Upvotes: 2
Views: 374
Reputation: 92904
GNU Awk
solution:
awk '{
if (!keys[$3]++) { b[++c] = $3; row1 = row1 OFS $3; row2 = row2 OFS $4 }
line = groups[$1][$3];
groups[$1][$3] = (line == ""? $6$7: line OFS $6$7)
}
END{
print row1 ORS row2;
for (i in groups) {
r = i;
for (j in b) r = r OFS groups[i][b[j]];
print r
}
}' OFS='\t' file | column -txn
The output:
FTO CAT CCL2 TAS2R38
rs9939609 rs1001179 rs1024611 rs10246939
18030 AT CC GG CC
20287 AT CC AG TT
Upvotes: 3
Reputation: 54393
Try using a last_seen variable and an array.
Upvotes: 1