Reputation: 313
I have been able to do what I want by running one command per line, but I know there must be a more elegant way to do it. Please tell me what your methods are; I would like to learn more sophisticated ways of processing text files.
The original file is a VCF file that looks like this:
##fileformat=VCFv4.2
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20180307
##source=PLINKv1.90
##contig=<ID=1,length=249214117>
##contig=<ID=2,length=242842533>
##contig=<ID=3,length=197896741>
...
...
...
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
22 16258171 22:16258171:D:3 A . . . . GT
22 16258174 22:16258174:T:C T . . . . GT
22 16258183 22:16258183:A:T A . . . . GT
22 16258189 22:16258189:G:T G . . . . GT
My goal is to generate a file that looks like this:
22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T
22 16258211 16258211 A G
22 16258211 16258211 A T
22 16258220 16258220 T G
22 16258221 16258221 C T
22 16258224 16258224 C T
22 16258227 16258227 G A
I used the following steps to get there, but they are cumbersome and ugly:
#remove comments
sed '/^[[:blank:]]*#/d;s/#.*//' chr22.vcf > no_comment_chr22.vcf
#take out the third columns for splitting
cut -d $'\t' -f 3 no_comment_chr22.vcf > no_comment_chr22.col3_to_split.txt
#Split string by delimiter and get N-th element, use as col4
cut -d':' -f3 no_comment_chr22.col3_to_split.txt > chr22_as_col4.txt
#Split string by delimiter and get N-th element, use as col5
cut -d':' -f4 no_comment_chr22.col3_to_split.txt > chr22_as_col5.txt
#get first 2 columns
cut -d $'\t' -f 1-2 no_comment_chr22.vcf > no_comment_chr22.col1to2.txt
#get the second column as col3
cut -d $'\t' -f 2 no_comment_chr22.vcf > no_comment_chr22.ascol3.txt
#Combine files column-wise
paste no_comment_chr22.col1to2.txt no_comment_chr22.ascol3.txt chr22_as_col4.txt chr22_as_col5.txt | column -s $'\t' -t > chr22_input_5cols.txt
I was able to get what I need, but this is so ugly. Please tell me what people do to advance their text-processing skills, and how to improve things like this. Thank you!
Upvotes: 0
Views: 89
Reputation: 69396
Using awk:
awk -F'(:| +)' '/^#/ {next} {print $1,$2,$4,$5,$6}' sample.vcf
22 16258171 16258171 D 3
22 16258174 16258174 T C
22 16258183 16258183 A T
22 16258189 16258189 G T
This specifies a regular expression as the field delimiter (-F), so each line is split on colons and on runs of spaces. It then skips the comment lines (/^#/) and prints the corresponding fields (1, 2, 4, 5, 6) from every other line.
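If the real file is tab-delimited (the cut commands in the question suggest it is), the " +" part of that delimiter will not split on tabs. Here is a minimal sketch of a variant that splits on tabs and then breaks up the ID column with split(). It assumes the ID column always has the form CHROM:POS:REF:ALT, and the function name vcf_to_5cols is made up for illustration:

```shell
# Hypothetical helper; assumes a tab-delimited VCF whose third column
# (ID) looks like CHROM:POS:REF:ALT, as in the sample in the question.
vcf_to_5cols() {
    awk -F'\t' -v OFS='\t' '
        /^#/ { next }              # skip meta lines and the column header
        {
            split($3, id, ":")     # id[3] and id[4] are the last two parts of the ID
            print $1, $2, $2, id[3], id[4]
        }
    ' "$1"
}

# Usage: vcf_to_5cols chr22.vcf > chr22_input_5cols.txt
```

The output is tab-separated; to get aligned columns like the paste pipeline in the question, it can be piped through column -s $'\t' -t.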
Upvotes: 1
Reputation: 2491
You can try with this sed
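sed -E '
# delete the meta lines and the column header
/^#/d
# keep the first two columns, drop the text before the first colon, keep the next three colon-prefixed parts, discard the rest
s/(([0-9]*[[:blank:]]*){2})[^:]*((:[^:[[:blank:]]*){3}).*/\1\3/
# turn the remaining colons into spaces
s/:/ /g
# squeeze each run of blanks into a single space
s/[[:blank:]]{1,}/ /g
' infile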
Upvotes: 0