Ziv Attia

Reputation: 57

How to pull data from a VCF table

I have two files. SCR_location lists SNP locations in ascending order:

 19687

 36075

 n...

modi_VCF is a VCF table that has information about every SNP:

  19687  G A     xxx:255,0,195 xxx:255,0,206

  20398  G C     0/0:0,255,255 0/0:0,208,255

  n...

I want to save only the lines of modi_VCF whose SNP location appears in SCR_location into a new file. I wrote the following script, but it doesn't work:

cat SCR_location |while read SCR_l; do
    cat modi_VCF |while read line; do

            if  [ "$SCR_l" -eq "$line" ] ;
            then echo "$line" >> file
            else :
            fi

    done

done

Upvotes: 0

Views: 117

Answers (1)

tshiono

Reputation: 22087

Would you please try a bash solution:

declare -A seen
while read -r line; do
    seen[$line]=1
done < SCR_location

while read -r line; do
    read -ra ary <<< "$line"
    if [[ ${seen[${ary[0]}]} ]]; then
        echo "$line"
    fi
done < modi_VCF > file
  • It first iterates over SCR_location and stores each SNP location as a key in the associative array seen.
  • Next it scans modi_VCF and prints every line whose first column value is found in the associative array.
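
For comparison, the loop in the question fails for two reasons that the array lookup avoids: `-eq` is an integer test, and `$line` holds the whole VCF row rather than just the location, so `[ "$SCR_l" -eq "$line" ]` is never true. A minimal sketch of that nested loop with the comparison fixed is shown here. The sample rows are copied from the question; the temporary directory and the output name `matched` are placeholders chosen for this illustration. It rereads modi_VCF once per location, so it is likely to be much slower than the array version on large files.

```shell
# Sample data modeled on the question; "$tmp" and "matched" are
# placeholders for this sketch, not names from the original post.
tmp=$(mktemp -d)
printf '%s\n' 19687 36075 > "$tmp/SCR_location"
printf '%s\n' \
    '19687  G A     xxx:255,0,195 xxx:255,0,206' \
    '20398  G C     0/0:0,255,255 0/0:0,208,255' > "$tmp/modi_VCF"

: > "$tmp/matched"                      # start with an empty output file
while read -r pos; do
    # Re-read modi_VCF for every location: simple, but quadratic.
    while read -r line; do
        first=${line%%[[:space:]]*}     # text before the first whitespace
        if [ "$pos" = "$first" ]; then  # string comparison on column 1 only
            printf '%s\n' "$line" >> "$tmp/matched"
        fi
    done < "$tmp/modi_VCF"
done < "$tmp/SCR_location"

cat "$tmp/matched"
```

With this sample data, only the 19687 row ends up in `matched`.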

If awk is an option, you can also say:

awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) print}' SCR_location modi_VCF > file
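
The one-liner relies on the common two-file awk idiom: `NR` counts records across all input files while `FNR` restarts at 1 for each file, so `NR==FNR` is true only while the first file is being read. Below is a commented sketch of the same logic run on the sample rows from the question. The temporary directory is an assumption made for the illustration. One caveat: if SCR_location is empty, `NR==FNR` also holds for modi_VCF, and nothing is printed.

```shell
# Sample data modeled on the question; "$tmp" is a placeholder directory.
tmp=$(mktemp -d)
printf '%s\n' 19687 36075 > "$tmp/SCR_location"
printf '%s\n' \
    '19687  G A     xxx:255,0,195 xxx:255,0,206' \
    '20398  G C     0/0:0,255,255 0/0:0,208,255' > "$tmp/modi_VCF"

awk '
    # First file (NR equals FNR): remember each location, then skip
    # the remaining rules for this record.
    NR == FNR { seen[$1]++; next }
    # Second file: a bare condition prints the record when it is true,
    # i.e. when column 1 was stored while reading the first file.
    seen[$1]
' "$tmp/SCR_location" "$tmp/modi_VCF" > "$tmp/file"

cat "$tmp/file"
```

With this sample data, only the 19687 row is written to `file`.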

[Edit] To output the unmatched lines instead, just negate the condition:

awk 'NR==FNR {seen[$1]++; next} {if (!seen[$1]) print}' SCR_location modi_VCF > file_unmatched

The code above outputs the unmatched lines only. If you want to split the matched and unmatched lines into separate files in one pass, please try:

awk 'NR==FNR {seen[$1]++; next} {if (seen[$1]) {print >> "file_matched"} else {print >> "file_unmatched"} }' SCR_location modi_VCF
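
Note that `>>` appends, so running the command a second time adds the same lines to the existing file_matched and file_unmatched. Inside awk, `print > "name"` truncates the file only when it is first opened and keeps writing to it for the rest of the run, so it can be used when each run should start from empty files. The sketch below assumes that behavior is what you want; the temporary directory, the `split_vcf` function, and the awk variables `m` and `u` are names invented for the illustration.

```shell
# Sample data modeled on the question; "$tmp", split_vcf, m and u are
# hypothetical names used only in this sketch.
tmp=$(mktemp -d)
printf '%s\n' 19687 36075 > "$tmp/SCR_location"
printf '%s\n' \
    '19687  G A     xxx:255,0,195 xxx:255,0,206' \
    '20398  G C     0/0:0,255,255 0/0:0,208,255' > "$tmp/modi_VCF"

split_vcf() {
    # ">" truncates each output file once per awk run, then appends
    # every later print from the same run to it.
    awk -v m="$tmp/file_matched" -v u="$tmp/file_unmatched" '
        NR == FNR { seen[$1]++; next }
        { if (seen[$1]) { print > m } else { print > u } }
    ' "$tmp/SCR_location" "$tmp/modi_VCF"
}

split_vcf
split_vcf    # a second run overwrites the files instead of duplicating lines

wc -l < "$tmp/file_matched"
wc -l < "$tmp/file_unmatched"
```

After both runs, each output file still holds a single line.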

Hope this helps.

Upvotes: 1
