Reputation: 97
I have two files both of which are tab delimited. One of the file is almost 800k lines and it is a An Exonic Coordinates file and the other file is almost 200k lines (It is a VCF File).
I am writing a code in python to find and filter the position in the VCF that is within an exonic coordinates (Exon Start and End from Exonic Coordinates File) and writes it to a file.
However, because the files are big, it took a couple of days to get the filtrated output file?
So the code below is partially solve the issue of speed but the problem is to figure out is to speed the filtration process which is why I used a break to exit the second loop and I want to start from the beginning of the outer loop instead taking the next element from the first loop (outer loop)?
Here is my code:
import
import sys
list_coord = []
with open('ref_ordered.txt', 'rb') as csvfile:
reader = csv.reader(csvfile, delimiter='\t')
for row in reader:
list_coord.append((row[0],row[1],row[2]))
def parseVcf(vcf,src):
done = False
with open(vcf,'r') as f:
reader=csv.reader((f),delimiter='\t')
vcf_out_split = vcf.split('.')
vcf_out_split.insert(2,"output_CORRECT2")
outpt = open('.'.join(vcf_out_split),'a')
for coord in list_coord:
for row in reader:
if '#' not in row[0]:
coor_genom = int(row[1])
coor_exon1 = int(coord[1])+1
coor_exon2 = int(coord[2])
coor_genom_chr = row[0]
coor_exon_chr = coord[0]
ComH = row[7].split(';')
for x in ComH:
if 'DP4=' in x:
DP4_split=x[4:].split(',')
if (coor_exon1 <= coor_genom <= coor_exon2):
if (coor_genom_chr == coor_exon_chr):
if ((int(DP4_split[2]) >= 1 and int(DP4_split[3]) >= 1)):
done = True
outpt.write('\t'.join(row) + '\n')
if done:
break
outpt.close()
for root,dirs,files in os.walk("."):
for file in files:
pathname=os.path.join(root,file)
if file.find("1_1")==0:
print "Parsing " + file
parseVcf(pathname, "1_1")
ref_ordered.txt:
1 69090 70008
1 367658 368597
1 621095 622034
1 861321 861393
1 865534 865716
1 866418 866469
1 871151 871276
1 874419 874509
1_1 Input File:
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT directory
1 14907 rs79585140 A G 20 . DP=10;VDB=5.226464e-02;RPB=-6.206015e-01;AF1=0.5;AC1=1;DP4=1,2,5,2;MQ=32;FQ=20.5;PV4=0.5,0.07,0.16,0.33;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.312;CADD=1.415;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=61 GT:PL:GQ 0/1:50,0,51:50
1 14930 rs75454623 A G 44 . DP=9;VDB=7.907652e-02;RPB=3.960091e-01;AF1=0.5;AC1=1;DP4=1,2,6,0;MQ=41;FQ=30.9;PV4=0.083,1,0.085,1;DN=131;DA=A/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.000;CG=-1.440;CADD=1.241;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=38 GT:PL:GQ 0/1:74,0,58:61
1 15211 rs78601809 T G 9.33 . DP=6;VDB=9.014600e-02;RPB=-8.217058e-01;AF1=1;AC1=2;DP4=1,0,3,2;MQ=21;FQ=-37;PV4=1,0.35,1,1;DN=131;DA=T/G;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-0.145;CADD=1.611;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-frequency,by-cluster;DSP=171 GT:PL:GQ 1/1:41,10,0:13
1 16146 . A C 25 . DP=10;VDB=2.063840e-02;RPB=-2.186229e+00;AF1=0.5;AC1=1;DP4=7,0,3,0;MQ=39;FQ=27.8;PV4=1,0.0029,1,0.0086;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=unknown;CP=0.001;CG=-0.555;CADD=2.158;AA=A;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=197 GT:PL:GQ 0/1:55,0,68:58
1 16257 rs78588380 G C 40 . DP=18;VDB=9.421102e-03;RPB=-1.327486e+00;AF1=0.5;AC1=1;DP4=3,11,4,0;MQ=50;FQ=43;PV4=0.011,1,1,1;DN=131;DA=G/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.001;CG=-2.500;CADD=0.359;AA=G;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DSP=308 GT:PL:GQ 0/1:70,0,249:73
1 16378 rs148220436 T C 39 . DP=7;VDB=2.063840e-02;RPB=-9.980746e-01;AF1=0.5;AC1=1;DP4=0,4,0,3;MQ=50;FQ=42;PV4=1,0.45,1,1;DN=134;DA=T/C;GM=NR_024540.1;GL=WASH7P;FG=intron;FD=intron-variant;CP=0.016;CG=-2.880;CADD=0.699;AA=T;CN=dgv1e1,dgv2n71,dgv3e1,esv27265,nsv428112,nsv7879;DV=by-cluster;DSP=227 GT:PL:GQ 0/1:69,0,90:72
OUTPUT File:
1 877831 rs6672356 T C 44.8 . DP=2;VDB=6.720000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;DN=116;DA=T/C;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=missense,missense,missense,missense,missense,intron;FD=unknown;AAC=TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,TRP/ARG,none;PP=343/682,343/715,328/667,327/666,234/573,NA;CDP=1027,1027,982,979,700,NA;GS=101,101,101,101,101,NA;PH=0;CP=0.994;CG=2.510;CADD=0.132;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-cluster,by-1000G;DSP=38;CPG=875731-878363;GESP=C:8470/T:0;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 878000 . C T 44.8 . DP=2;VDB=7.520000e-02;AF1=1;AC1=2;DP4=0,0,1,1;MQ=50;FQ=-33;GM=NM_152486.2,XM_005244723.1,XM_005244724.1,XM_005244725.1,XM_005244726.1,XM_005244727.1;GL=SAMD11;FG=synonymous,synonymous,synonymous,synonymous,synonymous,intron;FD=unknown;AAC=LEU,LEU,LEU,LEU,LEU,none;PP=376/682,376/715,361/667,360/666,267/573,NA;CDP=1126,1126,1081,1078,799,NA;CP=0.986;CG=3.890;CADD=2.735;AA=C;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DSP=62;CPG=875731-878363;PAC=NP_689699.2,XP_005244780.1,XP_005244781.1,XP_005244782.1,XP_005244783.1,NA GT:PL:GQ 1/1:76,6,0:10
1 881627 rs2272757 G A 205 . DP=9;VDB=1.301207e-01;AF1=1;AC1=2;DP4=0,0,5,4;MQ=50;FQ=-54;DN=100;DA=G/A;GM=NM_015658.3,XM_005244739.1;GL=NOC2L;FG=synonymous;FD=synonymous-codon,unknown;AAC=LEU;PP=615/750,615/755;CDP=1843;CP=0.082;CG=5.170;CADD=0.335;AA=G;CN=dgv10n71,dgv2n67,dgv3e1,dgv8n71,dgv9n71,essv2408,essv4734,nsv10161,nsv428334,nsv509035,nsv517709,nsv832980,nsv871547,nsv871883;DG;DV=by-frequency,by-cluster,by-1000G;DSP=40;GESP=A:6174/G:6830;PAC=NP_056473.2,XP_005244796.1 GT:PL:GQ 1/1:238,27,0:51
Upvotes: 0
Views: 165
Reputation: 104712
If both of your files are ordered, you can save a lot of time by iterating over them in parallel, always advancing the one with lowest coordinates. This way you will only handle each line once, not many times.
Here's a basic version of your code that only does the coordinate checking (I don't fully understand your DP4
condition, so I'll leave it to you to add that part back in):
with open(coords_fn) as coords_f, open(vcf_fn) as vcf_f, open(out_fn) as out_f:
coords = csv.reader(coords_f, delimiter="\t")
vcf = csv.reader(vcf_f, delimiter="\t")
out = csv.writer(out_f, delimiter="\t")
next(vcf) # discard header row, or use out.writeline(next(vcf)) to preserve it!
try:
c = next(coords)
r = next(vcf)
while True:
if int(c[1]) >= int(r[1]): # vcf file is behind
r = next(vcf)
elif int(c[2]) < int(r[1]): # coords file is behind
c = next(coords)
else: # int(c[1]) < int(r[1]) <= int(c[2])
out.writeline(r) # add DP4 check here, and indent this line under it
r = next(vcf) # don't indent this line
except StopIteration: # one of the files has ended
pass
Upvotes: 0
Reputation: 366
First of all, I did not include any code because it looks like homework to me (I have had homework like this). I will however try to explain the steps I took to improve my scripts, even though I know my solutions are far from perfect.
your script could be slow because for every line in your csv file you open, write and close your output file. Try to make a list of lines you want to add to the output file, and after you are done with reading and filtering, then start writing.
You also might want to consider to write functions per filter and call these functions with the line as variable. That way you can easily add filters later on. I use a counter to keep track of the amount of succeeded filters and if in the end counter == len(amountOfUsedFilers)
I add my line to the list.
Also, why do you use outpt = open('.'.join(vcf_out_split),'a')
and with open(vcf,'r') as f:
try to be consistent and smart in your choices.
Bioinformatics for the win!
Upvotes: 1