Reputation: 123
I am trying to filter out the rows that has certain value that matches the value from other file. I'd appreciate for a help.
My data looks like this:
File1:
Group Position Code Answer c1 c2 c3 c4
1 3 s1_60 A etc etc etc etc
2 4 s2_63 T etc2_ etc2 etc2/ etc2'
3 5 s1_23 A etc3 etc3 etc3* etc3
3 51 s7_52 T etc4 etc4_ etc4 etc4^
File2:
>1
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG
>2
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG
>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG
'Group' refers to the number after '>' on 'File2' while 'Position' refers to the position of the letter under the designated group. I want to keep only the rows that have the matching letter from 'File2' in 'Answer' column.
Therefore, output would look like this:
newOutput:
Group Position Code Answer c1 c2 c3 c4
2 4 s2_63 T etc2_ etc2 etc2/ etc2'
3 5 s1_23 A etc3 etc3 etc3* etc3
3 51 s7_52 T etc4 etc4_ etc4 etc4^
First line on 'File1' was not included because it has 'A' instead of 'K'
I would appreciate for any help. I am thinking about starting with either awk or python. I have never organized data with multiple files involved so it is kind of frustrating for me. Please advise me.
Upvotes: 0
Views: 97
Reputation: 114035
import csv
with open("File2") as infile:
d = {}
bases = ''
group = None
for line in infile:
line = line.strip()
if line.startswith(">"):
if group is not None:
d[group] = bases
group = int(line[1:])
bases = ''
continue
bases += line
d[group] = bases.upper()
with open("File1") as infile, open('output', 'w') as outfile:
reader = csv.reader(infile, delimiter='\t')
writer = csv.writer(outfile, delimiter='\t')
writer.writerow(next(reader))
for g, pos, code, answer, *rest in reader:
g = int(g)
pos = int(pos)
if d[g][pos-1] == ans:
writer.writerow([g, pos, code, answer] + rest)
Upvotes: 1
Reputation: 7329
Here's an awk solution:
BEGIN {
GROUP=1;
BASE=2;
}
NR == FNR {
positions[$1"_"$2]=toupper($3)
}
NR != FNR {
if($0 ~ /^>/) {
group=substr($0, 2, length($0));
} else {
gsub(" ", "", $0);
seqs[group]=seqs[group]$0;
}
}
END {
print "Group","Position","Answer"
for(current_group in seqs) {
for(key in positions) {
split(key,position,"_");
if(position[GROUP] == current_group) {
if(toupper(substr(seqs[group],position[BASE],1)) \
== positions[key]) {
print position[GROUP],
position[BASE],
positions[key];
}
}
}
}
}
awk -f script.awk File1 File2
output:
Group Position Answer
2 4 T
3 5 A
Position 51 of group 3 appears to be a G
, not a T
, so my output is different from yours.
Upvotes: 1