user3557715
user3557715

Reputation: 123

Filtering out the rows in certain condition

I am trying to filter out the rows that has certain value that matches the value from other file. I'd appreciate for a help.

My data looks like this:

File1:

  Group   Position Code     Answer  c1     c2    c3    c4   
  1       3        s1_60    A       etc    etc   etc   etc
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

File2:

>1
ATGCGCGCGCGCGATATATTGCTGATATATATGCCTTttaagatcaatat
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGCGAGAGAGAGAGAtgtgttgtagataGACGAG
>2
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCaaaaaaGAGAGAGAGAGAtgtgttgtagataGACG
>3
tattagccccatgtgttgaagaacaaatctctctgttaaacagaaattgg
gggggaaaataaacaggggggcaaataattctgactacaattgtatatat
ggatatattGCGCGCGCGccggcgcgcgAGAtgtgttgtagataGACGAG

'Group' refers to the number after '>' on 'File2' while 'Position' refers to the position of the letter under the designated group. I want to keep only the rows that have the matching letter from 'File2' in 'Answer' column.

Therefore, output would look like this:

newOutput:

Group   Position  Code      Answer  c1     c2    c3    c4
  2       4        s2_63    T       etc2_  etc2  etc2/ etc2'
  3       5        s1_23    A       etc3   etc3  etc3* etc3
  3       51       s7_52    T       etc4   etc4_ etc4  etc4^

First line on 'File1' was not included because it has 'A' instead of 'K'

I would appreciate for any help. I am thinking about starting with either awk or python. I have never organized data with multiple files involved so it is kind of frustrating for me. Please advise me.

Upvotes: 0

Views: 97

Answers (2)

inspectorG4dget
inspectorG4dget

Reputation: 114035

import csv

with open("File2") as infile:
    d = {}
    bases = ''
    group = None
    for line in infile:
        line = line.strip()
        if line.startswith(">"):
            if group is not None:
                d[group] = bases
            group = int(line[1:])
            bases = ''
            continue
        bases += line
    d[group] = bases.upper()

with open("File1") as infile, open('output', 'w') as outfile:
    reader = csv.reader(infile, delimiter='\t')
    writer = csv.writer(outfile, delimiter='\t')
    writer.writerow(next(reader))
    for g, pos, code, answer, *rest in reader:
        g = int(g)
        pos = int(pos)
        if d[g][pos-1] == ans:
            writer.writerow([g, pos, code, answer] + rest)

Upvotes: 1

qwwqwwq
qwwqwwq

Reputation: 7329

Here's an awk solution:

BEGIN {
    GROUP=1;
    BASE=2;
}
NR == FNR {
    positions[$1"_"$2]=toupper($3)
}

NR != FNR {
    if($0 ~ /^>/) {
        group=substr($0, 2, length($0));
    } else {
        gsub(" ", "", $0);
        seqs[group]=seqs[group]$0;
    }
}

END {
    print "Group","Position","Answer"
    for(current_group in seqs) {
        for(key in positions) {
            split(key,position,"_");
            if(position[GROUP] == current_group) {
                if(toupper(substr(seqs[group],position[BASE],1)) \
                        == positions[key]) {
                    print position[GROUP],
                          position[BASE],
                          positions[key];
                }
            }
        }
    }
}

awk -f script.awk File1 File2

output:

Group Position Answer
2 4 T
3 5 A

Position 51 of group 3 appears to be a G, not a T, so my output is different from yours.

Upvotes: 1

Related Questions