Reputation: 13
I have a list of queries and their hit GIs in one file (file1). A second file (file2) contains the complete names of the hits. I want to replace each Hit GI in file1 with the matching entry from file2, which carries the complete hit name, so that every Query line ends with the full name of its own hit.
file1
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_
file2
1 >gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
2 >gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
3 >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4 >gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
5 >gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
desired output:
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820_ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250_ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975_ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260_ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
Upvotes: 0
Views: 143
Reputation: 949
One of the quickest (if naive) solutions is to use Python's re module to match the patterns in each line. Here is an example of how you could do it (check the results on your own data before relying on them):
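import re

# Placeholder paths: point these at your own files.
f1path, f2path, f3path = "file1", "file2", "file3"

namesD = dict()
with open(f2path, "r") as file2:
    for line in file2:
        # Header lines look like ">gi_..._ref_..._ description".
        strH = re.search(r">(\S+)\s+(.*)", line)
        if strH:
            namesD[strH.group(1)] = strH.group(2).strip()

with open(f1path, "r") as file1, open(f3path, "w") as file3:
    for line in file1:
        strH = re.search("Hit=", line)
        if not strH:
            continue
        # file1 writes the id as "gi_... ref_..._", file2 as "gi_..._ref_..._".
        idN = line[strH.end():].strip().replace(' ', '_')
        if idN in namesD:
            # Keep everything up to and including "Hit=", then add id and name.
            file3.write(line[:strH.end()] + idN + " " + namesD[idN] + "\n")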
The idea is to first extract the ids and their names from file2 into a dict (the id is the key, the name is the value). Then read file1 line by line, extract the id from each hit and look it up in the dict. If it is there, write the result to a third file, or do whatever else you want with it.
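As a self-contained illustration of the dict lookup described above, here is a minimal sketch that works on lists of lines instead of files. The function name annotate_hits and the two regular expressions are my own assumptions about the format shown in the question, so treat it as a starting point rather than a finished tool.

```python
import re

def annotate_hits(file1_lines, file2_lines):
    """Return file1 lines with each Hit id replaced by the full file2 header."""
    # Map "gi_..._ref_..._" -> description, taken from ">id description" headers.
    names = {}
    for line in file2_lines:
        m = re.search(r">(\S+)\s+(.*)", line)
        if m:
            names[m.group(1)] = m.group(2).strip()

    out = []
    for line in file1_lines:
        line = line.rstrip("\n")
        m = re.search(r"Hit=(.*)", line)
        if not m:
            out.append(line)  # no Hit= field: keep the line unchanged
            continue
        # file1 separates the two id parts with a space, file2 with "_".
        hit_id = m.group(1).strip().replace(" ", "_")
        if hit_id in names:
            line = f"{line[:m.start()]}Hit={hit_id} {names[hit_id]}"
        out.append(line)  # unknown hit ids are also left unchanged
    return out

# One line from each file in the question.
file1 = ["1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_"]
file2 = ["3 >gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein "
         "[Mycobacterium tuberculosis H37Rv]"]
print(annotate_hits(file1, file2)[0])
```

Lines whose hit id is not found in file2 are passed through unchanged here, which makes missing entries easy to spot in the output; dropping them instead would be an equally reasonable choice.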
Upvotes: 0
Reputation: 8587
If I run:
sed -e 's/>gi/Hit=gi/' file2.txt | awk 'NR==FNR && match($0, /Hit=gi_[0-9]+/) { k = substr($0, RSTART, RLENGTH); if (match($0, /\.[0-9]+_ +/)) d[k] = substr($0, RSTART + RLENGTH) } NR!=FNR && match($0, /Hit=gi_[0-9]+/) { k = substr($0, RSTART, RLENGTH); if (k in d) print $0 " " d[k] }' - file1.txt
The same logic is broken down as a script further below. In the script I set the default file names to file1.rasta and file2.rasta, which do not exist, so running it without arguments prints the usage:
./run.sh
-------------------------------------------------------------------------------
No variables defined settings files as:
fil1=file1.rasta
file2=file2.rasta
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
One of the following:
file1=file1.rasta
file2=file2.rasta
does not exist!
-------------------------------------------------------------------------------
-------------------------------------------------------------------------------
usage:
./run.sh file1.fasta file2.fasta is the same as line below
./run.sh ./file1.fasta ./file2.fasta
-- This is if files are elsewhere
./run.sh /path/to/file1.fasta /path/to/file2.fasta
-------------------------------------------------------------------------------
Running it:
./run.sh ./file1.fasta ./file2.fasta
1 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148659820 ref_YP_001281343.1_ hypothetical protein MRA_0062 [Mycobacterium tuberculosis H37Ra]
2 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_148821250 ref_YP_001286004.1_ hypothetical protein TBFG_10059 [Mycobacterium tuberculosis F11]
3 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_ Conserved hypothetical protein [Mycobacterium tuberculosis H37Rv]
4 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_253796975 ref_YP_003029976.1_ hypothetical protein TBMG_00059 [Mycobacterium tuberculosis KZN 1435]
5 Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_375294260 ref_YP_005098527.1_ hypothetical protein TBSG_00059 [Mycobacterium tuberculosis KZN 4207]
The bash script run.sh, which is the one-liner above broken down with explanations:
#!/bin/bash
function line() {
    echo -e "-------------------------------------------------------------------------------"
}
function usage() {
    line;
    echo "usage:"
    echo $0 file1.fasta file2.fasta is the same as line below
    echo $0 ./file1.fasta ./file2.fasta
    echo -- This is if files are elsewhere
    echo $0 /path/to/file1.fasta /path/to/file2.fasta
    line;
}
file1=$1;
file2=$2;
if [ $# -lt 2 ]; then
    # No arguments given: fall back to a default name for file1.
    # Ensure this file exists in the current path,
    # otherwise use a full path:
    # file1=/path/to/file1.fasta
    file1=file1.rasta;
    # Same for file2: default name, must exist in the current path,
    # otherwise use a full path:
    # file2=/path/to/file2.fasta
    file2=file2.rasta;
    line;
    echo -e "No variables defined settings files as:\nfil1=$file1\nfile2=$file2";
    line;
fi
# Check that both files exist, whether they came from the arguments
# or from the defaults above
if [ ! -f "$file1" ] || [ ! -f "$file2" ]; then
    line;
    echo -e "One of the following: \n file1=$file1\nfile2=$file2\n does not exist!"
    line
    usage
    exit 2;
fi
# Read file2 through sed and change:
# '>gi' to 'Hit=gi' and '_ref_' to ' ref_'
# so that its ids are written the same way as the Hit ids in file1
cfile2=$(sed -e "s/>gi/Hit=gi/" -e "s/_ref_/ ref_/" "$file2");
# debug: enable this if you want to see the rewritten file2
# echo "$cfile2"
# Pipe cfile2 into awk, which reads it as its first input (the - before
# $file1) and then reads file1 as its second input.
# While reading the first input (NR==FNR), the Hit=gi_<number> part is the
# key, and everything after the pattern .{number}_{space} is the
# description stored under that key.
# While reading file1, the same key is taken from each line; if a
# description is stored for it, print $0 (the whole file1 line) plus
# that description.
printf '%s\n' "$cfile2" | awk 'NR==FNR && match($0, /Hit=gi_[0-9]+/) {
    key = substr($0, RSTART, RLENGTH);
    if (match($0, /\.[0-9]+_ +/)) {
        desc[key] = substr($0, RSTART + RLENGTH);
    }
}
NR!=FNR && match($0, /Hit=gi_[0-9]+/) {
    key = substr($0, RSTART, RLENGTH);
    if (key in desc) print $0" "desc[key]
}' - "$file1"
## Method used originally - updated to above which is much cleaner
## pattern matches and then from that point it captures entire string which would
## ensure it captures the entire tag from file2
##echo $cfile2| awk 'NR==FNR {
## _[$2]=$2;
## f1_line[key] = $4" "$5" "$6" "$7" "$8" "$9" "$10
## }
## NR!=FNR {
## if(_[$2] != "") print $0" "f1_line[key]
## }' - $file1
Upvotes: 0
Reputation: 48
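The solution, step by step:
Extract only the Hit GIs from file1:
sed 's/.*Hit=\([^ ]*\).*/\1/' file1 > file1-gi
Remove everything up to and including the > from file2:
sed 's/^[^>]*>//' file2 > file2_1
Remove redundancy in file2, if any:
sort -u file2_1 > file2_2
Use awk's system command to grep the name of each corresponding GI (the ^ and trailing _ stop a GI from matching inside a longer one):
awk '{system("grep \"^" $1 "_\" file2_2")}' file1-gi > file1-gi-name
Keep the part of each file1 line before the Hit:
sed 's/ Hit=.*//' file1 > file1_1
Paste the two files, putting Hit= back in front of each name:
sed 's/^/Hit=/' file1-gi-name | paste -d' ' file1_1 - > output
Note that paste joins the files line by line, so this only works when every GI in file1 has exactly one entry in file2.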
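Because the final paste step joins the two files line by line, the result is only right when every hit GI in file1 finds exactly one name in file2. A minimal Python sketch of that check follows; the function name check_hits and the regular expressions are my own assumptions based on the format in the question, and gi_999 is a made-up id used only to show a miss.

```python
import re
from collections import Counter

def check_hits(file1_lines, file2_lines):
    """Return (missing, ambiguous) hit ids; both empty means paste lines up."""
    # Identical header lines count once, like the sort/uniq step does.
    unique_headers = {line.strip() for line in file2_lines}
    # Count how many distinct headers carry each gi id, e.g. "gi_15607202".
    header_ids = Counter()
    for header in unique_headers:
        m = re.search(r">(gi_\d+)_", header)
        if m:
            header_ids[m.group(1)] += 1

    missing, ambiguous = [], []
    for line in file1_lines:
        m = re.search(r"Hit=(gi_\d+)", line)
        if not m:
            continue  # lines without a Hit= field are ignored
        count = header_ids[m.group(1)]  # Counter gives 0 for unknown ids
        if count == 0:
            missing.append(m.group(1))
        elif count > 1:
            ambiguous.append(m.group(1))
    return missing, ambiguous

file1 = [
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_15607202 ref_NP_214574.1_",
    "Query=gi_148659820 ref_YP_001281343.1_ Hit=gi_999 ref_XX_000000.1_",  # made-up id
]
file2 = [
    ">gi_15607202_ref_NP_214574.1_ Conserved hypothetical protein "
    "[Mycobacterium tuberculosis H37Rv]",
]
print(check_hits(file1, file2))  # (['gi_999'], [])
```

If either list is non-empty, the pasted output would shift or repeat names from that line on, so it is worth running a check like this before trusting the result.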
Upvotes: 0