Reputation: 39
I have a text file in the following format. Each row has a variable number of columns.
File:
gi|269201691|ref|YP_003280960.1| chromosomal replication initiation protein gi|57651109|ref|YP_184912.1| chromosomal replication initiation protein % 1 0.0 2296 100.0
gi|269201692|ref|YP_003280961.1| DNA polymerase III subunit beta gi|57651110|ref|YP_184913.1| DNA polymerase III subunit beta % 1 0.0 1964 100.0
The resulting file should look like the following:
gi|269201691|ref|YP_003280960.1| gi|57651109|ref|YP_184912.1| % 1 0.0 2296 100.0
gi|269201694|ref|YP_003280963.1| gi|57651112|ref|YP_184915.1| % 1 0.0 1767 100.0
The code below prints the columns in each row that match the pattern 'ref':
awk '{for (i=1;i<=NF;i++) if ($i ~ /ref/) print $i }'
Any ideas on how to drop the description columns and get the output shown above?
Upvotes: 1
Views: 226
Reputation: 54392
Here's one way using GNU awk:
awk 'BEGIN { OFS=FS="|" } { for (i=1; i<=NF; i++) if ($i ~ / gi$/) $i = " gi"; sub(/.*%/, " %", $NF) }1' file.txt
Here's one way using GNU sed:
sed 's/|[^|]* gi|/| gi|/; s/\(.*|\).*\(%.*\)/\1 \2/' file.txt
Results:
gi|269201691|ref|YP_003280960.1| gi|57651109|ref|YP_184912.1| % 1 0.0 2296 100.0
gi|269201692|ref|YP_003280961.1| gi|57651110|ref|YP_184913.1| % 1 0.0 1964 100.0
Upvotes: 0
Reputation: 58391
This might work for you (GNU sed):
sed 's/\(.*|.*|.*|.*|\)\(.*\)\(\S\+|.*|.*|.*|\)\2%/\1\3%/' file
If the input file has multiline records:
sed 'N;s/\n//;s/\(.*|.*|.*|.*|\)\(.*\)\(\S\+|.*|.*|.*|\)\2%/\1\3%/' file
Upvotes: 0
Reputation: 5067
I am assuming that your newlines got mangled in your post, and that your input file actually has just one entry per line. In that case, I think this does what you want:
awk -F '[|%]' '{printf("%s|%d|%s|%s|",$1,$2,$3,$4);if($6)printf(" %%%s",$6);printf("\n")}'
Edit: OK, in light of the reformatted question, what you want is probably this:
awk -F '[|%]' '{printf("gi|%d|ref|%s|gi|%d|ref|%s| %%%s\n",$2,$4,$6,$8,$10)}'
For your example, this produces the following output for me:
gi|269201691|ref|YP_003280960.1|gi|57651109|ref|YP_184912.1| % 1 0.0 2296 100.0
gi|269201692|ref|YP_003280961.1|gi|57651110|ref|YP_184913.1| % 1 0.0 1964 100.0
This works by manually setting the field separator to be | or %. Hence, the variable number of words in the description is no longer a problem, and we can directly index the fields we want.
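To see why the indices 2, 4, 6, 8 and 10 are the ones to pick, it can help to dump every field awk produces for one sample row. The following is only a rough sketch: show_fields is a hypothetical helper name introduced here for illustration, and the sample line is copied from the question.

```shell
# Sketch only: show_fields is a hypothetical helper, not part of the answer above.
# It prints each field awk sees when the separator is either | or %.
show_fields() {
  awk -F '[|%]' '{ for (i = 1; i <= NF; i++) printf("%d: [%s]\n", i, $i) }'
}

# Sample row taken from the question.
line='gi|269201691|ref|YP_003280960.1| chromosomal replication initiation protein gi|57651109|ref|YP_184912.1| chromosomal replication initiation protein % 1 0.0 2296 100.0'

printf '%s\n' "$line" | show_fields
```

With this separator the free-text descriptions should land in fields 5 and 9, however many words they contain, while the IDs stay in fields 2, 4, 6 and 8 and everything after the % ends up in field 10.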
Upvotes: 1