pulikot1

Reputation: 39

AWK: Extract columns from a file whose rows have a variable number of columns

I have a text file in the following format. Each row has a variable number of columns.

File:

gi|269201691|ref|YP_003280960.1| chromosomal replication initiation protein                                                            gi|57651109|ref|YP_184912.1| chromosomal replication initiation protein                                                                   %           1        0.0           2296      100.0
gi|269201692|ref|YP_003280961.1| DNA polymerase III subunit beta                                                                       gi|57651110|ref|YP_184913.1| DNA polymerase III subunit beta                                                                              %           1        0.0           1964      100.0

The resulting file should look like the following:

gi|269201691|ref|YP_003280960.1| gi|57651109|ref|YP_184912.1| % 1        0.0           2296      100.0
gi|269201694|ref|YP_003280963.1| gi|57651112|ref|YP_184915.1| % 1        0.0           1767      100.0

The code below finds the columns in each row that match the pattern 'ref':

awk '{for (i=1;i<=NF;i++) if ($i ~ /ref/) print $i }'

Any ideas on how to produce the resulting file shown above?

Upvotes: 1

Views: 226

Answers (3)

Steve

Reputation: 54392

Here's one way using GNU awk:

awk 'BEGIN { OFS=FS="|" } { for (i=1; i<=NF; i++) if ($i ~ / gi$/) $i = " gi"; sub(/.*%/, " %", $NF) }1' file.txt

Here's one way using GNU sed:

sed 's/|[^|]* gi|/| gi|/; s/\(.*|\).*\(%.*\)/\1 \2/' file.txt

Results:

gi|269201691|ref|YP_003280960.1| gi|57651109|ref|YP_184912.1| % 1 0.0 2296 100.0
gi|269201692|ref|YP_003280961.1| gi|57651110|ref|YP_184913.1| % 1 0.0 1964 100.0

Upvotes: 0

potong

Reputation: 58391

This might work for you (GNU sed):

sed 's/\(.*|.*|.*|.*|\)\(.*\)\(\S\+|.*|.*|.*|\)\2%/\1\3%/' file

If the input file has multiline records:

sed 'N;s/\n//;s/\(.*|.*|.*|.*|\)\(.*\)\(\S\+|.*|.*|.*|\)\2%/\1\3%/' file

Upvotes: 0

amaurea

Reputation: 5067

I am assuming that your newlines got mangled in your post, and that your input file actually has just one entry per line. In that case, I think this does what you want:

awk -F '[|%]' '{printf("%s|%d|%s|%s|",$1,$2,$3,$4);if($6)printf(" %%%s",$6);printf("\n")}'

Edit: Ok, in light of the new line numbers, what you want is probably this:

awk -F '[|%]' '{printf("gi|%d|ref|%s|gi|%d|ref|%s| %%%s\n",$2,$4,$6,$8,$10)}'

For your example, this produces the following output for me:

gi|269201691|ref|YP_003280960.1|gi|57651109|ref|YP_184912.1| % 1 0.0 2296 100.0
gi|269201692|ref|YP_003280961.1|gi|57651110|ref|YP_184913.1| % 1 0.0 1964 100.0

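This works by setting the field separator to either | or % (the -F '[|%]' option). Each description then falls inside a single field no matter how many words it contains, so the variable number of words is no longer a problem: the fields we want sit at fixed positions and can be indexed directly.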
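
To see why the indices 2, 4, 6, 8 and 10 are the ones to print, it can help to list every field next to its number. This is only a minimal sketch: the helper name show_fields is made up for illustration, and the sample line is a shortened version of your first row with the descriptions abbreviated.

```shell
# Hypothetical helper: list each field awk sees when splitting on | or %.
show_fields() {
    awk -F '[|%]' '{ for (i = 1; i <= NF; i++) printf("%d: [%s]\n", i, $i) }'
}

# Shortened sample line (description text abbreviated for illustration).
sample='gi|269201691|ref|YP_003280960.1| DNA pol  gi|57651109|ref|YP_184912.1| DNA pol  % 1 0.0 2296 100.0'

printf '%s\n' "$sample" | show_fields
```

With this sample, the descriptions end up inside fields 5 and 9, the IDs stay at 2, 4, 6 and 8, and everything after the % is field 10, however many words the descriptions contain.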

Upvotes: 1
