Reputation: 35
I have a file that looks like this:
/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
LITPRAAVPALKRPALKASLPASSSHGNWETF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
CDS complement(471..590)
/db_xref="SEED:fig|1240086.14.peg.2"
/translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
/product="hypothetical protein"
CDS 717..2354
/db_xref="SEED:fig|1240086.14.peg.3"
/translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
/product="Methyl-accepting chemotaxis protein I (serine
chemoreceptor protein)"
/product="macromolecule metabolism; macromolecule
degradation; degradation of proteins, peptides,
glycopeptides"
I need to extract the text between the quotes after each "/product=", so the output should be:
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
I have to use awk, so I wrote this:
awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'
But this only captures the text on the same line as "/product", and sometimes the value spans two or three lines. I'm out of ideas for how to get the entire text between the quotes. Can anyone help?
Upvotes: 1
Views: 182
Reputation: 67467
awk to the rescue! This needs multi-char RS support (gawk).
$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
Explanation: set the record separator so that each record starts at either "/" or " CDS", select the records that start with product, collapse each newline plus its following indentation into a single space, and print the text between the two quotes (the second field, since the field separator is set to the double quote).
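For an awk without multi-char RS support, the same idea (collect the quoted value across lines, then join the pieces with single spaces) can be sketched with a line-by-line accumulator. This is only a sketch: the function name extract_products and the variables in_val and val are made up for illustration, and it assumes that product values never contain a double quote.

```shell
# Hypothetical helper (not from the thread): print each /product="..." value
# on one line, using only POSIX awk features.
extract_products() {
  awk '
    {
      line = $0
      if (!in_val) {
        # Skip lines until a product qualifier opens.
        if (line !~ /\/product="/) next
        sub(/^.*\/product="/, "", line)   # keep only the text after the opening quote
        val = ""
        in_val = 1
      } else {
        sub(/^[ \t]+/, "", line)          # drop continuation-line indentation
      }
      q = index(line, "\"")               # position of the closing quote, 0 if absent
      if (q > 0) {
        line = substr(line, 1, q - 1)     # cut at the closing quote
        in_val = 0
      }
      val = (val == "" ? line : val " " line)
      if (!in_val) print val              # value is complete: emit it
    }
  ' "$@"
}
```

Called as extract_products file, it reads the named file; with no argument it reads standard input.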
Upvotes: 1
Reputation: 203334
With GNU awk for multi-char RS and RT:
$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
Upvotes: 0
Reputation: 1446
Assuming the file name is file.txt:
echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'
Explanation:
Merge all lines into one line:
echo $(cat file.txt )
Replace each "/" with a newline:
echo $(cat file.txt ) | sed 's/\//\n/g'
Keep only the lines that contain "product":
echo $(cat file.txt ) | sed 's/\//\n/g' | grep product
Remove the leading product=" and everything from the closing double quote onward:
echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'
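The steps above can also be sketched as a small function. This variant is an assumption-laden rewrite rather than the pipeline as posted: it swaps echo $(cat ...) for tr, so the shell never word-splits or glob-expands the file contents, and it splits on "/" with tr instead of relying on \n in a sed replacement (a GNU extension). The name extract_products_pipeline is hypothetical, and like the original it assumes that no product value contains a "/".

```shell
# Hypothetical helper: same pipeline idea, with tr doing the line joining.
extract_products_pipeline() {
  tr '\t\n' '  ' < "$1" |          # turn tabs and newlines into spaces (one long line)
    tr -s ' ' |                    # squeeze runs of spaces into one
    tr '/' '\n' |                  # start a new line at every "/"
    grep '^product="' |            # keep only the product qualifiers
    sed 's/^product="//; s/".*//'  # strip the prefix and everything from the closing quote
}
```

Usage would be extract_products_pipeline file.txt.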
Upvotes: -1
Reputation: 19315
This can be done with GNU grep; the output is separated by \0 (NUL) bytes:
grep -Pzo '/product="\K[^"]*' | tr -s '\0\t\n' '\n '
Or with perl, replacing each run of spaces, newlines, or tabs with a single space and printing one match per line:
perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'
Upvotes: 1
Reputation: 92854
Awk solution:
awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
/\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file
-v RS='"' - treat the double quote " as the record separator
!(NR%2) - on each even-numbered record
gsub(/[[:space:]]+/," ") - collapse each run of whitespace into a single space
f=(/product=/? 1:0) - set the flag f to the active state 1 on /product= ... records (and to 0 on any other qualifier)

The output:
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
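To see why the even-numbered records are the quoted values, a minimal sketch like the following prints every record that RS='"' produces for a tiny made-up input. The function name show_quote_records and the two-qualifier sample are assumptions for illustration only. The bracket expression [ \t\n] is used here instead of [[:space:]] purely so the sketch also runs on older awks that lack POSIX character classes.

```shell
# Hypothetical demo: label each record produced by splitting on double quotes.
show_quote_records() {
  awk -v RS='"' '{
    gsub(/[ \t\n]+/, " ")   # collapse whitespace runs, including embedded newlines
    printf "%d %s: [%s]\n", NR, (NR % 2 ? "outside" : "inside"), $0
  }'
}

# Made-up sample: one multi-line product value and one unrelated qualifier.
printf '/product="a\n b"\n/note="c"' | show_quote_records
```

Odd-numbered records hold the text outside quotes (the qualifier names), and even-numbered records hold the quoted values, which is what lets the flag f decide whether the next quoted value belongs to a /product= qualifier.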
Upvotes: 1