jane doe

Reputation: 35

awk: extracting data that spans several lines

I have a file that looks like this:

/translation="MDGVTQQNAALVQEATTAAASLEEQARNLTAAVAAFDLGDKQTV
                 LITPRAAVPALKRPALKASLPASSSHGNWETF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
 CDS             complement(471..590)
                 /db_xref="SEED:fig|1240086.14.peg.2"
                 /translation="MHQYQSAILAKICRYGGIEKPEITPASVYKLDSHWRYVI"
                 /product="hypothetical protein"
 CDS             717..2354
                 /db_xref="SEED:fig|1240086.14.peg.3"
                 /translation="MGFFVVLWGGASGFSLYSLKQVTTLLHDNSTQGRTYTYLVYGND
                 QYFRSVTRMARVMDYSQFSDAAIASLEEQAQQLTKAVEVFHLGSEYQTAAS
                 RTRPAGNMALKRPALSGMAPALPPARTASDEGSWEKF"
                 /product="Methyl-accepting chemotaxis protein I (serine
                 chemoreceptor protein)"
                 /product="macromolecule metabolism; macromolecule
                 degradation; degradation of proteins, peptides,
                 glycopeptides"

I need to extract the quoted text that follows each "/product=", so the expected output is:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

I have to use awk, so I wrote this:

awk '/\/product/ {split($0, a, "\""); printf a[2] "\n"}'

but this only captures the text on the same line as "/product", and sometimes the value spans two or three lines. I'm out of ideas for how to get the entire text between the quotes. Can anyone help?

Upvotes: 1

Views: 182

Answers (5)

karakfa

Reputation: 67467

awk to the rescue! This needs multi-char RS support (gawk):

$ awk -v RS='/| CDS' -F'"' '/^product/{gsub("\n +"," "); print $2}' file


Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

Explanation: set the record separator so that a new record starts at each "/" or " CDS"; select the records that start with product; collapse each line break and the indentation that follows it into a single space; then print the text between the two double quotes (the second field, since the field separator is set to a double quote).
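If gawk is not available, the same record idea can be approximated line by line with plain POSIX awk. The following is only a sketch, not part of the original answer: the function name extract_products is hypothetical, and it assumes that a value never contains a double quote and that continuation lines are indented with spaces or tabs.

```shell
# Hypothetical helper: print each /product="..." value on one line (POSIX awk).
extract_products() {
  awk '
  {
    if (!inside) {
      if ($0 !~ /\/product="/) next      # ignore lines outside a product value
      sub(/^.*\/product="/, "")          # drop everything up to the opening quote
      buf = $0
      inside = 1
    } else {
      sub(/^[ \t]+/, "")                 # strip the indentation of a continuation line
      buf = buf " " $0
    }
    if (buf ~ /"/) {                     # closing quote reached: value is complete
      sub(/".*$/, "", buf)
      print buf
      inside = 0
    }
  }' "$@"
}

# Small demo on inline data
printf '%s\n' \
  '                 /product="Methyl-accepting chemotaxis protein I (serine' \
  '                 chemoreceptor protein)"' \
  '                 /product="hypothetical protein"' | extract_products
# prints:
# Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
# hypothetical protein
```

With the question's data it would be called as extract_products file.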

Upvotes: 1

Ed Morton

Reputation: 203334

With GNU awk for multi-char RS and RT:

$ gawk -v RS='/product="[^"]+"' -F'"' 'RT{$0=RT; gsub(/\s+/," "); print $2}' file
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides

Upvotes: 0

Sujit Dhamale

Reputation: 1446

Assuming the file name is file.txt:

echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'

Explanation:

  1. Merge all lines into one line (the unquoted command substitution also collapses runs of whitespace into single spaces)

    echo $(cat file.txt )

  2. Replace each "/" with a newline (\n in the replacement requires GNU sed)

    echo $(cat file.txt ) | sed 's/\//\n/g'

  3. Keep only the lines that contain product

    echo $(cat file.txt ) | sed 's/\//\n/g' | grep product

  4. Remove the leading product=" and everything from the closing double quote onward

    echo $(cat file.txt ) | sed 's/\//\n/g' | grep product | sed 's/product="//g;s/".*//'

Upvotes: -1

Nahuel Fouilleul

Reputation: 19315

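This can be done with GNU grep, reading the file on standard input. With -z the matches are separated by a NUL byte (\0); tr then turns each NUL into a newline and squeezes the remaining tabs, newlines and spaces into single spaces: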

grep -Pzo '/product="\K[^"]*'  | tr -s '\0\t\n' '\n '

or with perl, replacing each run of spaces, tabs or newlines with a single space and printing one match per line:

perl -0777 -ne 'print s/\s+/ /gr."\n" for /\/product="\K[^"]*/g'

Upvotes: 1

RomanPerekhrest

Reputation: 92854

Awk solution:

awk -v RS='"' '!(NR%2) && f{ gsub(/[[:space:]]+/," "); print }
               /\/[[:alnum:]_-]+=$/{ f=(/product=/? 1:0) }' file
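  • -v RS='"' - treat the double quote " as the record separator
  • !(NR%2) - true for each even-numbered record, i.e. the text inside a pair of quotes
  • gsub(/[[:space:]]+/," ") - collapse each run of whitespace (including line breaks) into a single space
  • f=(/product=/? 1:0) - set the flag f to 1 when the record ends with /product=, and reset it to 0 for any other /key=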

The output:

Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
hypothetical protein
Methyl-accepting chemotaxis protein I (serine chemoreceptor protein)
macromolecule metabolism; macromolecule degradation; degradation of proteins, peptides, glycopeptides
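The record numbering described in the bullets above can be checked on a two-line sample. This is just an illustrative sketch: the function name show_records and the sample /note= line are made up for the demo, and whitespace-only records are skipped to keep the listing short.

```shell
# Hypothetical helper: list the records awk sees when RS is a double quote.
show_records() {
  awk -v RS='"' '
  NF {                       # skip records that contain only whitespace
    $1 = $1                  # rebuild $0 with single spaces, trimming the ends
    printf "%d %s <%s>\n", NR, (NR % 2 ? "outside" : "quoted"), $0
  }' "$@"
}

printf '%s\n' \
  '                 /product="hypothetical protein"' \
  '                 /note="x"' | show_records
# prints:
# 1 outside </product=>
# 2 quoted <hypothetical protein>
# 3 outside </note=>
# 4 quoted <x>
```

Odd-numbered records hold the /key= text and even-numbered records hold the quoted values, which is why the flag set on an odd record decides whether the next even record is printed.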

Upvotes: 1
