user1921608

Reputation: 43

Get word between quotes

I have x lines like this:

Unable to find latest released revision of 'CONTRIB_046578'.   

I need to extract the word between the single quotes that follow "revision of" (in this example, CONTRIB_046578) and, if possible, count the number of occurrences of each such word. Can this be done using grep, sed or any other command?

Upvotes: 4

Views: 2746

Answers (6)

Graeme Walsh

Reputation: 678

If the test file below is representative of the file in the actual problem then the following may be useful.

On the basis that each line in the test file is homogeneous - that is, well-formatted and containing 8 columns (or fields) - a handy solution using the cut command would be as follows:

file:

Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

Code:

cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c

Output:

1 CONTRIB_046570
2 CONTRIB_046572
1 CONTRIB_046578
3 CONTRIB_046579

Note on the code: the default delimiter used by cut to separate each field is tab, but since we require the delimiter to be a single space to separate each field, we specify the option -d ' '. The rest of the code is similar to other answers, so I won't repeat what's been said.

General note: as mentioned above, this code will probably not produce the desired output if the file is not well-formatted.
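
If the number of words before the quoted name can vary, one possible variation is to split on the quote character instead of the space, so that the name is always field 2. This is only a sketch: the function name count_quoted is made up for illustration, and it assumes each relevant line contains exactly one quoted name. The -s option tells cut to skip lines that contain no quote at all.

```shell
# Hypothetical helper (not part of the original answer): count quoted names
# by splitting each line on the single quote rather than on spaces.
count_quoted() {
    # -d "'"  use the single quote as the field delimiter
    # -f 2    the text between the first and second quote
    # -s      drop lines that contain no delimiter
    cut -s -d "'" -f 2 "$1" | sort | uniq -c
}
```

It would be called as count_quoted file, and the output has the same shape as above (a count followed by the name).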

Upvotes: 1

Ed Morton

Reputation: 204258

All you need is a very simple awk script to count the occurrences of what's between the quotes:

awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file

Using @anubhava's test input file:

$ cat file
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
$
$ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
CONTRIB_046578 1
CONTRIB_046579 3
CONTRIB_046570 1
CONTRIB_046572 2
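
Note that awk does not specify the order in which for (w in c) visits the array, which is why the output above is not sorted. If a stable order matters, a minimal sketch (the function name count_sorted is hypothetical, and the NF >= 3 guard is an added assumption that skips lines without a quoted word) could print the count first and pipe the result through sort:

```shell
# Hypothetical wrapper around the same awk idea, with deterministic output.
count_sorted() {
    # With ' as the field separator, $2 is the text between the first two
    # quotes. NF >= 3 means the line contains at least two quotes.
    awk -F"'" 'NF >= 3 {c[$2]++} END {for (w in c) print c[w], w}' "$1" |
        sort -k1,1nr -k2,2   # highest count first, then by name
}
```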

Upvotes: 1

Andreas Fester

Reputation: 36650

Assumptions:

  • Each word can occur multiple times, and OP wants to count the number of occurrences of each word.
  • There are no other lines in the file.

Input file:

$ cat test.txt 
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046570'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.

Shell script to filter and count the words:

$ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
  1 CONTRIB_046570
  2 CONTRIB_046572
  2 CONTRIB_046578
  1 CONTRIB_046579
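
If the second assumption does not hold, the plain substitution prints unrelated lines unchanged and they end up in the counts. A possible variation, sketched here with a made-up function name (count_with_sed), uses sed -n together with the p flag so that only lines where the substitution succeeded are printed:

```shell
# Hypothetical variant: ignore lines that do not contain a quoted word.
count_with_sed() {
    # -n suppresses automatic printing; the trailing p prints a line only
    # when the substitution actually matched.
    sed -n "s/.*'\(.*\)'.*/\1/p" "$1" | sort | uniq -c
}
```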

Upvotes: 0

Chris Seymour

Reputation: 85875

The cleanest solution is grep -Po "(?<=')[^']+(?=')" (the -P option requires GNU grep with PCRE support):

$ cat file
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'

# Print occurrences
$ grep -Po "(?<=')[^']+(?=')" file
CONTRIB_046578
foo
bar
CONTRIB_046578

# Count matching lines (-c counts lines, not individual matches)
$ grep -Pc "(?<=')[^']+(?=')" file
4

# Count occurrences of each unique word
$ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c 
2 CONTRIB_046578
1 bar
1 foo
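
Where grep -P is not available, a rough equivalent can be sketched without lookarounds: match the quotes as part of the pattern and strip them afterwards with tr. The function names below are hypothetical, and the sketch assumes a grep that supports -o (GNU and BSD grep both do). Because -c counts lines rather than matches, the total is taken with wc -l instead.

```shell
# Hypothetical helpers, not part of the original answer.

# Print every quoted word, one per line, without the quotes.
extract_quoted() {
    grep -oE "'[^']+'" "$1" | tr -d "'"
}

# Total number of quoted words (matches, not lines).
count_matches() {
    extract_quoted "$1" | wc -l | tr -d ' '
}
```

The per-word counts are then extract_quoted file | sort | uniq -c, as in the last command above.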

Upvotes: 8

anubhava

Reputation: 785721

Here is an awk script that you can use to extract and count the frequency of each word in single quotes:

awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} 
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

TESTING

cat infile
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'

OUTPUT:

 awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}} 
      END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile

CONTRIB_046579 3
CONTRIB_046578 1
CONTRIB_046570 1
CONTRIB_046572 2

Upvotes: 1

Bohemian

Reputation: 425238

sed -E "s/.*'(.*)'.*/\1/" myfile.txt

Upvotes: 0
