Reputation: 43
I have x lines like this:
Unable to find latest released revision of 'CONTRIB_046578'.
I need to extract the word between "revision of '" and the closing "'" (in this example, CONTRIB_046578) and, if possible, count the number of occurrences of each such word using grep, sed, or any other command.
Upvotes: 4
Views: 2746
Reputation: 678
If the test file below is representative of the file in the actual problem, then the following may be useful.
Assuming every line in the file is uniformly formatted, with exactly 8 space-separated columns (or fields), a handy solution using the cut command would be as follows:
file:
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
Code:
cut -d ' ' -f 8 file | tr -d "'" | sort | uniq -c
Output:
1 CONTRIB_046570
2 CONTRIB_046572
1 CONTRIB_046578
3 CONTRIB_046579
Note on the code: the default delimiter used by cut to separate fields is a tab, but since the fields here are separated by a single space, we specify the option -d ' '. The rest of the pipeline is similar to the other answers, so I won't repeat what's been said.
General note: as mentioned above, this code will not produce the desired output if the file is not uniformly formatted, because the quoted word would no longer be field 8. Also, tr -d "'" removes only the quotes, so a trailing period like the one in the question's example line would stay attached to the word.
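If the lines are not guaranteed to have exactly 8 fields, one possible variation is to split on the single quote instead of the space, so that the quoted word is always field 2. The following is only a sketch under that assumption; the sample data and the tmp and counts variable names are made up for illustration.

```shell
# Hypothetical sample data: the second line has an extra word and the third
# ends with a period, so "field 8 split on spaces" would fail for both.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find the latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046579'.
A line without any quotes
EOF

# Split on the single quote: field 2 is the text between the first two quotes.
# -s suppresses lines that do not contain the delimiter at all.
counts=$(cut -s -d "'" -f 2 "$tmp" | sort | uniq -c)
printf '%s\n' "$counts"

rm -f "$tmp"
```

This should report CONTRIB_046572 once and CONTRIB_046579 twice. The exact padding in front of each count depends on the uniq implementation.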
Upvotes: 1
Reputation: 204258
All you need is a very simple awk script to count the occurrences of what's between the quotes:
awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
Using @anubhava's test input file:
$ cat file
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
$
$ awk -F\' '{c[$2]++} END{for (w in c) print w,c[w]}' file
CONTRIB_046578 1
CONTRIB_046579 3
CONTRIB_046570 1
CONTRIB_046572 2
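The order in which awk's for (w in c) loop visits the array is unspecified, which is why the output above is not sorted. If a stable order matters, one option is to print the count first and pipe the result through sort. This is a sketch only; the tmp and sorted variable names are illustrative.

```shell
# Sample data copied from the test file above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
EOF

# Print "count word", then order by count (descending) and by word.
sorted=$(awk -F "'" '{c[$2]++} END{for (w in c) print c[w], w}' "$tmp" |
  LC_ALL=C sort -k1,1nr -k2,2)
printf '%s\n' "$sorted"

rm -f "$tmp"
```

With this input the most frequent word, CONTRIB_046579, comes first with a count of 3.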
Upvotes: 1
Reputation: 36650
Assumptions:
Input file:
$ cat test.txt
Unable to find latest released revision of 'CONTRIB_046578'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046579'.
Unable to find latest released revision of 'CONTRIB_046570'.
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
Shell script to filter and count the words:
$ sed "s/.*'\(.*\)'.*/\1/" test.txt | sort | uniq -c
1 CONTRIB_046570
2 CONTRIB_046572
2 CONTRIB_046578
1 CONTRIB_046579
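One caveat with the sed command above: a line that contains no quotes does not match the pattern, so sed prints it unchanged and it ends up being counted as well. If the real file might contain such lines, a possible variation is to use -n together with the p flag so that only lines where the substitution succeeded are printed. The sample data and the tmp and counts names below are assumptions for illustration.

```shell
# Hypothetical sample data with one unrelated line mixed in.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'.
Some other log message without quotes
Unable to find latest released revision of 'CONTRIB_046572'.
Unable to find latest released revision of 'CONTRIB_046578'.
EOF

# -n turns off automatic printing; the trailing p prints a line only when the
# substitution matched, so the unquoted line is dropped instead of counted.
# [^']* keeps the capture limited to the first quoted word on the line.
counts=$(sed -n "s/^[^']*'\([^']*\)'.*/\1/p" "$tmp" | sort | uniq -c)
printf '%s\n' "$counts"

rm -f "$tmp"
```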
Upvotes: 0
Reputation: 85875
The cleanest solution is with grep -Po "(?<=')[^']+(?=')"
$ cat file
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'
# Print occurrences
$ grep -Po "(?<=')[^']+(?=')" file
CONTRIB_046578
foo
bar
CONTRIB_046578
# Count matching lines (-c counts lines, not individual matches)
$ grep -Pc "(?<=')[^']+(?=')" file
4
# Count unique occurrences
$ grep -Po "(?<=')[^']+(?=')" file | sort | uniq -c
2 CONTRIB_046578
1 bar
1 foo
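The -P option depends on grep having been built with PCRE support, which the grep shipped with macOS and other BSD systems lacks. Where -P is unavailable, a rough equivalent is to match the quotes as part of the pattern and strip them afterwards with tr. This is a sketch, not a drop-in replacement: -o is not part of POSIX, although as far as I know both GNU and BSD grep support it, and the tmp and words names are illustrative.

```shell
# Sample data copied from the file above.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'foo'
Unable to find latest released revision of 'bar'
Unable to find latest released revision of 'CONTRIB_046578'
EOF

# -o prints each match on its own line, quotes included; tr removes the quotes.
words=$(grep -oE "'[^']+'" "$tmp" | tr -d "'")
printf '%s\n' "$words"

rm -f "$tmp"
```

The result can be piped through sort | uniq -c exactly as in the last command above.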
Upvotes: 8
Reputation: 785721
Here is an awk script that you can use to extract each single-quoted word and count how often it occurs:
awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}}
END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile
cat infile
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046578'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046570'
Unable to find latest released revision of 'CONTRIB_046579'
Unable to find latest released revision of 'CONTRIB_046572'
Unable to find latest released revision of 'CONTRIB_046579'
OUTPUT:
awk '{for (i=1; i<=NF; i++) {if ($i ~ /^'"'.*?'"'/ ) cnt[$i]++;}}
END {for (a in cnt) {b=a; gsub(/'"'"'/, "", b); print b, cnt[a]}}' infile
CONTRIB_046579 3
CONTRIB_046578 1
CONTRIB_046570 1
CONTRIB_046572 2
Upvotes: 1