jdc0589

Reputation: 7018

Linux shell script to count occurrences of a character sequence in a text file?

I have a large text file (over 70 MB) and need to count the number of times a character sequence occurs in it. I can find plenty of scripts to do this, but NONE OF THEM take into account that a sequence can start and finish on different lines. For the sake of efficiency (I actually have far more than one file to process), I cannot preprocess the files to remove newlines.

Example: If I am searching for "thisIsTheSequence", the following file would have 3 matches:

asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

Thanks for the help.

Upvotes: 1

Views: 2662

Answers (4)

ghostdog74

Reputation: 342363

A single awk script will do, since you will be processing a huge file. Chaining multiple pipes can slow things down.

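#!/bin/bash
# Count occurrences of a literal string, ignoring line breaks in the file.
awk -v search="thisIsTheSequence" '
# Count non-overlapping matches in the buffer s, then keep in s only the
# unmatched tail that could still begin a match continuing on a later line.
function scan(    i, n, keep) {
  n = 0
  while ((i = index(s, search)) > 0) {   # index() matches literally, not as a regex
    n++
    s = substr(s, i + length(search))
  }
  keep = length(search) - 1
  if (length(s) > keep) s = substr(s, length(s) - keep + 1)
  return n
}
{ s = s $0 }                        # append the line without its newline
NR % 10 == 0 { total += scan() }    # scan every 10 lines so the buffer stays small
END { print "total count: " total + scan() }' file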

Output:

$ more file
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasdaasdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda
asdasdthisIsTheSequence
asdasdasthisIsT
heSequenceasdasdthisIsTheSequ
encesadasdasda

$ ./shell.sh
total count: 9

Upvotes: 2

Artelius

Reputation: 49089

Is there ever going to be more than one newline in your sequence?

If not, one solution would be to split your sequence in half and search for the halves (e.g. search for "thisIsTh" and also for "eSequence"), then go back to the occurrences you find and take a "closer look", i.e. strip out the newlines in that area and check for a match.

Basically this is a kind of fast "filtering" of the data to find something interesting.
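As a rough sketch of just the filtering step described above (not the "closer look" that follows it), the function below splits the sequence in half and lists the lines that contain either half. The function name candidate_lines and the choice of grep -F are my own assumptions for illustration; the line numbers it prints are only candidates that still need to be checked against their neighbouring lines.

```shell
#!/bin/bash
# candidate_lines SEQ FILE (hypothetical helper name)
# Prints "lineno:line" for every line containing either half of SEQ.
# If a match crosses at most one newline, at least one half lies
# entirely on a single line, so every real match touches a listed line.
candidate_lines() {
  local seq=$1 file=$2
  local half=$(( ${#seq} / 2 ))
  local first=${seq:0:half}    # e.g. "thisIsTh"
  local second=${seq:half}     # e.g. "eSequence"
  # -F: treat the halves as fixed strings, -n: prefix line numbers
  grep -n -F -e "$first" -e "$second" -- "$file"
}
```

Calling candidate_lines thisIsTheSequence file on the example from the question would flag only the lines worth re-examining with newlines stripped.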

Upvotes: 0

Preet Sangha

Reputation: 65496

Use something like:

head -n LL filename | tail -n YY | grep text | wc -l

where LL is the last line of the sequence and YY is the number of lines in the sequence (i.e. LL - first line + 1)

Upvotes: -1

bdonlan

Reputation: 231133

One option:

echo $((`tr -d "\n" < file | sed 's/thisIsTheSequence/\n/g' | wc -l` - 1))

There are probably more efficient methods using utilities outside the core of shell - particularly if you can fit the file in memory.
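A possible variant of the same idea, sketched here as an assumption rather than a tested replacement: let grep -o print each match on its own line and count those lines directly. This avoids relying on \n in the sed replacement, which is not portable across sed implementations. The function name count_seq is made up for this example, and grep -o, while available in GNU and BSD grep, is not required by POSIX.

```shell
#!/bin/bash
# count_seq SEQ FILE (hypothetical helper name)
# Prints how many times the literal string SEQ occurs in FILE,
# ignoring newlines, so matches split across lines are counted too.
count_seq() {
  local seq=$1 file=$2
  # tr joins the file into one long line; grep -o -F emits one line
  # per non-overlapping fixed-string match; wc -l counts those lines.
  tr -d '\n' < "$file" | grep -o -F -- "$seq" | wc -l
}
```

For several files, something like: for f in *.txt; do echo "$f: $(count_seq thisIsTheSequence "$f")"; done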

Upvotes: 7
