biowizz

Reputation: 23

Search, count and position of the count in a file

I am not an expert with Linux, but after reading posts in various forums I have been trying to write a script that matches a pattern of characters occurring together in a file. My file has approximately 200 million characters (upper and lower case), with about 50 characters per line. I have merged all the lines into a single line using

tr -d '\n' < input.txt > oneLineInput.txt

This puts all the characters in my file on a single line, with no line breaks.

I am trying to count the number of times the specific characters occur together. For example, in the file below

IamTryingtobuildascriptfortrestingthetyposinmysentence

I am trying to look for the pattern 'tr' that occurs in the sentence. The script I have now is

grep -o -i oneLineInput.txt -e tr | sort | uniq -c

The above script works perfectly fine for a small file, but when I try to run it on my actual file with more than 200 million characters, it takes ages to finish the task (I lost patience and did not check the total time taken).

  1. Is there a way I can optimize the code?

I have also been trying to get the position of each match. For example, in the above example file, 'tr' starts at the 4th and 27th positions.

  2. Is it possible to get the position of each match as a number in the output?

Thank you

Upvotes: 1

Views: 469

Answers (2)

Tom Fenech

Reputation: 74615

Here's another way that you could do it using awk:

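{
    # match() sets RSTART (1-based start of the match) and RLENGTH (its length)
    while (match($0, /[Tt][Rr]/)) {
        ++n                                 # number of matches so far
        m += RSTART                         # running sum of the RSTART values
        $0 = substr($0, RSTART + RLENGTH)   # drop everything up to the end of the match
        printf "match %d: position %d\n", n, m + n - 1
    }
}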

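match stores the position of the first match in the variable RSTART and the length of the match in RLENGTH. n keeps a count of the number of matches. substr drops everything up to and including the match, so the next call to match searches only the remaining text. Each step drops RSTART + RLENGTH - 1 characters, which for this 2-character pattern is one more than RSTART, so the sum m must be offset by n - 1 to give the position in the original line.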
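The n - 1 offset only holds because the pattern is exactly 2 characters long. As a minimal sketch of the same match/substr loop for patterns of other lengths, the variant below tracks the number of characters already dropped instead. The function name find_positions and the variables s and offset are my own, not part of the script above, and n restarts at 1 on every input line.

```shell
# Sketch: print the 1-based position of every case-insensitive "tr" per line.
# find_positions, s and offset are illustrative names (assumptions).
find_positions() {
    awk '{
        s = $0        # work on a copy instead of reassigning $0
        offset = 0    # characters already dropped from the front of s
        n = 0         # match counter, restarted for each input line
        while (match(s, /[Tt][Rr]/)) {
            printf "match %d: position %d\n", ++n, offset + RSTART
            offset += RSTART + RLENGTH - 1   # everything up to the end of the match
            s = substr(s, RSTART + RLENGTH)  # keep only the text after the match
        }
    }' "$@"
}

printf 'IamTryingtobuildascriptfortrestingthetyposinmysentence\n' | find_positions
# match 1: position 4
# match 2: position 27
```

Called with no arguments it reads standard input; with a file name (for example find_positions oneLineInput.txt) it reads that file.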

Output:

$ awk -f matches.awk file
match 1: position 4
match 2: position 27

Upvotes: 0

Jotne

Reputation: 41456

This awk command shows how many times tr occurs in oneLineInput.txt:

awk -F"[Tt][Rr]" '{print NF-1}' oneLineInput.txt
2
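As a hedged alternative for the count, gsub returns the number of replacements it made, so replacing every match with itself ("&") counts the hits without splitting the line into fields. The function name count_hits is my own. One difference from NF-1: an empty line contributes 0 here, whereas NF-1 prints -1 for it.

```shell
# Sketch: count case-insensitive "tr" hits using the return value of gsub.
# count_hits is an illustrative name (assumption), not part of the answer above.
count_hits() {
    awk '{ n += gsub(/[Tt][Rr]/, "&") }   # "&" puts the matched text back unchanged
         END { print n + 0 }              # n + 0 prints 0 when there were no hits
    ' "$@"
}

printf 'IamTryingtobuildascriptfortrestingthetyposinmysentence\n' | count_hits
# 2
```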

To get the position:

awk -F"[Tt][Rr]" 'BEGIN {print "hit\tposition"} {for (i=1;i<NF;i++) {p+=length($i);print ++a"\t"p+1+(a-1)*2}}' oneLineInput.txt
hit     position
1       4
2       27

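How the position p+1+(a-1)*2 is built:
p: running total of the lengths of the fields seen so far (the text between hits)
+1: the hit starts one character after the end of the field
(a-1)*2: the a-1 earlier hits, each 2 characters long (the length of tr), which the field lengths do not include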
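To make the arithmetic visible, here is a small sketch that prints p next to each computed position. The function name explain_positions and the len variable (the pattern length, 2 for tr) are assumptions of mine, and unlike the one-liner above the counters are reset on every line.

```shell
# Sketch: show the running field length p and the resulting position per hit.
# explain_positions and len are illustrative names (assumptions).
explain_positions() {
    awk -F'[Tt][Rr]' -v len=2 '
        BEGIN { print "hit\tp\tposition" }
        {
            p = 0; a = 0                      # reset for each input line
            for (i = 1; i < NF; i++) {
                p += length($i)               # text before this hit, hits excluded
                a++                           # hit number
                print a "\t" p "\t" (p + 1 + (a - 1) * len)
            }
        }' "$@"
}

printf 'IamTryingtobuildascriptfortrestingthetyposinmysentence\n' | explain_positions
# hit     p       position
# 1       3       4
# 2       24      27
```

For the second hit, p is 3 + 21 = 24 characters of non-matching text, plus 1, plus 2 characters for the first hit, giving 27.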

Upvotes: 1
