Reputation: 23
I am not an expert with Linux, but after reading posts in various forums, I have been trying to write a script that matches a pattern of characters occurring together in a file. My file has approximately 200 million characters (upper and lower case), with about 50 characters per line. I merged all the lines into a single line using
tr -d '\n' < input.txt > oneLineInput.txt
This gets all the characters in my file to the same line without spaces.
I am trying to count the number of times the specific characters occur together. For example, in the file below
IamTryingtobuildascriptfortrestingthetyposinmysentence
I am trying to look for the pattern 'tr' that occurs in the sentence. The script I have now is
grep -o -i oneLineInput.txt -e tr | sort | uniq -c
The above script works perfectly fine for a small file, but when I try to run it on my actual file with more than 200 million characters, it takes ages to finish the task (I lost patience and did not check the total time taken).
I have also been trying to get the position of each match. For example, in the example file above, 'tr' starts at the 4th and 27th positions.
Thank you
Upvotes: 1
Views: 469
Reputation: 74615
Here's another way that you could do it using awk:
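{
    # loop while the remaining text still contains a (case-insensitive) "tr"
    while (match($0, /[Tt][Rr]/)) {
        ++n                                  # number of matches so far
        m += RSTART                          # running sum of relative match positions
        $0 = substr($0, RSTART + RLENGTH)    # keep only the text after the match
        printf "match %d: position %d\n", n, m + n - 1
    }
}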
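match stores the position of the first match in the variable RSTART and the length of the match in RLENGTH. n keeps a count of the number of matches. substr is used to remove everything up to and including the match from the start of the string. Each removal drops RSTART + RLENGTH - 1 characters, which is one more than the RSTART added to m (the pattern is 2 characters long), so the position to be printed must be offset by n - 1.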
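The same loop can be written so that the offset is tracked explicitly instead of through m + n - 1, which also makes it work for patterns that are not exactly 2 characters long. This is only a sketch: the function name match_positions and the variables re, off and s are my own, positions are counted per input line, and it assumes the pattern never matches an empty string (otherwise the loop would not terminate).

```shell
# Hypothetical helper: print the 1-based position of every match of a
# regular expression in a file. Usage: match_positions REGEX FILE
match_positions() {
    awk -v re="$1" '
    {
        off = 0          # characters already removed from this line
        s = $0           # work on a copy so $0 is left alone
        while (match(s, re)) {
            printf "match %d: position %d\n", ++n, off + RSTART
            off += RSTART + RLENGTH - 1      # everything up to the end of the match
            s = substr(s, RSTART + RLENGTH)  # continue after the match
        }
    }' "$2"
}
```

For the example sentence, match_positions '[Tt][Rr]' file should report the same two positions as the script above.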
Output:
$ awk -f matches.awk file
match 1: position 4
match 2: position 27
Upvotes: 0
Reputation: 41456
This awk will show how many tr you have in the oneLineInput.txt
awk -F"[Tt][Rr]" '{print NF-1}' oneLineInput.txt
2
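If the file still has many lines, the command above prints one count per line. A small sketch that sums the counts instead is shown here; the function name count_hits is made up for illustration. Note that a match split across a line break would not be counted, so this only gives the same total as the merged file when the pattern never straddles two lines.

```shell
# Hypothetical helper: total number of matches of a field-separator regex
# over all lines of a file. Usage: count_hits REGEX FILE
count_hits() {
    awk -F"$1" '
    NF > 0 { total += NF - 1 }   # skip empty lines, where NF is 0
    END    { print total + 0 }   # "+ 0" prints 0 instead of an empty line
    ' "$2"
}
```

For example, count_hits '[Tt][Rr]' oneLineInput.txt should give the same number as the one-liner above when the file is a single line.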
To get the position:
awk -F"[Tt][Rr]" 'BEGIN {print "hit\tposition"} {for (i=1;i<NF;i++) {p+=length($i);print ++a"\t"p+1+(a-1)*2}}' oneLineInput.txt
hit position
1 4
2 27
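To get the position: p+1+(a-1)*2
p: cumulative length of the fields seen so far.
+1: since tr comes right after the end of the field.
(a-1)*2: the number of previous hits (a-1) multiplied by the length of the search string tr (2 characters), since the separators themselves are not counted in p.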
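The 2 in the formula is just the length of the search string, so it can be passed in as a variable. The following is a sketch under the assumption that every match has the same fixed length; field_positions and len are names I made up, and positions are counted per input line.

```shell
# Hypothetical helper: list hit number and 1-based position for a
# fixed-length pattern. Usage: field_positions REGEX LENGTH FILE
field_positions() {
    awk -F"$1" -v len="$2" '
    BEGIN { print "hit\tposition" }
    {
        p = 0                                # cumulative field length on this line
        for (i = 1; i < NF; i++) {
            p += length($i)
            # (i - 1) * len accounts for the separators skipped so far
            printf "%d\t%d\n", ++a, p + 1 + (i - 1) * len
        }
    }' "$3"
}
```

Calling field_positions '[Tt][Rr]' 2 oneLineInput.txt should reproduce the table above; for a 3-character pattern, pass 3 as the length.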
Upvotes: 1