Drew

Reputation: 103

How to search from specific starting location in AppleScript

I'm trying to search for a string in a very long bit of text. Normally I'd do something along these lines:

set testString to "These aren't the droids you're looking for. Now we have a ridiculously large amount of text. These ARE the DROIDS you're looking for."

set searchTerm to "droids"
set searchTermLength to count of characters in searchTerm

# Gets string from first appearance of searchTerm
set testStringSearch to characters 19 thru -1 of testString as text

# Finds location of next appearance of searchTerm
set testLocation to offset of searchTerm in testStringSearch

# Returns the text found at that location (the end index is start + length - 1)
set theTest to characters testLocation thru (testLocation + searchTermLength - 1) of testStringSearch as text
return theTest

However, the amount of text is so large (120k+ characters) that when I try to set testStringSearch, it hangs for a while.

Since I'm going to be creating a loop where it returns each location of searchTerm, I would like to avoid that lost time, if possible. Is there something I'm missing?

Upvotes: 0

Views: 189

Answers (1)

Darrick Herwehe

Reputation: 3722

Your biggest bottleneck is when you strip off the beginning of the string:

set testStringSearch to characters 19 thru -1 of testString as text

With 120k+ characters of text, this creates a list of more than 120,000 one-character items, then joins that list back into text.
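If you do still need the trimmed string, a lighter option is to ask for a text range instead of a character list. This is a minimal sketch using the question's own testString; the only change is `text ... thru ...` in place of `characters ... thru ... as text`:

```applescript
set testString to "These aren't the droids you're looking for. Now we have a ridiculously large amount of text. These ARE the DROIDS you're looking for."

-- "characters 19 thru -1 of testString as text" builds a list of
-- one-character items and then joins it back together.
-- "text 19 thru -1" returns the substring directly, with no intermediate list.
set testStringSearch to text 19 thru -1 of testString
```

This still copies the tail of the string on every pass, so it helps with a single trim but does not remove the cost of doing it inside a loop.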

Your best bet would be to turn the string into data you can work with upfront and use that data for the rest of the script. As an example, you could split the string on the target search word and use the remaining string lengths to create a list of offsets:

set offsets to allOffsets("A sample string", "sample")
--> {3}

on allOffsets(str, target)
    set splitString to my explode(str, target)
    set offsets to {}
    set compensation to 0
    set targetLength to length of target
    repeat with i from 1 to ((count splitString) - 1)
        set currentStringLength to ((length of item i of splitString))
        set end of offsets to currentStringLength + compensation + 1
        set compensation to compensation + currentStringLength + targetLength
    end repeat
    return offsets
end allOffsets


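on explode(theText, theDelim)
    -- Save the current delimiters so the caller's setting is restored afterwards
    set oldDelims to AppleScript's text item delimiters
    set AppleScript's text item delimiters to theDelim
    set theList to text items of theText
    set AppleScript's text item delimiters to oldDelims
    return theList
end explode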

Each offset is the length of the current piece plus 1, plus the compensation variable. The compensation variable keeps a running total of everything already processed: the lengths of all previous pieces plus one target length for each occurrence between them.
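To make the bookkeeping concrete, here is a small worked run with a hand trace in the comments. The sample string and the variable names `sample`, `theOffsets`, and `found` are made up for illustration; the two handlers are condensed copies of the ones above so the snippet runs on its own.

```applescript
set sample to "one two one two one"
set target to "two"
set theOffsets to allOffsets(sample, target)
-- Trace: splitting on "two" gives the pieces {"one ", " one ", " one"}
--   piece 1: length 4, offset = 4 + 0 + 1 = 5,  compensation = 0 + 4 + 3 = 7
--   piece 2: length 5, offset = 5 + 7 + 1 = 13, compensation = 7 + 5 + 3 = 15
-- so theOffsets is {5, 13}

-- Read each match straight out of the original text; no trimmed copy is needed.
set found to {}
repeat with i from 1 to (count theOffsets)
    set startPos to item i of theOffsets
    set end of found to text startPos thru (startPos + (length of target) - 1) of sample
end repeat

on allOffsets(str, target)
    set pieces to my explode(str, target)
    set offsets to {}
    set compensation to 0
    repeat with i from 1 to ((count pieces) - 1)
        set pieceLength to length of item i of pieces
        set end of offsets to pieceLength + compensation + 1
        set compensation to compensation + pieceLength + (length of target)
    end repeat
    return offsets
end allOffsets

on explode(theText, theDelim)
    set oldDelims to AppleScript's text item delimiters
    set AppleScript's text item delimiters to theDelim
    set theList to text items of theText
    set AppleScript's text item delimiters to oldDelims
    return theList
end explode
```

Because the offsets index into the original string, the loop never has to rebuild a shortened copy of the text, which was the slow step in the question.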

Performance

I did find that performance is directly linked to how many occurrences are found in the string. My test data was made up of 20,000 words from a Lorem Ipsum generator.

Run 1:

Target: "lor"
Found:  141 Occurrences
Time:   0.01 seconds

Run 2:

Target: "e"
Found:  6,271 Occurrences
Time:   1.97 seconds

Run 3:

Target: "xor"
Found:  0 Occurrences
Time:   0.00 seconds

Upvotes: 2
