Reputation: 95
I have a problem executing shell scripts from AppleScript. I run a "grep", but as soon as the pattern contains special characters it doesn't work as intended. (The script reads a list of subfolders in a directory and checks whether any of the subfolder names appear in a file.)
Here is my script:
set searchFile to "/tmp/output.txt"
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set theCommand to "grep -c " & quoted form of company & space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false
The problem occurs, for example, with strings containing umlauts. "theCommand" is somehow encoded differently than when I type it on the CLI directly.
$ grep -c 'Württemberg' '/tmp/output.txt' --> typed on command line
3
$ grep -c 'Württemberg' '/tmp/output.txt' --> copy & pasted from AppleScript
0
$ grep -c 'rttemberg' '/tmp/output.txt' --> no umlauts, no problems
3
The "ü" in the first and the second line are different; running echo 'Württemberg' | openssl base64 on each of them shows this.
I tried several encoding tricks at different places, basically everything I could find or think of.
Does anyone have any idea? How can I check which encoding a string has?
Thanks in advance! Sebastian
Upvotes: 1
Views: 2627
Reputation: 24982
This can work by escaping each character that has an accent in each company name before they are used in the grep command.
So, you'll need to escape each one of those characters (i.e. those which have an accent) with double backslashes (i.e. \\). For example:
ü in Württemberg will need to become \\ü
ö in Königsberg will need to become \\ö
ß in Einbahnstraße will need to become \\ß
These accented characters, such as a u with diaeresis, are certainly getting encoded differently. Which encoding they receive is difficult to ascertain. My assumption is that the encoded form begins with a backslash, which would explain why escaping those characters with backslashes fixes the issue. Consider the u with diaeresis: in C/C++ source code, for instance, the ü is written as \u00FC.
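One way to check how two visually identical strings differ is to dump their bytes. The sketch below rests on an assumption that is not confirmed anywhere in this thread: that the two forms are the precomposed character U+00FC and the decomposed sequence "u" followed by the combining diaeresis U+0308, both in UTF-8. The helper name hex_bytes and the temporary file are illustrative only.

```shell
#!/bin/sh
# Assumption: the two forms are precomposed U+00FC versus u + U+0308.
# Both are built from octal byte escapes, so this file's own encoding is irrelevant.
precomposed=$(printf 'W\303\274rttemberg')   # ü as bytes c3 bc
decomposed=$(printf 'Wu\314\210rttemberg')   # ü as bytes 75 cc 88

# hex_bytes (hypothetical helper): print a string's bytes as one hex string.
hex_bytes() {
    printf '%s' "$1" | od -An -tx1 | tr -d ' \t\n'
}

precomposed_hex=$(hex_bytes "$precomposed")
decomposed_hex=$(hex_bytes "$decomposed")
echo "precomposed: $precomposed_hex"
echo "decomposed:  $decomposed_hex"

# A file holding only the precomposed form, three lines long.
sample=$(mktemp "${TMPDIR:-/tmp}/umlaut.XXXXXX")
printf '%s\n' "$precomposed" "$precomposed" "$precomposed" > "$sample"

# grep does no Unicode normalization, so only the precomposed pattern matches.
# "|| true" absorbs grep's exit status 1 when nothing matches.
count_precomposed=$(grep -c -F -- "$precomposed" "$sample" || true)
count_decomposed=$(grep -c -F -- "$decomposed" "$sample" || true)
echo "precomposed pattern: $count_precomposed, decomposed pattern: $count_decomposed"

rm -f "$sample"
```

If the folder names from Finder and the text in the file turn out to differ in this way, the two hex dumps make the difference visible even though both strings look the same on screen.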
In the complete script below you'll notice the following:
set accentedChars to {"ü", "ö", "ß", "á", "ė"}
has been added to hold a list of all characters that will need to be escaped. You'll need to explicitly state each one as there doesn't seem to be a way to infer whether the character has an accent.
Before assigning the grep command to the theCommand variable, we first escape the necessary characters via the line reading:
set company to escapeChars(company, accentedChars)
As you can see, we pass two arguments to the escapeChars sub-routine (i.e. the non-escaped company variable and the list of accented characters).
In the escapeChars sub-routine we iterate over each char in the accentedChars list and invoke the findAndReplace sub-routine. This escapes any instances of those characters found in the company variable with backslashes.
Complete script:
set searchFile to "/tmp/output.txt"
set accentedChars to {"ü", "ö", "ß", "á", "ė"}
set theCommand to "/usr/local/bin/pdftotext -enc UTF-8 some.pdf" & ¬
    space & searchFile
do shell script theCommand
tell application "Finder"
    set companies to get name of folders of folder ("/path/" as POSIX file)
end tell
repeat with company in companies
    set company to escapeChars(company, accentedChars)
    set theCommand to "grep -c " & quoted form of company & ¬
        space & quoted form of searchFile
    try
        do shell script theCommand
        set CompanyName to company as string
        return CompanyName
    on error
    end try
end repeat
return false
(**
* Checks each character of a given word. If any characters of the word
* match a character in the given list of characters they will be escaped.
*
* @param {text} searchWord - The word to check the characters of.
* @param {list} charactersList - List of characters to be escaped.
* @returns {text} The new text with the matching characters escaped.
*)
on escapeChars(searchWord, charactersList)
    repeat with char in charactersList
        set searchWord to findAndReplace(char, ("\\" & char), searchWord)
    end repeat
    return searchWord
end escapeChars
(**
* Replaces all occurrences of findString with replaceString.
*
* @param {text} findString - The text string to find.
* @param {text} replaceString - The replacement text string.
* @param {text} searchInString - Text string to search.
* @returns {text} The new text with the item(s) replaced.
*)
on findAndReplace(findString, replaceString, searchInString)
    set oldTIDs to text item delimiters of AppleScript
    set text item delimiters of AppleScript to findString
    set searchInString to text items of searchInString
    set text item delimiters of AppleScript to replaceString
    set searchInString to "" & searchInString
    set text item delimiters of AppleScript to oldTIDs
    return searchInString
end findAndReplace
Currently your grep command only reports the number of lines on which the word was found, not how many instances of the word were found.
If you want the actual number of instances of the word, use the -o option with grep to output each occurrence on its own line, then pipe that to wc with the -l option to count the lines. For example:
grep -o 'Württemberg' /tmp/output.txt | wc -l
and in your AppleScript that would be:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l"
Tip: If you want to remove the leading spaces in the count that gets logged, pipe it to sed to strip the spaces. For example, via your script:
set theCommand to "grep -o " & quoted form of company & space & ¬
quoted form of searchFile & "| wc -l | sed -e 's/ //g'"
and the equivalent via the command line:
grep -o 'Württemberg' /tmp/output.txt | wc -l | sed -e 's/ //g'
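To make the difference between counting lines and counting occurrences concrete, here is a minimal sketch. The sample file and its contents are made up for illustration, and an ASCII word is used so the result does not depend on the encoding issue discussed above.

```shell
#!/bin/sh
# Hypothetical sample: the word "alpha" appears 3 times, spread over 2 lines.
sample=$(mktemp "${TMPDIR:-/tmp}/grepcount.XXXXXX")
printf '%s\n' 'alpha beta alpha' 'gamma' 'alpha' > "$sample"

# -c counts the lines that contain at least one match.
line_count=$(grep -c 'alpha' "$sample")

# -o prints every match on its own line; wc -l counts those lines.
# sed strips the padding that some wc implementations add.
match_count=$(grep -o 'alpha' "$sample" | wc -l | sed -e 's/ //g')

echo "lines: $line_count, matches: $match_count"
rm -f "$sample"
```

Here the two counts differ because the first line contains the word twice.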
Upvotes: 3