Reputation: 13181
Problem
MacOSX comes with dictionaries stored in /Library/Dictionaries. I would like to parse them to obtain dictionary results programmatically (via Terminal, AppleScript, or Automator). The dictionaries are MacOSX packages, and each has a Contents folder that contains a file called Body.data. I would like to search that file for a UTF-8 string (possibly multi-byte Chinese characters) and return the lines where the string is found.
I've tried the following, which is not returning any results:
find . -name 'Body.data' -exec grep -li '我' {} \;
When I search through the dictionary using the app interface I can find the appropriate text. My objective is to create a workflow service to translate selected Chinese text into the pinyin equivalents which are stored in the system/user dictionaries.
Update
The following worked for me based on the accepted answer:
Created and archived a command-line utility called rdef using Xcode with this code:
#import <Foundation/Foundation.h>
#import <CoreServices/CoreServices.h>  // declares DCSCopyTextDefinition

int main(int argc, const char * argv[])
{
    @autoreleasepool {
        if (argc < 2)
        {
            printf("Usage: rdef <word to define>\n");
            return -1;
        }
        NSString * search =
            [NSString stringWithCString: argv[1] encoding: NSUTF8StringEncoding];
        // Look the word up in the active dictionaries; the result is NULL if nothing matches.
        // "Copy" in the function name means the caller owns the result, so hand it to ARC.
        NSString * def = (__bridge_transfer NSString *)
            DCSCopyTextDefinition(NULL,
                                  (__bridge CFStringRef)search,
                                  CFRangeMake(0, [search length]));
        NSString * output =
            [NSString stringWithFormat: @"Definition of <%@>: %@", search, def];
        printf("%s\n", [output UTF8String]);
    }
    return 0;
}
Added the following to my project frameworks:
Performed a Build and then deployed manually using the steps below.
To deploy:
Right-clicked the archived package and chose Show in Finder, then Show Package Contents, drilled down to the product folder, and copied the executable to /usr/local/bin. Now from a command prompt I can run the utility like so:
rdef 我|awk -F '\|' '{ gsub(/^ +| +$/, "", $2); print $2 }'
Please see the accepted answer below for extended references.
NB: The github for the utility can be found at https://github.com/mingsai/rdef.git
Next I will just create a Service to call the utility from Automator against selected text.
Service Solution
To pay it forward for the folks who've helped, especially @mklement0: here is the Solution for taking the command utility and converting it to a MacOSX service that can be used to translate Chinese characters to pinyin.
Create a new Automator Service file and make sure to select "Output replaces selected text".
Automator Script details
PATH=/bin:/usr/bin:/sbin:/usr/sbin:/usr/local/bin/:
export PATH
LC_CTYPE=UTF-8
x=$1
for ((i=0;i<${#x};i++)); do rdef "${x:i:1}" | awk -F '\|' 'BEGIN {ORS=" "}{ gsub(/^ | +?/, "", $2); if (length($2) > 0) print $2 ; exit}'; done
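As a rough illustration of what the loop above does, the sketch below walks a string one character at a time and keeps only the pinyin field of each lookup. The names to_pinyin and fake_rdef are made up for this example; fake_rdef merely imitates the output format for 我 so the sketch runs without the real utility. The character splitting uses POSIX prefix stripping rather than the ${x:i:1} substring expansion, and it assumes a UTF-8 locale and a shell (such as bash) whose ? pattern matches a whole multi-byte character.

```shell
# Hypothetical stand-in for rdef: it only knows the one sample entry.
fake_rdef() {
  case "$1" in
    我) printf '%s\n' 'Definition of <我>: | wǒ | I me my' ;;
  esac
}

# to_pinyin TEXT LOOKUP_CMD
# Calls LOOKUP_CMD once per character of TEXT and prints each non-empty
# pinyin field followed by a space, then a final newline.
to_pinyin() {
  rest=$1
  lookup=$2
  while [ -n "$rest" ]; do
    remainder=${rest#?}          # everything after the first character
    first=${rest%"$remainder"}   # the first character itself
    "$lookup" "$first" |
      awk -F ' *[|] *' 'length($2) > 0 { printf "%s ", $2; exit }'
    rest=$remainder
  done
  echo
}

to_pinyin '我我' fake_rdef
```

Under bash in a UTF-8 locale the last line should print the pinyin for 我 twice, separated by spaces. Swapping fake_rdef for rdef would query the real dictionaries, provided the utility is on the PATH.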
To make the service "live", delete the "Ask for Text" action and save the service under a name of your choice (e.g. Convert to Pinyin).
To use the revised service, highlight any Chinese characters, right-click to open the context menu, and select "Convert to Pinyin" under the Services menu at the bottom.
Usage
Produces this output
Hope that helps anyone with this problem.
Upvotes: 3
Views: 3470
Reputation: 438153
grep operates on text files, but the Body.data files are not text files, unfortunately.
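One way to see this for yourself, sketched here under the assumption that a NUL byte is a good enough sign of binary content, is to compare a file's size with and without its NUL bytes. The helper name has_nul_bytes and the temporary sample files are made up for this illustration; point the function at a real Body.data path to check a dictionary.

```shell
# has_nul_bytes FILE: succeeds if FILE contains at least one NUL byte.
has_nul_bytes() {
  total=$(( $(wc -c < "$1") ))
  without_nul=$(( $(tr -d '\000' < "$1" | wc -c) ))
  [ "$total" -ne "$without_nul" ]
}

# Two throwaway sample files: plain text, and text with embedded NULs.
text_file=$(mktemp)
binary_file=$(mktemp)
printf 'plain text\n' > "$text_file"
printf 'abc\000\000def\n' > "$binary_file"

for f in "$text_file" "$binary_file"; do
  if has_nul_bytes "$f"; then
    echo "$f: contains NUL bytes, treat as binary"
  else
    echo "$f: no NUL bytes found"
  fi
done

rm -f "$text_file" "$binary_file"
```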
Your best bet is probably to create your own command-line utility in Xcode, as suggested here (sample code): https://discussions.apple.com/thread/2679911
Here's Apple's dictionary API documentation: https://developer.apple.com/library/mac/documentation/UserExperience/Conceptual/DictionaryServicesProgGuide/access/access.html#//apple_ref/doc/uid/TP40006152-CH5-SW1
Update:
Assuming you've created a utility named rdef that returns something like 'Definition of <我>: | wǒ | I me my', use the following awk command to parse out the pinyin:
rdef "我" | awk -F ' *[|] *' '{ print $2 }'
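To try the field splitting without the utility installed, the sketch below feeds the sample line quoted above through the same awk program. The function name extract_pinyin is made up for this illustration, and the sample string stands in for real rdef output.

```shell
# Sample line in the format quoted above; a stand-in for real rdef output.
sample='Definition of <我>: | wǒ | I me my'

# Split on "|" with optional surrounding spaces; field 2 is the pinyin.
extract_pinyin() {
  awk -F ' *[|] *' '{ print $2 }'
}

printf '%s\n' "$sample" | extract_pinyin
```

Because the separator regex absorbs the spaces around each pipe, the second field comes out already trimmed, so no separate gsub step is needed.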
Alternatively, if an online-based solution is an option, you could try a Google Translate-based solution.
At least in interactive use you get a pinyin transcription below the input field.
For instance, your example symbol is transcribed as "Wǒ":
http://translate.google.com/?text=%E6%88%91#zh-CN/en/%E6%88%91
Upvotes: 2
Reputation: 207475
I had a look in the Chinese Simplified and the Oxford English Dictionary packages, and both have a Contents folder and a Body.data file, as you say. However, if I run
file Body.data
it just says data (rather than ASCII text or UTF-8), meaning that the file is binary rather than text, so grep and its friends are not going to work very well on it.
In case anyone is good at spotting a filetype from a hex dump, the files start off like this:
0000000 0000 0000 0000 0000 0000 0000 0000 0000
\0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0 \0
*
0000100 c9a8 0106 0000 0000 ffff ffff 0020 0000
250 311 006 001 \0 \0 \0 \0 377 377 377 377 \0 \0 \0
0000120 0000 0000 0207 0000 ffff ffff ffff ffff
\0 \0 \0 \0 \a 002 \0 \0 377 377 377 377 377 377 377 377
0000140 8009 0000 8005 0000 8c22 0004 9c78 bddc
\t 200 \0 \0 005 200 \0 \0 " 214 004 \0 x 234 ܽ **
0000160 6c6b db1b 2f7e e416 49a6 349a c5b8 902d
k l 033 333 ~ / 026 344 246 I 232 4 270 305 - 220
0000200 fda2 7134 7880 d4ef 2cb6 96d9 9dad f673
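For what it is worth, the pair x 234 on the 0000140 line (bytes 0x78 0x9c, at octal offset 0000154) is the two-byte header that zlib typically writes at its default compression level. That hints, without proving, that the payload is zlib-compressed. The sketch below looks for the first such pair in a file; the function name first_zlib_header and the sample bytes are made up for this illustration.

```shell
# first_zlib_header FILE: print the decimal offset of the first
# 0x78 0x9c byte pair in FILE, or nothing if the pair never occurs.
first_zlib_header() {
  od -An -v -t x1 "$1" |
    tr -s ' ' '\n' |
    awk 'NF {
           n++                       # n = 1-based index of the current byte
           if (prev == "78" && $1 == "9c") { print n - 2; exit }
           prev = $1
         }'
}

# Throwaway sample: 4 filler bytes followed by 0x78 0x9c and some text.
sample=$(mktemp)
printf '\000\000\042\214\170\234rest' > "$sample"
first_zlib_header "$sample"
rm -f "$sample"
```

The sample call prints 4. On a file that begins like the dump above, it would be expected to report 108 (octal 154). A match is only a hint, since the same two bytes can occur by chance in any binary data.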
Upvotes: 1