Reputation: 7750
I'm running an educational website which is teaching programming to kids (12-15 years old).
As they don't all speak English in the code source of the solutions we are using French variables and functions names. However we are planing to translate the content into other languages (German, Spanish, English). To do so I would like to translate the source code as fast as possible. We mostly have C/C++ code.
The solution I'm planning to use :
Is there already some open-source code/project that can do that ? (For the points 1,2 and 4)
If there isn't, the most difficult point in the first one : using a C/C++ parser to build a syntactical tree and then extracting the variables with their position seems the way to go. Do you have others ideas ?
Thank you for any advice.
Edit : As noted in a comment I will also need to take care of the comments but there is only a few of them : the complete solution is already explained in plain-text and then we are showing the code-source with self-explained variable/function names. The source code is rarely more that 30/40 lines long and good names must make it understandable without comments if you already know what the code is doing.
Additional info : for the people interested the website is a training platform for the International Olympiads in Informatics and C/C++ (at least the minimum needed for programming contest) is not so difficult to learn by a 12 years old.
Upvotes: 12
Views: 6691
Reputation: 62058
I don't think replacing identifiers in the code is a good idea.
First, you are not going to get decent translations. A very important point here is that translation (especially automatic or pretty dumb translation) loses and distorts information. You may actually end up with something that's worse than the original.
Second, if the code is meant to be compiled again, the compiler may not be able to compile code containing non-English letters in the translated identifiers.
Third, if you replace identifiers with something else, you need to make sure you don't replace 2 or more different identifiers with the same word. That'll either make the code non-compilable or ruin its logic.
Fourth, you must make sure you don't translate reserved words and identifiers coming from the standard library of the language either. Translating those will make the code non-compilable and unreadable. It may not be a very trivial task to differentiate between the identifiers that the programmer has defined from those provided by the language and its standard library.
What I'd do instead of replacing identifiers with their translations is, provide the translations as comments next to them, for example:
void eat/*comer*/(int* food/*comida*/)
{
if (*food/*comida*/ <= 0)
{
printf("nothing to eat!"/*no hay que comer!*/);
exit/*salir*/(-1);
}
(*food/*comida*/)--;
}
This way you lose no information due to incorrect translation and don't break the code.
Upvotes: 0
Reputation: 47533
You don't really need a C/C++ parser, just a simple lexer that gives you elements of the code one by one. Then you get a lot of {
, [
, 213
, )
etc that you simply ignore and write to the result file. You translate whatever consists of only letters (except keywords) and you put them in the output.
Now that I think about it, it's as simple as this:
bool is_letter(char c)
{
return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
}
bool is_keyword(string &s)
{
return s == "if" || s == "else" || s == "void" /* rest of them */;
}
void translateCode(istream &in, ostream &out)
{
while (!in.eof())
{
char c = in.get();
if (is_letter(c))
{
string name = "";
do
{
name += c;
c = in.get();
} while (is_letter(c) && !in.eof());
if (is_keyword(name))
out << name;
else
out << translate(name);
}
out << c; // even if is_letter(c) was true, there is a new c from the
// while inside that was read (which was not letter), but
// not written, so would be written here.
}
}
I wrote the code in the editor, so there may be minor errors. Tell me if there are any and I'll fix it.
Edit: Explanation:
What the code does is simply to read input character by character, outputting whatever non-letter characters it reads (including spaces, tabs and new lines). If it does see a letter though, it will start putting all the following letters in one string (until it reaches another non-letter). Then if the string was a keyword, it would output the keyword itself. If it was not, would translate it and output it.
The output would have the exact same format as the input.
Upvotes: 2
Reputation: 19862
I really think you can use clang (libclang) to parse your sources and do what you want (see here for more information), the good news is that they have python bindings, which will make your life easier if you want to access a translation service or something like that.
Upvotes: 2
Reputation: 23332
Are you sure you need a full syntax tree for this? I think it would be enough to do lexical analysis to find the identifiers, which is much easier. Then exclude keywords and identifiers that also appear in the header files being included.
In principle it is possible that you want different variables with the same English name to be translated to different words in French/German -- but for educational use the risk of this arising is probably small enough to ignore at first. You could sidestep the issue by writing the original sources with some disambiguating quasi-Hungarian prefixes and then remove these with the same translation mechanism for display to English-speaking end users.
Be sure to let translators see the name they are translating with full context before they choose a translation.
Upvotes: 2