Reputation: 6957
As we all know numbers can be written either in numerics, or called by their names. While there are a lot of examples to be found that convert 123 into one hundred twenty three, I could not find good examples of how to convert it the other way around.
Some of the caveats:
And there are probably more caveats possible that are not yet listed. Suppose the algorithm needs to be very robust, and even understand spelling mistakes.
What fields/papers/studies/algorithms should I read to learn how to write all this? Where is the information?
PS: My final parser should actually understand 3 different languages, English, Russian and Hebrew. And maybe at a later stage more languages will be added. Hebrew also has male/female numbers, like "one man" and "one woman" have a different "one" — "ehad" and "ahat". Russian also has some of its own complexities.
Google does a great job at this. For example:
(the reverse is also possible http://www.google.com/search?q=999999999999+in+english)
Upvotes: 51
Views: 16167
Reputation: 27574
I was converting ordinal edition statements from early modern books (e.g. "2nd edition", "Editio quarta") to integers and needed support for ordinals 1-100 in English and ordinals 1-10 in a few Romance languages. Here's what I came up with in Python:
def get_data_mapping():
data_mapping = {
"1st": 1,
"2nd": 2,
"3rd": 3,
"tenth": 10,
"eleventh": 11,
"twelfth": 12,
"thirteenth": 13,
"fourteenth": 14,
"fifteenth": 15,
"sixteenth": 16,
"seventeenth": 17,
"eighteenth": 18,
"nineteenth": 19,
"twentieth": 20,
"new": 2,
"newly": 2,
"nova": 2,
"nouvelle": 2,
"altera": 2,
"andere": 2,
# latin
"primus": 1,
"secunda": 2,
"tertia": 3,
"quarta": 4,
"quinta": 5,
"sexta": 6,
"septima": 7,
"octava": 8,
"nona": 9,
"decima": 10,
# italian
"primo": 1,
"secondo": 2,
"terzo": 3,
"quarto": 4,
"quinto": 5,
"sesto": 6,
"settimo": 7,
"ottavo": 8,
"nono": 9,
"decimo": 10,
# french
"premier": 1,
"deuxième": 2,
"troisième": 3,
"quatrième": 4,
"cinquième": 5,
"sixième": 6,
"septième": 7,
"huitième": 8,
"neuvième": 9,
"dixième": 10,
# spanish
"primero": 1,
"segundo": 2,
"tercero": 3,
"cuarto": 4,
"quinto": 5,
"sexto": 6,
"septimo": 7,
"octavo": 8,
"noveno": 9,
"decimo": 10
}
# create 4th, 5th, ... 20th
for i in xrange(16):
data_mapping[str(4+i) + "th"] = 4+i
# create 21st, 22nd, ... 99th
for i in xrange(79):
last_char = str(i)[-1]
if last_char == "0":
data_mapping[str(20+i) + "th"] = 20+i
elif last_char == "1":
data_mapping[str(20+i) + "st"] = 20+i
elif last_char == "2":
data_mapping[str(20+i) + "nd"] = 20+i
elif last_char == "3":
data_mapping[str(20+i) + "rd"] = 20+i
else:
data_mapping[str(20+i) + "th"] = 20+i
ordinals = [
"first", "second", "third",
"fourth", "fifth", "sixth",
"seventh", "eighth", "ninth"
]
# create first, second ... ninth
for c, i in enumerate(ordinals):
data_mapping[i] = c+1
# create twenty-first, twenty-second ... ninty-ninth
for ci, i in enumerate([
"twenty", "thirty", "forty",
"fifty", "sixty", "seventy",
"eighty", "ninety"
]):
for cj, j in enumerate(ordinals):
data_mapping[i + "-" + j] = 20 + (ci*10) + (cj+1)
data_mapping[i.replace("y", "ieth")] = 20 + (ci*10)
return data_mapping
Upvotes: 0
Reputation: 992717
I have some code I wrote a while ago: text2num. This does some of what you want, except it does not handle ordinal numbers. I haven't actually used this code for anything, so it's largely untested!
Upvotes: 11
Reputation: 5363
Use the Python pattern-en library:
>>> from pattern.en import number
>>> number('two thousand fifty and a half') => 2050.5
Upvotes: 7
Reputation: 1
Here is an extremely robust solution in Clojure.
AFAIK it is a unique implementation approach.
;----------------------------------------------------------------------
; numbers.clj
; written by: Mike Mattie [email protected]
;----------------------------------------------------------------------
(ns operator.numbers
(:use compojure.core)
(:require
[clojure.string :as string] ))
(def number-word-table {
"zero" 0
"one" 1
"two" 2
"three" 3
"four" 4
"five" 5
"six" 6
"seven" 7
"eight" 8
"nine" 9
"ten" 10
"eleven" 11
"twelve" 12
"thirteen" 13
"fourteen" 14
"fifteen" 15
"sixteen" 16
"seventeen" 17
"eighteen" 18
"nineteen" 19
"twenty" 20
"thirty" 30
"fourty" 40
"fifty" 50
"sixty" 60
"seventy" 70
"eighty" 80
"ninety" 90
})
(def multiplier-word-table {
"hundred" 100
"thousand" 1000
})
(defn sum-words-to-number [ words ]
(apply + (map (fn [ word ] (number-word-table word)) words)) )
; are you down with the sickness ?
(defn words-to-number [ words ]
(let
[ n (count words)
multipliers (filter (fn [x] (not (false? x))) (map-indexed
(fn [ i word ]
(if (contains? multiplier-word-table word)
(vector i (multiplier-word-table word))
false))
words) )
x (ref 0) ]
(loop [ indices (reverse (conj (reverse multipliers) (vector n 1)))
left 0
combine + ]
(let
[ right (first indices) ]
(dosync (alter x combine (* (if (> (- (first right) left) 0)
(sum-words-to-number (subvec words left (first right)))
1)
(second right)) ))
(when (> (count (rest indices)) 0)
(recur (rest indices) (inc (first right))
(if (= (inc (first right)) (first (second indices)))
*
+))) ) )
@x ))
Here are some examples
(operator.numbers/words-to-number ["six" "thousand" "five" "hundred" "twenty" "two"])
(operator.numbers/words-to-number ["fifty" "seven" "hundred"])
(operator.numbers/words-to-number ["hundred"])
Upvotes: 3
Reputation: 5083
Well, I was too late on the answer for this question, but I was working a little test scenario that seems to have worked very well for me. I used a (simple, but ugly, and large) regular expression to locate all the words for me. The expression is as follows:
(?<Value>(?:zero)|(?:one|first)|(?:two|second)|(?:three|third)|(?:four|fourth)|
(?:five|fifth)|(?:six|sixth)|(?:seven|seventh)|(?:eight|eighth)|(?:nine|ninth)|
(?:ten|tenth)|(?:eleven|eleventh)|(?:twelve|twelfth)|(?:thirteen|thirteenth)|
(?:fourteen|fourteenth)|(?:fifteen|fifteenth)|(?:sixteen|sixteenth)|
(?:seventeen|seventeenth)|(?:eighteen|eighteenth)|(?:nineteen|nineteenth)|
(?:twenty|twentieth)|(?:thirty|thirtieth)|(?:forty|fortieth)|(?:fifty|fiftieth)|
(?:sixty|sixtieth)|(?:seventy|seventieth)|(?:eighty|eightieth)|(?:ninety|ninetieth)|
(?<Magnitude>(?:hundred|hundredth)|(?:thousand|thousandth)|(?:million|millionth)|
(?:billion|billionth)))
Shown here with line breaks for formatting purposes..
Anyways, my method was to execute this RegEx with a library like PCRE, and then read back the named matches. And it worked on all of the different examples listed in this question, minus the "One Half", types, as I didn't add them in, but as you can see, it wouldn't be hard to do so. This addresses a lot of issues. For example, it addresses the following items in the original question and other answers:
Now, rather than store this monster of a regular expression in your source, I was considering building this RegEx at runtime, using something like the following:
char *ones[] = {"zero", "one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten", "eleven", "twelve",
"thirteen", "fourteen", "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"};
char *tens[] = {"", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy", "eighty", "ninety"};
char *ordinalones[] = { "", "first", "second", "third", "fourth", "fifth", "", "", "", "", "", "", "twelfth" };
char *ordinaltens[] = { "", "", "twentieth", "thirtieth", "fortieth", "fiftieth", "sixtieth", "seventieth", "eightieth", "ninetieth" };
and so on...
The easy part here is we are only storing the words that matter. In the case of SIXTH, you'll notice that there isn't an entry for it, because it's just it's normal number with TH tacked on... But ones like TWELVE need different attention.
Ok, so now we have the code to build our (ugly) RegEx, now we just execute it on our number strings.
One thing I would recommend, is to filter, or eat the word "AND". It's not necessary, and only leads to other issues.
So, what you are going to want to do is setup a function that passes the named matches for "Magnitude" into a function that looks at all the possible magnitude values, and multiplies your current result by that value of magnitude. Then, you create a function that looks at the "Value" named matches, and returns an int (or whatever you are using), based on the value discovered there.
All VALUE matches are ADDED to your result, while magnitutde matches multiply the result by the mag value. So, Two Hundred Fifty Thousand becomes "2", then "2 * 100", then "200 + 50", then "250 * 1000", ending up with 250000...
Just for fun, I wrote a vbScript version of this and it worked great with all the examples provided. Now, it doesn't support named matches, so I had to work a little harder getting the correct result, but I got it. Bottom line is, if it's a "VALUE" match, add it your accumulator. If it's a magnitude match, multiply your accumulator by 100, 1000, 1000000, 1000000000, etc... This will provide you with some pretty amazing results, and all you have to do to adjust for things like "one half" is add them to your RegEx, put in a code marker for them, and handle them.
Well, I hope this post helps SOMEONE out there. If anyone want, I can post by vbScript pseudo code that I used to test this with, however, it's not pretty code, and NOT production code.
If I may.. What is the final language this will be written in? C++, or something like a scripted language? Greg Hewgill's source will go a long way in helping understand how all of this comes together.
Let me know if I can be of any other help. Sorry, I only know English/American, so I can't help you with the other languages.
Upvotes: 2
Reputation: 13787
Try
Open an HTTP Request to "http://www.google.com/search?q=" + number + "+in+decimal".
Parse the result for your number.
Cache the number / result pairs to lesson the requests over time.
Upvotes: -2
Reputation: 37137
You should keep in mind that Europe and America count differently.
One Thousand
One Million
One Thousand Millions (British also use Milliard)
One Billion
One Thousand Billions
One Trillion
One Thousand Trillions
Here is a small reference on it.
A simple way to see the difference is the following:
(American counting Trillion) == (European counting Billion)
Upvotes: 5
Reputation: 124257
My LPC implementation of some of your requirements (American English only):
internal mapping inordinal = ([]);
internal mapping number = ([]);
#define Numbers ([\
"zero" : 0, \
"one" : 1, \
"two" : 2, \
"three" : 3, \
"four" : 4, \
"five" : 5, \
"six" : 6, \
"seven" : 7, \
"eight" : 8, \
"nine" : 9, \
"ten" : 10, \
"eleven" : 11, \
"twelve" : 12, \
"thirteen" : 13, \
"fourteen" : 14, \
"fifteen" : 15, \
"sixteen" : 16, \
"seventeen" : 17, \
"eighteen" : 18, \
"nineteen" : 19, \
"twenty" : 20, \
"thirty" : 30, \
"forty" : 40, \
"fifty" : 50, \
"sixty" : 60, \
"seventy" : 70, \
"eighty" : 80, \
"ninety" : 90, \
"hundred" : 100, \
"thousand" : 1000, \
"million" : 1000000, \
"billion" : 1000000000, \
])
#define Ordinals ([\
"zeroth" : 0, \
"first" : 1, \
"second" : 2, \
"third" : 3, \
"fourth" : 4, \
"fifth" : 5, \
"sixth" : 6, \
"seventh" : 7, \
"eighth" : 8, \
"ninth" : 9, \
"tenth" : 10, \
"eleventh" : 11, \
"twelfth" : 12, \
"thirteenth" : 13, \
"fourteenth" : 14, \
"fifteenth" : 15, \
"sixteenth" : 16, \
"seventeenth" : 17, \
"eighteenth" : 18, \
"nineteenth" : 19, \
"twentieth" : 20, \
"thirtieth" : 30, \
"fortieth" : 40, \
"fiftieth" : 50, \
"sixtieth" : 60, \
"seventieth" : 70, \
"eightieth" : 80, \
"ninetieth" : 90, \
"hundredth" : 100, \
"thousandth" : 1000, \
"millionth" : 1000000, \
"billionth" : 1000000000, \
])
varargs int denumerical(string num, status ordinal) {
if(ordinal) {
if(member(inordinal, num))
return inordinal[num];
} else {
if(member(number, num))
return number[num];
}
int sign = 1;
int total = 0;
int sub = 0;
int value;
string array parts = regexplode(num, " |-");
if(sizeof(parts) >= 2 && parts[0] == "" && parts[1] == "-")
sign = -1;
for(int ix = 0, int iix = sizeof(parts); ix < iix; ix++) {
string part = parts[ix];
switch(part) {
case "negative" :
case "minus" :
sign = -1;
continue;
case "" :
continue;
}
if(ordinal && ix == iix - 1) {
if(part[0] >= '0' && part[0] <= '9' && ends_with(part, "th"))
value = to_int(part[..<3]);
else if(member(Ordinals, part))
value = Ordinals[part];
else
continue;
} else {
if(part[0] >= '0' && part[0] <= '9')
value = to_int(part);
else if(member(Numbers, part))
value = Numbers[part];
else
continue;
}
if(value < 0) {
sign = -1;
value = - value;
}
if(value < 10) {
if(sub >= 1000) {
total += sub;
sub = value;
} else {
sub += value;
}
} else if(value < 100) {
if(sub < 10) {
sub = 100 * sub + value;
} else if(sub >= 1000) {
total += sub;
sub = value;
} else {
sub *= value;
}
} else if(value < sub) {
total += sub;
sub = value;
} else if(sub == 0) {
sub = value;
} else {
sub *= value;
}
}
total += sub;
return sign * total;
}
Upvotes: 2
Reputation: 48396
One place to start looking is the gnu get_date lib, which can parse just about any English textual date into a timestamp. While not exactly what you're looking for, their solution to a similar problem could provide a lot of useful clues.
Upvotes: -2
Reputation: 21950
I was playing around with a PEG parser to do what you wanted (and may post that as a separate answer later) when I noticed that there's a very simple algorithm that does a remarkably good job with common forms of numbers in English, Spanish, and German, at the very least.
Working with English for example, you need a dictionary that maps words to values in the obvious way:
"one" -> 1, "two" -> 2, ... "twenty" -> 20,
"dozen" -> 12, "score" -> 20, ...
"hundred" -> 100, "thousand" -> 1000, "million" -> 1000000
...and so forth
The algorithm is just:
total = 0
prior = null
for each word w
v <- value(w) or next if no value defined
prior <- case
when prior is null: v
when prior > v: prior+v
else prior*v
else
if w in {thousand,million,billion,trillion...}
total <- total + prior
prior <- null
total = total + prior unless prior is null
For example, this progresses as follows:
total prior v unconsumed string
0 _ four score and seven
4 score and seven
0 4
20 and seven
0 80
_ seven
0 80
7
0 87
87
total prior v unconsumed string
0 _ two million four hundred twelve thousand eight hundred seven
2 million four hundred twelve thousand eight hundred seven
0 2
1000000 four hundred twelve thousand eight hundred seven
2000000 _
4 hundred twelve thousand eight hundred seven
2000000 4
100 twelve thousand eight hundred seven
2000000 400
12 thousand eight hundred seven
2000000 412
1000 eight hundred seven
2000000 412000
1000 eight hundred seven
2412000 _
8 hundred seven
2412000 8
100 seven
2412000 800
7
2412000 807
2412807
And so on. I'm not saying it's perfect, but for a quick and dirty it does quite well.
Addressing your specific list on edit:
Number 6 is the only one I don't have a ready answer for, and that's because of the ambiguity between ordinals and fractions (in English at least) added to the fact that my last cup of coffee was many hours ago.
Upvotes: 47
Reputation: 17039
Ordinal numbers are not applicable because they cant be joined in meaningful ways with other numbers in language (...at least in English)
e.g. one hundred and first, eleven second, etc...
However, there is another English/American caveat with the word 'and'
i.e.
one hundred and one (English) one hundred one (American)
Also, the use of 'a' to mean one in English
a thousand = one thousand
...On a side note Google's calculator does an amazing job of this.
one hundred and three thousand times the speed of light
And even...
two thousand and one hundred plus a dozen
...wtf?!? a score plus a dozen in roman numerals
Upvotes: 4
Reputation: 9487
It's not an easy issue, and I know of no library to do it. I might sit down and try to write something like this sometime. I'd do it in either Prolog, Java or Haskell, though. As far as I can see, there are several issues:
If you are interested in this, I could give it a shot this weekend. My idea is probably using UIMA and tokenizing with it, then going on to further tokenize/disambiguate and finally translate. There might be more issues, let's see if I can come up with some more interesting things.
Sorry, this is not a real answer yet, just an extension to your question. I'll let you know if I find/write something.
By the way, if you are interested in the semantics of numerals, I just found an interesting paper by Friederike Moltmann, discussing some issues regarding the logic interpretation of numerals.
Upvotes: 12