Reputation: 40653
Given, say, a recipe (a list of ingredients, steps, etc.) in free-text form, how could I parse it so that I can pull out the ingredients (e.g. quantity, unit of measurement, ingredient name) using PHP?
Assume that the free text is somewhat formatted.
Upvotes: 11
Views: 12207
Reputation: 2001
To do it 'properly', you need to define some sort of grammar, and then maybe use an LALR parser, or tools such as yacc, bison, or Lex, to build a parser. Assuming you don't want to do that, it's strpos() ftw!
Upvotes: 7
Reputation: 412
If you want to do this quickly, and with the least effort spent gathering resources, you can probably come up with some good heuristics and some regular expressions.
Since you say that the list is "somewhat formatted," I'll work on the assumption that there is one ingredient directive per line.
I'd start by coming up with a list of measurement names, which are a relatively closed class (as we call it in linguistics), like $measurements = ['cup', 'tablespoon', 'teaspoon', 'pinch', 'dash', 'to taste', ...]. You might even come up with a dictionary that maps several spellings to one normalised value, e.g. $measurements = ['cup' => ['cup', 'c'], 'tablespoon' => ['tablespoon', 'tbsp', 'tablesp', ...], ...] or whatnot.
Then, on each line, you can find the unit of measurement if it is in your dictionary. Next, look for numbers (which may be formatted as decimals, e.g. 1.5, or as mixed fractions, e.g. 2 1/2 or 2-1/2), and assume that is the count of the units you need. If there are no numbers, you can just assume that the quantity is one (as may be the case with "to taste" and the like).
Finally, you can assume anything that is remaining is the actual ingredient.
I imagine this heuristic would cover 75-80% of your cases. You're still going to have a lot of corner cases, like when the recipe calls for "2 oranges", or -- worse! -- "Juice of 2 oranges". In these cases, you would either want to add them (during some sort of off-line curation) as exceptions, or let yourself be "OK" with them not being treated properly.
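The heuristic described above could be sketched roughly as follows. This is only a minimal illustration, not a tested parser: the names UNIT_ALIASES, parseQuantity and parseIngredientLine are made up for the example, the unit dictionary is deliberately tiny, and it assumes the number (if any) sits at the start of the line.

```php
<?php
// Hypothetical unit dictionary: normalised name => accepted spellings.
// Longer spellings come first so "cups" is tried before "cup".
const UNIT_ALIASES = [
    'cup'        => ['cups', 'cup', 'c'],
    'tablespoon' => ['tablespoons', 'tablespoon', 'tbsp'],
    'teaspoon'   => ['teaspoons', 'teaspoon', 'tsp'],
    'pinch'      => ['pinch'],
    'to taste'   => ['to taste'],
];

// Convert "2 1/2", "2-1/2", "1/2" or "1.5" to a float.
function parseQuantity(string $text): float
{
    if (preg_match('~^(\d+)[ -](\d+)/([1-9]\d*)$~', $text, $m)) {
        return (float) ($m[1] + $m[2] / $m[3]);
    }
    if (preg_match('~^(\d+)/([1-9]\d*)$~', $text, $m)) {
        return (float) ($m[1] / $m[2]);
    }
    return (float) $text;
}

// Split one ingredient line into quantity, unit and ingredient name.
function parseIngredientLine(string $line): array
{
    $rest = trim($line);
    $quantity = 1.0; // assumed default when no number is present
    $unit = null;

    // Assumption: the number is at the start of the line.
    $number = '~^(\d+[ -]\d+/[1-9]\d*|\d+/[1-9]\d*|\d+(?:\.\d+)?)\s*~';
    if (preg_match($number, $rest, $m)) {
        $quantity = parseQuantity($m[1]);
        $rest = substr($rest, strlen($m[0]));
    }

    // Look for a known unit anywhere in what is left, as a whole word.
    foreach (UNIT_ALIASES as $name => $aliases) {
        foreach ($aliases as $alias) {
            $pattern = '~(?<![a-z])' . preg_quote($alias, '~') . '\.?(?![a-z])~i';
            $stripped = preg_replace($pattern, ' ', $rest, 1, $count);
            if ($count > 0) {
                $unit = $name;
                $rest = $stripped;
                break 2;
            }
        }
    }

    // Whatever remains is assumed to be the ingredient.
    $ingredient = trim(preg_replace('~\s+~', ' ', $rest), ' ,');

    return ['quantity' => $quantity, 'unit' => $unit, 'ingredient' => $ingredient];
}
```

With this sketch, parseIngredientLine('2 1/2 cups flour') yields a quantity of 2.5, the unit 'cup' and the ingredient 'flour'. A line such as "Juice of 2 oranges" falls through untouched (quantity 1, no unit, the whole line as the ingredient), which is the kind of corner case that would need curation.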
Upvotes: 1
Reputation: 791
There is OpenNLP in Java for named entity extraction, which can fetch you what you are looking for. See this: http://opennlp.sourceforge.net/models-1.5/
Then you can use a PHP-Java connector to get the results into PHP.
Upvotes: 3
Reputation: 28552
There's a very similar question for Java. In short, you need dictionaries (of, say, ingredients) and a regex-like language over terms (annotations). You can do it in Java and invoke it from PHP via a web service, or you can try to re-implement it in PHP (note that in the second case you may see a significant slowdown).
Upvotes: 1
Reputation: 37
Without a ton of language modeling, I think the only way would be to have a huge list of ingredients and search for them in the recipe. The quantity should be the word immediately prior to the ingredient.
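A rough sketch of that idea, assuming a small hard-coded list (KNOWN_INGREDIENTS and findIngredients are hypothetical names; a real list would be far larger):

```php
<?php
// Hypothetical ingredient list, longer names first so that
// "brown sugar" is preferred over "sugar".
const KNOWN_INGREDIENTS = ['brown sugar', 'butter', 'flour', 'sugar', 'eggs'];

// Find known ingredients in free text and take the word
// immediately before each one as its quantity.
function findIngredients(string $text): array
{
    $names = array_map(
        fn(string $n): string => preg_quote($n, '~'),
        KNOWN_INGREDIENTS
    );
    $pattern = '~(\S+)\s+(' . implode('|', $names) . ')\b~i';

    $found = [];
    if (preg_match_all($pattern, $text, $matches, PREG_SET_ORDER)) {
        foreach ($matches as $m) {
            $found[] = [
                'quantity'   => $m[1],
                'ingredient' => strtolower($m[2]),
            ];
        }
    }
    return $found;
}
```

Note that for a line like "2 cups flour" the word before the ingredient is the unit, not the number, so this approach on its own would report "cups" as the quantity.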
Upvotes: 0