Reputation: 1325
After combing through the documentation and past questions on list operations, I've come up blank: many of the cases involve numbers, whereas I'm working with large quantities of text.
I have a sorted list of common three-word phrases (trigrams) that appear in a large body of textual information, generated through Mathematica's Partition[], Tally[], and Sort[] commands. An example of the sort of data that I'm operating on (I have hundreds of these files):
{{{wa, wa, wa}, 66}, {{i, love, you}, 62}, {{la, la, la}, 50}, {{meaning, of, life}, 42}, {{on, come, on}, 40}, {{come, on, come}, 40}, {{yeah, yeah, yeah}, 38}, {{no, no, no}, 36}, {{we, re, gonna}, 36}, {{you, love, me}, 35}, {{in, love, with}, 32}, {{the, way, you}, 30}, {{i, want, to}, 30}, {{back, to, me}, 29}, <<38211>>, {{of, an, xke}, 1}}
I'm hoping to search this file so that if the input is "meaning, of, life" it will return 42. I feel like I must be overlooking something obvious, but after tinkering around I've hit a brick wall here. Mathematica is number-heavy in its documentation, which is... well, unsurprising.
Upvotes: 5
Views: 205
Reputation: 6989
Quite an old question, but now we have Association:
lookup = Association[Rule @@@ trigrams];
lookup[{"come", "on", "come"}]
40
or even
lookup = Association[
Rule[StringJoin@Riffle[#1, " "], #2] & @@@ trigrams]
lookup["meaning of life"]
42
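One case the examples above do not cover is a phrase that is absent from the data: applying the association directly then returns a Missing[...] expression rather than a count. A minimal sketch, assuming the string-keyed form and a hypothetical helper name trigramCount (not part of the original answer), that falls back to 0 instead:

```mathematica
(* Small stand-in for the full trigram list *)
trigrams = {{{"meaning", "of", "life"}, 42}, {{"come", "on", "come"}, 40}};

(* Same string-keyed association as above *)
lookup = Association[
   Rule[StringJoin@Riffle[#1, " "], #2] & @@@ trigrams];

(* trigramCount is a hypothetical helper; the third argument of
   Lookup is the value returned when the key is absent *)
trigramCount[phrase_String] := Lookup[lookup, phrase, 0]

trigramCount["meaning of life"]   (* 42 *)
trigramCount["no such phrase"]    (* 0 *)
```

Whether 0 is the right default depends on what the counts are used for; keeping the Missing[...] result may be preferable if an absent trigram has to be distinguished from a rare one.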
Upvotes: 1
Reputation: 6884
This is one way to get the individual words in your string into a list.
In[262]:= str = "meaning, of, life"; ReadList[
StringToStream[str], Word, WordSeparators -> {",", " "}]
Out[262]= {"meaning", "of", "life"}
You could use this in a Cases or other form of look-up to get the 42 result (very suspicious, that figure...)
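As a rough sketch of that Cases look-up, assuming the data is held in a variable named trigrams with string entries, and using StringSplit as one possible alternative for breaking up the input (my choice for brevity, not necessarily what was meant above):

```mathematica
(* Small stand-in for the full trigram list *)
trigrams = {{{"wa", "wa", "wa"}, 66}, {{"meaning", "of", "life"}, 42},
   {{"of", "an", "xke"}, 1}};

str = "meaning, of, life";

(* Split on the comma-plus-space separator *)
words = StringSplit[str, ", "];   (* {"meaning", "of", "life"} *)

(* Scan the list for the entry whose first element is exactly words *)
Cases[trigrams, {words, count_} :> count]   (* {42} *)
```

Cases returns a list of matches, so the result is {42} rather than 42, and an absent trigram gives {}. It also scans the whole list on every query, so it suits occasional look-ups better than repeated ones.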
--- edit ---
By "look-up" I have in mind the sort of mechanism shown by Leonid Shifrin. I was uncertain as to whether the difficulty being encountered was that, or simply converting from strings to lists of triads. I (only) show a way to manage the latter.
--- end edit ---
--- edit 2 ---
A comment shows ways to avoid ReadList. Let me state for the record that I'm ecstatic I managed to find that approach. Below is the code I had put into my original response, then replaced when I realized there was a more concise way.
str = "meaning, of, life";
commaposns = StringPosition[str, ", "];
substrposns =
Partition[
Join[{1},
Riffle[commaposns[[All, 1]] - 1, commaposns[[All, 2]] + 1], {-1}],
2];
substrs = Map[StringTake[str, #] &, substrposns]
Out[259]= {"meaning", "of", "life"}
Bottom line (almost literally): I can find convoluted approaches as well as anyone else, and better than most.
--- end edit ---
Daniel Lichtblau
Upvotes: 5
Reputation: 14731
This is probably not as fast as the solution that Leonid gave, but you could just turn your list of pairs into a list of rules.
In[1]:= trigrams = {{{"wa", "wa", "wa"}, 66}, {{"i", "love", "you"},
62}, {{"la", "la", "la"}, 50}, {{"meaning", "of", "life"},
42}, {{"on", "come", "on"}, 40}, {{"come", "on", "come"},
40}, {{"yeah", "yeah", "yeah"}, 38}, {{"no", "no", "no"},
36}, {{"we", "re", "gonna"}, 36}, {{"you", "love", "me"},
35}, {{"in", "love", "with"}, 32}, {{"the", "way", "you"},
30}, {{"i", "want", "to"}, 30}, {{"back", "to", "me"},
29}, {{"of", "an", "xke"}, 1}};
In[2]:= trigramRules = Rule @@@ trigrams;
If you want, you can wrap this up in a function that behaves similarly to Leonid's:
In[3]:= trigram[seq__String] := {seq} /. trigramRules
In[4]:= trigram["meaning", "of", "life"]
Out[4]= 42
Since you have a very large list of pairs, the application of the generated rules can be sped up by using Dispatch. That is, do everything else the same as above, except define trigramRules using
trigramRules = Dispatch[Rule @@@ trigrams]
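Put together, the Dispatch variant might look like the following sketch. It uses a shortened sample list for illustration, and the names are the same ones used in this answer:

```mathematica
(* Shortened sample; the real list has tens of thousands of entries *)
trigrams = {{{"wa", "wa", "wa"}, 66}, {{"i", "love", "you"}, 62},
   {{"meaning", "of", "life"}, 42}, {{"of", "an", "xke"}, 1}};

(* Build the rules once and wrap them in a dispatch table *)
trigramRules = Dispatch[Rule @@@ trigrams];

(* The lookup function is unchanged from the plain-rules version *)
trigram[seq__String] := {seq} /. trigramRules

trigram["meaning", "of", "life"]   (* 42 *)

(* No rule matches here, so the input list comes back unchanged *)
trigram["not", "in", "list"]       (* {"not", "in", "list"} *)
```

The dispatch table is built once up front, so the benefit shows up when many look-ups are made against the same rules.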
Upvotes: 5
Reputation: 22579
Assuming that you can load your data into Mathematica in the form you outlined, one very simple thing to do is to create a hash-table, where your trigrams will be the (compound) keys. Here is your sample (the part of it that you gave):
trigrams = {{{"wa", "wa", "wa"}, 66}, {{"i", "love", "you"}, 62},
{{"la", "la", "la"}, 50}, {{"meaning", "of", "life"}, 42},
{{"on", "come", "on"}, 40}, {{"come", "on", "come"}, 40},
{{"yeah", "yeah", "yeah"}, 38}, {{"no", "no", "no"}, 36},
{{"we", "re", "gonna"}, 36}, {{"you", "love", "me"}, 35},
{{"in", "love", "with"}, 32}, {{"the", "way", "you"}, 30},
{{"i", "want", "to"}, 30}, {{"back", "to", "me"}, 29},
{{"of", "an", "xke"}, 1}};
Here is one possible way to create a hash-table:
Clear[trigramHash];
(trigramHash[Sequence @@ #1] = #2) & @@@ trigrams;
Now, we use it like
In[16]:= trigramHash["meaning","of","life"]
Out[16]= 42
This approach will be beneficial if you perform many searches, of course.
EDIT
If you have many files and want to search them efficiently in Mathematica, one thing you could do is to use the above hashing mechanism to convert all your files to .mx binary Mathematica files. These files are optimized for fast loading, and serve as a persistence mechanism for definitions you want to store. Here is how it may work:
In[20]:= DumpSave["C:\\Temp\\trigrams.mx",trigramHash]
Out[20]= {trigramHash}
In[21]:= Quit[]
In[1]:= Get["C:\\Temp\\trigrams.mx"]
In[2]:= trigramHash["meaning","of","life"]
Out[2]= 42
You use DumpSave to create an .mx file. So, the suggested procedure is to load your data into Mathematica, file by file, create hashes (you could use SubValues to index a particular hash-table with an index of your file), and then save those definitions into .mx files. In this way, you get fast loading and fast searching, and you have the freedom to decide which part of your data to keep loaded into Mathematica at any given time (pretty much without the performance hit normally associated with file loading).
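The SubValues idea could be sketched roughly as follows. The helper names loadTrigrams and storeOne and the integer file indices are hypothetical, chosen only for illustration, and the two small lists stand in for data read from two different files:

```mathematica
Clear[trigramHash, loadTrigrams, storeOne];

(* storeOne is a hypothetical helper: it stores one {{w1, w2, w3}, count}
   entry as a SubValue of trigramHash, under the given file index *)
storeOne[fileIndex_, {{w1_, w2_, w3_}, count_}] :=
  trigramHash[fileIndex][w1, w2, w3] = count;

(* loadTrigrams is a hypothetical helper: it stores every entry of one file *)
loadTrigrams[fileIndex_, data_List] := Scan[storeOne[fileIndex, #] &, data];

(* Stand-ins for the contents of two files *)
loadTrigrams[1, {{{"meaning", "of", "life"}, 42}, {{"i", "love", "you"}, 62}}];
loadTrigrams[2, {{{"i", "love", "you"}, 7}}];

trigramHash[1]["i", "love", "you"]   (* 62 *)
trigramHash[2]["i", "love", "you"]   (* 7 *)
```

Because every definition hangs off the single symbol trigramHash, the per-file tables stay separate while still being reachable through one name, which is the symbol the answer proposes saving with DumpSave.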
Upvotes: 6