Reputation: 1351
I am trying to accomplish the following. Let's say we have a table that contains these fields (ID, content)
1 | apple
2 | pineapple
3 | application
4 | nation
now, I am looking for a function that will tell me all possible common matches. For example, if the argument is "3", the function will return all possible strings from 3 characters that appear in more then one record.
In this case, I get "app","ppl","ple","ati","tio","ion"
If the argument is "4", i get: "appl","pple","atio","tion"
If the arugment is "5", i get: "apple","ation"
If the argument is "6", nohting is returned.
Untill now, I did not find a function that accomplishes this.
Thx!
Some extra information: I am using this in a PHP script with a MySQL database. I really just want to give the amount of characters as an argument and of course the table to search in.
Upvotes: 3
Views: 1457
Reputation: 2617
Well, this is kind of ugly, but it does work fine. It's generic SQL and will work in any environment. Simply generate a number of selects of a substring that is greater than the maximum length of the field that you're reading. Change the number 50 in the function to a number that exceeds your fieldlength. It may return a realllly long query, but like I said, it'll work fine. Here is an example in Python:
import sqlite3
c = sqlite3.connect('test.db')
c.execute('create table myTable (id integer, content varchar[50])')
for id, content in ((1,'apple'),(2,'pineapple'),(3,'application'),(4,'nation')):
c.execute('insert into myTable values (?,?)', [id,content])
c.commit();
def GenerateSQL(substrSize):
subqueries = ["select substr(content,%i,%i) AS substr, count(*) AS myCount from myTable where length(substr(content,%i,%i))=%i group by substr(content,%i,%i) " % (i,substrSize,i,substrSize,substrSize,i,substrSize) for i in range(50)]
sql = 'select substr FROM \n\t(' + '\n\tunion all '.join(subqueries) + ') \nGROUP BY substr HAVING sum(myCount) > 1'
return sql
print GenerateSQL(3)
print c.execute(GenerateSQL(3)).fetchall()
The query generated looks like:
select substr FROM
(select substr(content,0,3) AS substr, count(*) AS myCount from myTable where length(substr(content,0,3))=3 group by substr(content,0,3)
union all select substr(content,1,3) AS substr, count(*) AS myCount from myTable where length(substr(content,1,3))=3 group by substr(content,1,3)
union all select substr(content,2,3) AS substr, count(*) AS myCount from myTable where length(substr(content,2,3))=3 group by substr(content,2,3)
union all select substr(content,3,3) AS substr, count(*) AS myCount from myTable where length(substr(content,3,3))=3 group by substr(content,3,3)
union all select substr(content,4,3) AS substr, count(*) AS myCount from myTable where length(substr(content,4,3))=3 group by substr(content,4,3)
... )
GROUP BY substr HAVING sum(myCount) > 1
And the results it produces are:
[(u'app',), (u'ati',), (u'ion',), (u'nat',), (u'pin',), (u'ple',), (u'ppl',), (u'tio',)]
Upvotes: 3
Reputation: 4979
I'm sorry as I haven't been playing with php for a while & I don't have a proper test environment for it, but I quickly devised a way of doing this in c# 3.5
pseudocode: build a table with strings of the specified length & a count of occurences next to it. Select where count > 1:
static void Main(string[] args)
{
string[] data = { "apple", "pinapple", "application", "nation" };
string[] result = my_func(3,data);
foreach (string str in result)
{
Console.WriteLine(str);
}
Console.ReadKey();
}
private static string[] my_func(int l, string[] data)
{
Dictionary<string,int> dict = new Dictionary<string,int>();
foreach (string str in data)
{
for (int i = 0; i < str.Length - l + 1; i++)
{
string part = str.Substring(i, l);
if (dict.ContainsKey(part))
{
dict[part]++;
}else {
dict.Add(part,1);
}
}
}
var result = from k in dict.Keys
where dict[k] > 1
orderby dict[k] descending
select k;
return result.ToArray<string>();
}
Upvotes: 2
Reputation: 25790
One obvious option is to use REGEX. I have no prior experience in this but this might be of help to you: http://dev.mysql.com/doc/refman/5.1/en/regexp.html
You'll need to find a suitable expression to match what you need.
Upvotes: 0