Get the most repeated similar fields in MySQL database

Question

Let's assume we have a database like:

Actions_tbl:

--------------------------------------------------------
id | Action_name                              | user_id|
--------------------------------------------------------
1  |  John reads one book                     | 1     
2  |  reading the book by john                | 1
3  |  Joe is jumping over fire                | 2
4  |  reading another book                    | 2
5  |  John reads the book in library          | 1
6  |  Joe read a    book                      | 2
7  |  read a book                             | 3
8  |  jumping with no reason is Ronald's habit| 3

Users_tbl:

-----------------------
user_id |    user_name |
-----------------------
1       |     John
2       |     Joe
3       |     Ronald
4       |     Araz
-----------------------

Wondering if I can choose the most repeated similar action regardless of it's user and replace my own user_name with its current user!

Read one book, reading the book, reading another book, read the book in library, read a book and read a book are the ones who have most common WORDS so the staffs related to reading the book is repeated 6 times, my system should show one of those six sentences randomly and replace Araz with user_name

Like: Araz reads the book

My Idea was to

select replace(a.action_name , b.user_name) from actions_tbl a, user_tble b where a.user_id = b.user_id group_by

and then check the similarities one by one in php using

levenshtein()

But this one doesn't have performance at all!

Assume that I want to do the same thing for a big db and for few different tables. This will destroy my server!!!

Any better IDEA?

in http://www.artfulsoftware.com/infotree/queries.php#552 the levenshtein() function is implemented as a MySQL function but firstly, do u think it has enough performance? and then, how to use it in my case? Maybe a self-join van fix this issue but I'm not that good with sql!

* similar action, are the actions that have more than X% common words

** More information and notes:**

I'm limited to PHP and MySQL.
This is just an example, in my real project the actions are long paragraphs. That's why the performance is a matter. The real scenario is: user inputted the description of its project for several projects, those data may be too similar(users would have the same area of work), I want to fill automatically(base on previous fillings) the description of next project, to save time.
I would appreciate if you can have any pragmatical Solution. I checked the NLP related solutions, although they r interesting, but I don't think many of them can be accurate and can be implemented using PHP.
The output should make sense and be a proper paragraph like all other projects. That's why I was thinking of choosing from previous ones.

Thanks for your intellectual answers, its really appreciated if you could shed some light on the situations

Gordon Linoff · Accepted Answer

What you are talking about is a text clustering process. You are trying to find similar pieces of text, and arbitrarily choosing one of them. I am not familiar with any database that does this form of text mining.

For what you describe, a pretty basic text mining technique would probably work. Create a term-document matrix with all the words except the user names. Then use singular value decomposition to get the largest singular value and vector (this is the first principal component of the correlation matrix). The similar activities should cluster along this line.

If you have a limited vocabulary and have the terms in a table, you could measure distance between two actions by the proportion of words that overlap. Do you have a list of all words in the actions?

Get the most repeated similar fields in MySQL database

Answers (2)

Related Questions