chris
chris

Reputation: 8699

php (fuzzy) search matching

if anyone has ever submitted a story to digg, it checks whether or not the story is already submitted, I assume by a fuzzy search.

I would like to implement something similar and want to know if they are using a php class that is open source?

Soundex isnt doing it, sentences/strings can be up to 250chars in length

Upvotes: 9

Views: 14368

Answers (3)

John Kramlich
John Kramlich

Reputation: 2260

I would suggest taking the user's submitted URLs and storing them in multiple parts: domain name, path and query string. Use the PHP parse_url() function to derive the parts of the submitted URL.

Index at least the domain name and path. Then, when a new user submits URL you search your database for a record matching the domain and path. Since the columns are indexed, you will be filtering out first all records that are not in the same domain, and then searching through the remaining records. Depending on your dataset, this should be faster that simply indexing the entire URL. Make sure your WHERE clause is setup in the right order.

If that does not meet your needs I would suggest trying Sphinx. Sphinx is an open source SQL full text search engine that is far faster that MySQL's built in full-text search. It supports stemming and some other nice features.

http://sphinxsearch.com/

You could also take the title or text content of the users submission, run it through a function to generate keywords, and search the database for existing records with those or similar keywords.

Upvotes: 2

pp19dd
pp19dd

Reputation: 3633

Unfortunately, doing this in PHP is prohibitively expensive (high CPU and memory utilization.) However, you can certainly apply the algorithm to small data sets.

To specifically expand on how you can create a server meltdown: couple of built-in PHP functions will determine "distance" between strings: levenshtein and similar_text.

Dummy data: (pretend they're news headlines)

$titles = <<< EOF
Apple
Apples
Orange
Oranges
Banana
EOF;

$titles = explode("\n", $titles );

At this point, $titles should just be an array of strings. Now, create a matrix and compare each headline against EVERY other headline for similarity. In other words, for 5 headlines, you will get a 5 x 5 matrix (25 entries.) That's where the CPU and memory sink goes in.

That's why this method (via PHP) can't be applied to thousands of entries. But if you wanted to:

$matches = array();
foreach( $titles as $title ) {
    $matches[$title] = array();
    foreach( $titles as $compare_to ) {
        $matches[$title][$compare_to] = levenshtein( $compare_to, $title );
    }
    asort( $matches[$title], SORT_NUMERIC  );
}

At this point what you basically have is a matrix with "text distances." In concept (not in real data) it looks sort of like this table below. Note how there is a set of 0 values that go diagonally - that means that in the matching loop, two identical words are -- well, identical.

       Apple Apples Orange Oranges Banana
Apple    0     1      5      6       6
Apples   1     0      6      5       6
Orange   5     6      0      1       5
Oranges  6     5      1      0       5
Banana   6     6      5      5       0

The actual $matches array looks sort of like this (truncated):

Array
(
    [Apple] => Array
        (
            [Apple] => 0
            [Apples] => 1
            [Orange] => 5
            [Banana] => 6
            [Oranges] => 6
        )

    [Apples] => Array
        (
      ...

Anyhow, it's up to you to (by experimentation) determine what a good numerical distance cutoff might mostly match - and then apply it. Otherwise, read up on sphinx-search and use it - since it does have PHP libraries.

Orange you glad you asked about this?

Upvotes: 7

Pete
Pete

Reputation: 1783

You could (depending on the size of your dataset) use mySQL's FULLTEXT search, and look for item(s) that have a high score and are within a certain timeframe, and suggest this/these to the user.

More about score here: MySQL Fulltext Search Score Explained

Upvotes: 1

Related Questions