Chiefir
Chiefir

Reputation: 2671

Python: how to sort a list of strings by substring relevance?

I have some list of strings, for example:

["foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT",
"SOME foo TEXT",
"foo bar SOME TEXT baz",     
"SOME TEXT"]

I want it to be sorted by exactness to SOME TEXT substring (upper case doesn't matter). Something like this order:

["SOME TEXT",
"foo bar SOME TEXT baz",
"SOME foo TEXT",
"foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT"]

The idea is - the best score gets the string with the best match to substring words position. And for bigger amount of "sloppy" words between substring's words - the lower ordering it gets.

I have found some libraries like fuzzyset, or Levenshtein distance but I'm not sure this is what I need. I know the exact substring by what I want to sort and those libs search the similar words, as I understood.

Actually I need to do this sort after some database query (Postgresql) in my Django project. I have already tried full-text search with its ORM, but didn't get this relevant sort order (it doesn't count the distance between substring words). Next I have tried Haystack+Whoosh, but also at this moment didn't find info how to do this sort there. So idea now is to get query set and next sort it out of the database (yep, I know that might be a bad decision, but for now I want it just work). But if anybody tells me how to do this within any of technologies, I have mentioned here - that will be also super cool. Thank you!

p.s. The length of substring supposed to be 2-10 words in max 20 word string.

Upvotes: 1

Views: 3378

Answers (3)

Sash Sinha
Sash Sinha

Reputation: 22360

You can use difflib.SequenceMatcher, to achieve something very similar to your desired output:

>>> import difflib
>>> l = ["foo bar SOME baz TEXT bob", "SOME foo bar baz bob TEXT", "SOME foo TEXT", "foo bar SOME TEXT baz", "SOME TEXT"]
>>> sorted(l, key=lambda z: difflib.SequenceMatcher(None, z, "SOME TEXT").ratio(), reverse=True)
['SOME TEXT', 'SOME foo TEXT', 'foo bar SOME TEXT baz', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']

If you can't tell the only difference is that the position of the two elements "foo bar SOME TEXT baz" and "SOME foo TEXT" are swapped compared to your desired output.

Upvotes: 7

CodeCollector
CodeCollector

Reputation: 476

Here is my take on it.

l = ["foo bar SOME baz TEXT bob",
"SOME foo bar baz bob TEXT",
"SOME foo TEXT",
"foo bar SOME TEXT baz",     
"SOME TEXT"]

l.sort(key=lambda x: (x.find("SOME")-x.find("TEXT"))*0.9-0.1*x.find("SOME"), reverse=True)

print(l)

OUTPUT:

['SOME TEXT', 'foo bar SOME TEXT baz', 'SOME foo TEXT', 'foo bar SOME baz TEXT bob', 'SOME foo bar baz bob TEXT']

So what we have done is sorted the list based on major weight to the distance between "SOME" and "TEXT" and some minor weight to the occurrence of "SOME" in the string.

Another longer way would be to first group the list based on the their distance between SOME and TEXT. And then sort the each group based on the position of "SOME".

Upvotes: 0

Prune
Prune

Reputation: 77837

See your friendly neighborhood sorting tutorial. You'll need a sort with a key. Here's a trivial function to give you the idea; it finds the distance between the two words, returning that as the difference metric.

sentence = ["foo bar SOME baz TEXT bob",
            "SOME foo bar baz bob TEXT",
            "SOME foo TEXT",
            "foo bar SOME TEXT baz",
            "SOME TEXT"]

def match_score(sentence):
    some_pos = sentence.find("SOME")
    text_pos = sentence.find("TEXT")
    return abs(text_pos - some_pos)

sentence.sort(key = lambda x: match_score(x))

for item in sentence:
    print(item)

Output:

foo bar SOME TEXT baz
SOME TEXT
foo bar SOME baz TEXT bob
SOME foo TEXT
SOME foo bar baz bob TEXT

Upvotes: 1

Related Questions