Andy P

Reputation: 111

Text Classification and Searching in Python

I am working on a small project and need some help with searching for text in strings.

Let's say I have a primary string, string1, such as: Loan Coordinator

Let's say I have another string, string2, such as: Financial Student Loan Coordinator

Let's say I have another string, string3, such as: Loan Operator

Let's say I have another string, string4, such as: Coordinator

Let's say I have another string, string5, such as: Financial Assistant

. .

In Python, what would be the best approach to find all strings that have something to do with string1? For example:

String 2 is related to string 1 because it contains the text Loan Coordinator.

String 3 is related because of the word Loan.

String 4 is related because of the word Coordinator.

String 5 is unrelated, so I don't care about this string.

Strings 2, 3, and 4 should return FOUND, or something else that indicates a partial match is present.

..

Thanks for all assistance!

Upvotes: 0

Views: 274

Answers (2)

Tom Dalton

Reputation: 6190

#!/usr/bin/env python
import sys


def tokenise(s):
    # Lower-case every whitespace-separated word so matching ignores case.
    return {word.lower() for word in s.split()}


def match_strings(primary, secondary):
    primary_tokens = tokenise(primary)
    secondary_tokens = tokenise(secondary)

    # Words common to both strings; sorted so the output order is stable.
    matches = sorted(primary_tokens.intersection(secondary_tokens))
    if matches:
        print("{} matches because of {}".format(secondary, ", ".join(matches)))
    else:
        print("{} doesnt match".format(secondary))


if __name__ == "__main__":
    primary = sys.argv[1]
    secondaries = sys.argv[2:]

    for secondary in secondaries:
        match_strings(primary, secondary)

Running the code:

~/string_matcher.py "Loan Coordinator" "Financial Student Loan Coordinator" "Loan Operator" "Coordinator" "Financial Assistant"
Financial Student Loan Coordinator matches because of coordinator, loan
Loan Operator matches because of loan
Coordinator matches because of coordinator
Financial Assistant doesnt match

Upvotes: 1

Cory Kramer

Reputation: 118011

You can use set intersection. Make a set of unique words in your string to compare against. Then take the intersection with the set of words from each of the other strings. Keep any string that has a non-empty intersection.

>>> s1 = 'Loan Coordinator'
>>> sList = ['Financial Student Loan Coordinator', 'Loan Operator', 'Coordinator', 'Financial Assistant']

>>> unique = set(s1.split())  # unique words in string 1

>>> [i for i in sList if unique & set(i.split())]
['Financial Student Loan Coordinator', 'Loan Operator', 'Coordinator']
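
Note that the comparison above is case-sensitive, so a string like 'loan operator' would not be kept. If that matters, here is a minimal sketch of a case-insensitive variant that also records which words matched. The function name find_related and the lower-cased test string are my own additions for illustration, not something from the question, and words are still split on whitespace only, so punctuation is not handled.

```python
def find_related(primary, candidates):
    """Map each candidate that shares a word with primary to the shared words."""
    # Lower-case the words so 'Loan' and 'loan' compare equal.
    primary_words = set(primary.lower().split())
    related = {}
    for candidate in candidates:
        shared = primary_words & set(candidate.lower().split())
        if shared:
            # Sort so the reported words come out in a stable order.
            related[candidate] = sorted(shared)
    return related


s1 = 'Loan Coordinator'
# 'loan operator' is deliberately lower-cased here (hypothetical input).
sList = ['Financial Student Loan Coordinator', 'loan operator',
         'Coordinator', 'Financial Assistant']

for text, words in find_related(s1, sList).items():
    print("FOUND: {} ({})".format(text, ", ".join(words)))
```

Anything missing from the returned dict (here, 'Financial Assistant') had no word in common with the primary string.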

Upvotes: 1
