Austin

Reputation: 23

How to identify sentences related to a topic?

I'm doing a project that requires me to sort the sentences of a document into matching topics.

For example, I have 4 topics, which are Lecture, Tutor, Lab and Exam. I have some sentences, such as:

  1. Lecture was engaging
  2. Tutor is very nice and active
  3. The content of lecture was too much for 2 hours.
  4. Exam seem to be too difficult compare with weekly lab.

Now I want to sort these sentences into the topics above; the result should be:

  • Lecture: sentences 1 and 3

  • Tutor: sentence 2

  • Lab and Exam: sentence 4

I did some research, and most of the instructions I found use LDA topic modeling. But that doesn't seem to solve my problem because, as far as I know, LDA discovers latent topics in documents, and I don't know how to pre-choose the topics manually.

Could anyone help me, please? I'm stuck on this.

Upvotes: 2

Views: 1771

Answers (4)

alvas

Reputation: 122052

This is an excellent example of when to use something smarter than string matching =)

Let's consider this:

  • Is there a way to convert each word to a vector form (i.e. an array of floats)?

  • Is there a way to convert each sentence to the same vector form (i.e. an array of floats with the same dimensions as the word's vector form)?


First, let's get a vocabulary of all the possible words in your list of sentences (let's call it a corpus):

>>> from itertools import chain
>>> from nltk import word_tokenize
>>> s1 = "Lecture was engaging"
>>> s2 = "Tutor is very nice and active"
>>> s3 = "The content of lecture was too much for 2 hours."
>>> s4 = "Exam seem to be too difficult compare with weekly lab."
>>> list(map(word_tokenize, [s1, s2, s3, s4]))
[['Lecture', 'was', 'engaging'], ['Tutor', 'is', 'very', 'nice', 'and', 'active'], ['The', 'content', 'of', 'lecture', 'was', 'too', 'much', 'for', '2', 'hours', '.'], ['Exam', 'seem', 'to', 'be', 'too', 'difficult', 'compare', 'with', 'weekly', 'lab', '.']]
>>> vocab = sorted(set(token.lower() for token in chain(*list(map(word_tokenize, [s1, s2, s3, s4])))))
>>> vocab
['.', '2', 'active', 'and', 'be', 'compare', 'content', 'difficult', 'engaging', 'exam', 'for', 'hours', 'is', 'lab', 'lecture', 'much', 'nice', 'of', 'seem', 'the', 'to', 'too', 'tutor', 'very', 'was', 'weekly', 'with']

Now let's represent the 4 topic keywords as one-hot vectors, using each word's index in the vocabulary:

>>> lecture = [1 if token == 'lecture' else 0 for token in vocab]
>>> lab = [1 if token == 'lab' else 0 for token in vocab]
>>> tutor = [1 if token == 'tutor' else 0 for token in vocab]
>>> exam = [1 if token == 'exam' else 0 for token in vocab]
>>> lecture
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> lab
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
>>> tutor
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]
>>> exam
[0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Similarly, we loop through each sentence and convert it into vector form:

>>> [token.lower() for token in word_tokenize(s1)]
['lecture', 'was', 'engaging']
>>> s1_tokens = [token.lower() for token in word_tokenize(s1)]
>>> s1_vec = [1 if token in s1_tokens else 0  for token in vocab]
>>> s1_vec
[0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0]

Repeating the same for all sentences:

>>> s2_tokens = [token.lower() for token in word_tokenize(s2)]
>>> s3_tokens = [token.lower() for token in word_tokenize(s3)]
>>> s4_tokens = [token.lower() for token in word_tokenize(s4)]
>>> s2_vec = [1 if token in s2_tokens else 0  for token in vocab]
>>> s3_vec = [1 if token in s3_tokens else 0  for token in vocab]
>>> s4_vec = [1 if token in s4_tokens else 0  for token in vocab]
>>> s2_vec
[0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0]
>>> s3_vec
[1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0]
>>> s4_vec
[1, 0, 0, 0, 1, 1, 0, 1, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 1, 1, 0, 0, 0, 1, 1]

Now, given the vectorial form of the sentences and words, you can use a similarity function, e.g. cosine similarity:

>>> from numpy import dot
>>> from numpy.linalg import norm
>>> 
>>> cos_sim = lambda x, y: dot(x,y)/(norm(x)*norm(y))
>>> cos_sim(s1_vec, lecture)
0.5773502691896258
>>> cos_sim(s1_vec, lab)
0.0
>>> cos_sim(s1_vec, exam)
0.0
>>> cos_sim(s1_vec, tutor)
0.0

Now, doing it more systematically:

>>> topics = {'lecture': lecture, 'lab': lab, 'exam': exam, 'tutor':tutor}
>>> topics
{'lecture': [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'lab':     [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'exam':    [0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0], 
 'tutor':   [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0]}


>>> sentences = {'s1':s1_vec, 's2':s2_vec, 's3':s3_vec, 's4':s4_vec}

>>> for s_num, s_vec in sentences.items():
...     print(s_num)
...     for name, topic_vec in topics.items():
...         print('\t', name, cos_sim(s_vec, topic_vec))
... 
s1
     lecture 0.5773502691896258
     lab 0.0
     exam 0.0
     tutor 0.0
s2
     lecture 0.0
     lab 0.0
     exam 0.0
     tutor 0.4082482904638631
s3
     lecture 0.30151134457776363
     lab 0.0
     exam 0.0
     tutor 0.0
s4
     lecture 0.0
     lab 0.30151134457776363
     exam 0.30151134457776363
     tutor 0.0

I guess you get the idea. But we see that the scores are still tied for s4-lab vs s4-exam. So the question becomes, "Is there a way to make them diverge?" and you will jump into the rabbit hole of:

  • How best to represent the sentence/word as a fixed-size vector?

  • What similarity function to use to compare "topic"/word vs sentence?

  • What is a "topic"? What does the vector actually represent?

The approach above uses what is usually called a one-hot vector to represent the word/sentence. There's a lot more complexity to "identifying sentences related to a topic" (aka document clustering/classification) than simply comparing strings. E.g. can a document/sentence have more than one topic?
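To tie the walkthrough together, here is a self-contained sketch that assigns each sentence its single highest-scoring topic via an argmax over the cosine scores. It uses a plain whitespace tokenizer instead of NLTK's word_tokenize (so the vocabulary differs slightly from the transcript above, e.g. no '.' token):

```python
from numpy import dot
from numpy.linalg import norm

sentences = {
    "s1": "Lecture was engaging",
    "s2": "Tutor is very nice and active",
    "s3": "The content of lecture was too much for 2 hours.",
    "s4": "Exam seem to be too difficult compare with weekly lab.",
}
topics = ["lecture", "tutor", "lab", "exam"]

def tokenize(text):
    # plain whitespace tokenizer standing in for nltk's word_tokenize
    return text.lower().replace(".", " ").split()

# vocabulary over the whole corpus
vocab = sorted({tok for s in sentences.values() for tok in tokenize(s)})

def one_hot(tokens):
    return [1 if term in tokens else 0 for term in vocab]

def cos_sim(x, y):
    return dot(x, y) / (norm(x) * norm(y))

for name, sent in sentences.items():
    s_vec = one_hot(set(tokenize(sent)))
    scores = {t: cos_sim(s_vec, one_hot({t})) for t in topics}
    best = max(scores, key=scores.get)  # ties (s4: lab vs exam) go to the first topic listed
    print(name, "->", best, round(scores[best], 3))
```

Note the tie-breaking is arbitrary; as discussed below, making the scores diverge is where the real work starts.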

Do look up these keywords to further understand the problem: "natural language processing", "document classification", "machine learning". Meanwhile, if you don't mind, I guess this question is close to being "too broad".

Upvotes: 6

vash_the_stampede

Reputation: 4606

Solution

filename = "information.txt"


library = {"lecture": 0, "tutor": 0, "exam": 0}

with open(filename) as f_obj:
    content = f_obj.read() # read text into contents

words = content.lower().split() # create list of all words in content

for k in library:
    for word in words:
        if k in word: # substring match: "lecture" also matches "lectures."
            library[k] += 1 # update the count stored in the dict

for k, v in library.items():
    print(k.title() + ": "  + str(v))

Output

(xenial)vash@localhost:~/pcc/12/alien_invasion_2$ python3 helping_topic.py 
Tutor: 1
Lecture: 2
Exam: 1
(xenial)vash@localhost:~/pcc/12/alien_invasion_2$

This method counts repeated occurrences for you
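One caveat: the membership check in the loop above is a substring test, so "lecture" would also count words like "lectures" or "lecturer". If you want exact word counts instead, here is a sketch (the sample content string is made up to mirror the question's sentences, standing in for the file's contents):

```python
# made-up sample text standing in for the contents of information.txt
content = ("Lecture was engaging. Tutor is very nice and active. "
           "The content of lecture was too much for 2 hours. "
           "Exam seem to be too difficult compare with weekly lab.")

library = {"lecture": 0, "tutor": 0, "exam": 0}
words = content.lower().split()

for k in library:
    # strip trailing punctuation so "lecture." still counts as an exact match
    library[k] = sum(1 for w in words if w.strip(".,") == k)

print(library)
```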

Enjoy!

Upvotes: 0

emsimpson92

Reputation: 1778

I'm assuming you're reading from a text file or something. Here is how I would go about doing this.

keywords = {"lecture": 0, "tutor": 0, "exam": 0}

with open("file.txt", "r") as f:
    for line in f:
        for key in keywords:
            if key in line.lower():
                keywords[key] += 1 # update the dict entry, not a local copy

print(keywords)

This searches each line for each word in your keywords dictionary and, if a match is found, increments that key's count (at most once per line).

You shouldn't need any external libraries or anything for this.

Upvotes: 0

Frogmonkey

Reputation: 45

Just name the variables after the topics you want:

lecture = 2
tutor = 1
exam = 1

You can use variable_name += 1 to increment each variable.

Upvotes: -2
