Reputation: 1679
I have a list of text sequences that look like this:
sequences = [
['okay', ''],
['ahead', 'fred', ''],
['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
['um', ''],
['let', 'me', 'think', '']
]
I want to create a one-hot vector for each sequence that counts the occurrence of certain words from a list. The list of words to look for is here:
keywords = ['i', 'you', 'we']
Ultimately, I want to loop through each text sequence and return the following (where 0 means the keyword was not present and 1 means it was):
seq_to_vec = [
[0,0,0],
[0,0,0],
[1,1,0],
[0,0,0],
[0,0,0]
]
How do I do this?
Upvotes: 0
Views: 81
Reputation: 881113
That's a fairly simple (well, simple for Python) list comprehension:
[[1 if keyword in sequence else 0 for keyword in keywords] for sequence in sequences]
The following complete program shows this in action:
sequences = [
['okay', ''],
['ahead', 'fred', ''],
['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
['um', ''],
['let', 'me', 'think', '']
]
keywords = ['i', 'you', 'we']
print([[1 if keyword in sequence else 0 for keyword in keywords] for sequence in sequences])
As expected, the output is:
[[0, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]]
Note that this is based on your "where 0 means the keyword was not present and 1 means it was" text, meaning it doesn't cater for the same word appearing twice. If you duplicate 'i' in the third sequence, you'll still only get 1 in that position rather than 2.
If you want an actual count rather than a 0/1 presence indicator (based on your "counts the occurrence" text), it's a little more complex, but still using the same basic idea:
[[sum([1 if keyword == word else 0 for word in sequence]) for keyword in keywords] for sequence in sequences]
Duplicating 'i' in the third sequence will then deliver you:
[[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0]]
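To see the two variants side by side, here is a minimal sketch that assumes the third sequence has been edited to contain 'i' twice. The function names presence_vectors and count_vectors are made up for this illustration, and the count version uses a generator expression inside sum() rather than a list, which should behave the same way:

```python
# Same data as in the question, except that 'i' is duplicated in the
# third sequence (an assumption made purely for this illustration).
sequences = [
    ['okay', ''],
    ['ahead', 'fred', ''],
    ['i', 'i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
    ['um', ''],
    ['let', 'me', 'think', ''],
]
keywords = ['i', 'you', 'we']


def presence_vectors(sequences, keywords):
    # 1 if the keyword appears anywhere in the sequence, otherwise 0.
    return [[1 if keyword in sequence else 0 for keyword in keywords]
            for sequence in sequences]


def count_vectors(sequences, keywords):
    # Number of words in the sequence that equal the keyword.
    return [[sum(1 for word in sequence if word == keyword) for keyword in keywords]
            for sequence in sequences]


print(presence_vectors(sequences, keywords))
print(count_vectors(sequences, keywords))
```

The first print should show 1 in the 'i' position of the third vector, and the second should show 2 there.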
Upvotes: 1
Reputation: 889
Here's a possible solution, using a list comprehension and the list count() method:
def sequences_to_num_of_occurrences_vector(sequences, keywords):
    # For each sequence, count how many times each keyword occurs in it.
    return [[seq.count(k) for k in keywords] for seq in sequences]
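A short usage sketch with the data from the question may help here. The name count_keywords is a hypothetical stand-in for the function above, and the min(1, ...) clamp is just one assumed way of turning the counts back into the 0/1 indicators the question asks for:

```python
def count_keywords(sequences, keywords):
    # Hypothetical stand-in for the function above: list.count() returns
    # how many times a value occurs in the list.
    return [[seq.count(k) for k in keywords] for seq in sequences]


sequences = [
    ['okay', ''],
    ['ahead', 'fred', ''],
    ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
    ['um', ''],
    ['let', 'me', 'think', ''],
]
keywords = ['i', 'you', 'we']

counts = count_keywords(sequences, keywords)
print(counts)

# Clamp each count to at most 1 to get presence flags instead of counts.
flags = [[min(1, n) for n in row] for row in counts]
print(flags)
```

With the question's data no keyword occurs more than once, so both lists should come out identical. They would only differ for a sequence that repeats a keyword.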
Upvotes: 0