connor449

Reputation: 1679

How to create custom one hot encoding by keywords on text sequences

I have a list of text sequences that look like this:

    sequences = [
        ['okay', ''],
        ['ahead', 'fred', ''],
        ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
        ['um', ''],
        ['let', 'me', 'think', '']
    ]

I want to create a one hot vector for each sequence that counts the occurrence of certain words from a list. The list of words to look for is here:


    keywords = ['i', 'you', 'we']

Ultimately, I want to loop through each text sequence and return the following (where 0 means the keyword was not present and 1 means it was):


    seq_to_vec = [
        [0,0,0],
        [0,0,0],
        [1,1,0],
        [0,0,0],
        [0,0,0]
    ]

How do I do this?

Upvotes: 0

Views: 81

Answers (2)

paxdiablo

Reputation: 881113

That's a fairly simple (well, simple for Python) list comprehension:

    [[1 if keyword in sequence else 0 for keyword in keywords] for sequence in sequences]

The following complete program shows this in action:

    sequences = [
        ['okay', ''],
        ['ahead', 'fred', ''],
        ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
        ['um', ''],
        ['let', 'me', 'think', '']
    ]
    keywords = ['i', 'you', 'we']

    print([[1 if keyword in sequence else 0 for keyword in keywords] for sequence in sequences])

As expected, the output is:

    [[0, 0, 0], [0, 0, 0], [1, 1, 0], [0, 0, 0], [0, 0, 0]]

Note that this is based on your "where 0 means the keyword was not present and 1 means it was" text, meaning it doesn't cater for the same word appearing twice. If you duplicate 'i' in the third sequence, you'll still only get 1 in that position rather than 2.
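If the sequences or the keyword list get long, the same presence check can be done against a set built once per sequence. This is only a sketch: the function name `presence_vectors` is made up for illustration, and 'i' is deliberately duplicated in the third sequence to show that the result stays at 1.

```python
def presence_vectors(sequences, keywords):
    """Return one 0/1 vector per sequence, one entry per keyword."""
    vectors = []
    for sequence in sequences:
        words = set(sequence)  # set membership tests are O(1) on average
        vectors.append([int(keyword in words) for keyword in keywords])
    return vectors


sequences = [
    ['okay', ''],
    ['ahead', 'fred', ''],
    # 'i' appears twice here on purpose (not in the original data)
    ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', 'i', ''],
    ['um', ''],
    ['let', 'me', 'think', ''],
]
keywords = ['i', 'you', 'we']

print(presence_vectors(sequences, keywords))
```

For five short sequences this makes no practical difference; it mainly avoids rescanning each list once per keyword.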

If you want an actual count rather than a 0/1 presence indicator (based on your "counts the occurrence" text), it's a little more complex, but still using the same basic idea:

    [[sum([1 if keyword == word else 0 for word in sequence]) for keyword in keywords] for sequence in sequences]

Duplicating 'i' in the third sequence will then give you:

    [[0, 0, 0], [0, 0, 0], [2, 1, 0], [0, 0, 0], [0, 0, 0]]
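A possible alternative for the counting case is `collections.Counter` from the standard library, which tallies each sequence once and returns 0 for words it never saw. The sketch below is just one way to write it: the function name `count_keywords` is hypothetical, and the third sequence again has 'i' duplicated for illustration.

```python
from collections import Counter


def count_keywords(sequences, keywords):
    """Return one count vector per sequence, one entry per keyword."""
    result = []
    for sequence in sequences:
        counts = Counter(sequence)  # word -> number of occurrences
        # Looking up a missing key in a Counter gives 0 rather than KeyError
        result.append([counts[keyword] for keyword in keywords])
    return result


sequences = [
    ['okay', ''],
    ['ahead', 'fred', ''],
    # 'i' appears twice here on purpose (not in the original data)
    ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', 'i', ''],
    ['um', ''],
    ['let', 'me', 'think', ''],
]
keywords = ['i', 'you', 'we']

print(count_keywords(sequences, keywords))
```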

Upvotes: 1

Reuven Chacha

Reputation: 889

Here's a possible solution, using a list comprehension and the list count() method:

    def sequences_to_num_of_occurrences_vector(sequences, keywords):
        # For each sequence, count how many times each keyword appears in it
        return [[seq.count(k) for k in keywords] for seq in sequences]
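Note that count() returns the number of occurrences, so a repeated keyword gives a value above 1. With the data from the question the result happens to match the 0/1 output that was asked for. A small sketch of both behaviours, using the illustrative names `keyword_counts` and `keyword_presence` (neither comes from the question) and one extra made-up sequence with a repeated word:

```python
def keyword_counts(sequences, keywords):
    # list.count(x) returns how many times x occurs in the list
    return [[seq.count(k) for k in keywords] for seq in sequences]


def keyword_presence(sequences, keywords):
    # Clamp each count to 1 so the result is a 0/1 indicator
    return [[min(seq.count(k), 1) for k in keywords] for seq in sequences]


sequences = [
    ['okay', ''],
    ['ahead', 'fred', ''],
    ['i', 'dont', 'remember', 'you', 'want', 'to', 'go', ''],
    ['um', ''],
    ['let', 'me', 'think', ''],
]
keywords = ['i', 'you', 'we']

# Made-up sequence with a repeated keyword, for illustration only
repeated = [['i', 'i', 'we']]

print(keyword_counts(sequences, keywords))
print(keyword_counts(repeated, keywords))
print(keyword_presence(repeated, keywords))
```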

Upvotes: 0
