Reputation: 1

How to turn a list of bigrams into a list of tokens using Python

I want to turn a list of bigrams into a list of tokens using Python 3.6.

I have something like:

input_list = [('hi', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'x')]

I want to turn this into:

output_list = ['hi', 'my', 'name', 'is', 'x']

Upvotes: 0

Answers (3)

Sayandip Dutta

Reputation: 15872

If you do not want to create a separate list to store the flattened values, and would rather save space and avoid explicit loops, you can try this:

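from itertools import chain

lst = [('hi', 'my'), ('my', 'name'), ('name', 'is'), ('is', 'x')]
# Lazily flatten the tuples into a single iterator (no intermediate list)
flattened = chain(*lst)
# dict keys are unique, so only the first occurrence of each token is kept
elems = list(dict.fromkeys(flattened))
print(elems)  # ['hi', 'my', 'name', 'is', 'x']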

Here chain(*lst) unpacks the tuples and flattens them into an iterator object, as opposed to storing them in a list. You could convert those values to a set and back to a list, but that may scramble the ordering. Instead, use the values as the keys of a dictionary: since a dictionary cannot have duplicate keys, it keeps only the unique elements, and taking the keys of that dict gives you the unique elements of the flattened list. NOTE: Insertion order is guaranteed to be maintained from Python 3.7 onward; in CPython 3.6, dicts preserve it only as an implementation detail.
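One consequence of deduplicating is worth checking against your data: a token that legitimately occurs twice in the original sequence is also collapsed. A minimal sketch, assuming Python 3.7+ and using a hypothetical helper name unique_tokens (not part of the answer above):

```python
from itertools import chain


def unique_tokens(bigrams):
    """Flatten bigrams and keep only the first occurrence of each token.

    unique_tokens is a hypothetical name used for illustration.
    """
    # chain.from_iterable flattens lazily, like chain(*bigrams)
    return list(dict.fromkeys(chain.from_iterable(bigrams)))


# Every token is distinct, so the original sequence is recovered.
print(unique_tokens([("hi", "my"), ("my", "name"), ("name", "is"), ("is", "x")]))
# ['hi', 'my', 'name', 'is', 'x']

# Bigrams of the sequence a, b, a, c: the second "a" is dropped.
print(unique_tokens([("a", "b"), ("b", "a"), ("a", "c")]))
# ['a', 'b', 'c']
```

If repeated tokens must survive, an approach that relies on position rather than uniqueness is probably a better fit.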

Upvotes: 0

Daweo

Reputation: 36450

If all input follows that structure, I would take the first element of the first tuple, then the last element of every tuple, that is:

input_list = [("hi", "my"), ("my", "name"), ("name", "is"), ("is", "x")]
output_list = [input_list[0][0]] + [i[-1] for i in input_list]
print(output_list)  # ['hi', 'my', 'name', 'is', 'x']

I used the following Python features:

indexing: [0][0] means the first element of the first element (if that is not clear, I suggest searching for nesting first), and [-1] means the last element (the first element counting from the end)
list comprehension, to get the last element of every element of the list
list concatenation (denoted by +), to "glue" two lists together
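The same idea can be wrapped in a small function. This is only a sketch: the name bigrams_to_tokens and the empty-input guard are assumptions added for illustration, and it relies on the bigrams being consecutive and in order, as in the question.

```python
def bigrams_to_tokens(bigrams):
    """Rebuild the token sequence from consecutive, ordered bigrams.

    bigrams_to_tokens is a hypothetical helper name.
    """
    if not bigrams:
        # Without this guard, bigrams[0] raises IndexError on an empty list
        return []
    # First token of the first bigram, then the last token of every bigram
    return [bigrams[0][0]] + [pair[-1] for pair in bigrams]


# Round trip: build bigrams with zip, then recover the tokens.
tokens = ["a", "b", "a", "c"]
bigrams = list(zip(tokens, tokens[1:]))
print(bigrams)                     # [('a', 'b'), ('b', 'a'), ('a', 'c')]
print(bigrams_to_tokens(bigrams))  # ['a', 'b', 'a', 'c']
```

Because this works by position rather than by uniqueness, repeated tokens are kept.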

Upvotes: 0

Mack123456

Reputation: 386

You can start by using a list comprehension to flatten the list and then take a set of the result (note that a set is unordered, so the original token order is not preserved):

flat_list = [x for sublist in input_list for x in sublist]
output_list = set(flat_list)
output_list

{'hi', 'is', 'my', 'name', 'x'}
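If the nested comprehension is hard to read, it may help to see it written as explicit loops. A minimal sketch of the equivalent code, reusing input_list from the question:

```python
input_list = [("hi", "my"), ("my", "name"), ("name", "is"), ("is", "x")]

# Same as: [x for sublist in input_list for x in sublist]
flat_list = []
for sublist in input_list:  # outer "for" comes first in the comprehension
    for x in sublist:       # inner "for" comes second
        flat_list.append(x)

print(flat_list)
# ['hi', 'my', 'my', 'name', 'name', 'is', 'is', 'x']

# set() removes the duplicates but does not keep any particular order
unique = set(flat_list)
print(len(unique))  # 5
```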

Upvotes: 1
