marlon

Reputation: 7703

How to extract the sublists enclosed by pairs of hash symbols in a list?

For example, from the 'tokens' list below, I want to extract the pair_list:

tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']

pair_list = [['a', 'b'], ['c'], ['g', 'h', 'g']]

I tried something like the code below, but it didn't work:

hashToken_begin_found = True
hashToken_end_found = False

previous_token = None

pair_list = []

for token in tokens:

    if hashToken_begin_found and not hashToken_end_found and previous_token and previous_token == '#':
        hashToken_begin_found = False
    elif not hashToken_begin_found:
        if token == '#':
            hashToken_begin_found = True
            hashToken_end_found = True
        else:
            ...

ADDITION:

My actual problem is more complicated. What's inside each pair of # symbols are words from social media, like hashtagged phrases on Twitter, but they are not English. I simplified the problem to illustrate it. The logic would be something like what I wrote: find the 'start' and 'end' of each # pair and extract what is between them. In my data, anything inside a pair of hash symbols is a phrase, e.g. "I live in #United States# and #New York#!". I need to get United States and New York. No regex. These words are already in a list.

Upvotes: 0

Views: 99

Answers (3)

no comment

Reputation: 10465

Another way:

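it = iter(tokens)
pair_list = []
# `'#' in it` advances the iterator up to and including the next opening '#'
while '#' in it:
    # iter(callable, sentinel) yields tokens until the closing '#', which it also consumes
    pair_list.append(list(iter(it.__next__, '#')))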
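To see why the loop above works, here is a small sketch on a shortened, made-up list (the names `demo`, `found`, `group` and `rest` are only for illustration). It shows that a membership test on an iterator consumes everything up to and including the match, and that the two-argument form of `iter` stops at, and swallows, the sentinel:

```python
demo = iter(['0', '#', 'a', 'b', '#', 'c'])

# Membership test on an iterator: consumes '0' and the opening '#'.
found = '#' in demo

# Two-argument iter: calls demo.__next__ until it returns the sentinel '#'.
group = list(iter(demo.__next__, '#'))

# Whatever is left after the closing '#'.
rest = list(demo)

print(found, group, rest)
```

Because both steps advance the same iterator, the next `'#' in it` check starts right after the previous closing hash.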

Yet another:

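pair_list = []
try:
    i = 0
    while True:
        # list.index raises ValueError once no further '#' exists
        i = tokens.index('#', i)      # opening hash
        j = tokens.index('#', i + 1)  # matching closing hash
        pair_list.append(tokens[i+1 : j])
        i = j + 1                     # resume after the closing hash
except ValueError:
    # no more complete pairs: done
    pass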

Upvotes: 0

Mad Physicist

Reputation: 114488

I think you're overcomplicating the issue here. Think of the parser as a very simple state machine. You're either in a sublist or not. Every time you hit a hash, you toggle the state.

When entering a sublist, make a new list. When inside a sublist, append to the current list. That's about it. Here's a sample:

pair_list = []
in_pair = False
for token in tokens:
    if in_pair:
        if token == '#':
            in_pair = False
        else:
            pair_list[-1].append(token)
    elif token == '#':
        pair_list.append([])
        in_pair = True
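As a rough sketch, the same state machine can be wrapped in a function and run on the sentence from the question's addition. The function name `extract_hash_groups` and the way the sentence is split into tokens are assumptions made for this example, not something given in the question:

```python
def extract_hash_groups(tokens):
    """Collect the tokens found between each opening and closing '#'."""
    groups = []
    in_group = False
    for token in tokens:
        if token == '#':
            if not in_group:
                groups.append([])    # opening hash: start a new group
            in_group = not in_group  # every hash toggles the state
        elif in_group:
            groups[-1].append(token)
    return groups


# Assumed tokenization of "I live in #United States# and #New York#!"
sentence = ['I', 'live', 'in', '#', 'United', 'States', '#',
            'and', '#', 'New', 'York', '#', '!']
phrases = extract_hash_groups(sentence)
print(phrases)  # [['United', 'States'], ['New', 'York']]
```

Tokens outside a pair ('I', 'live', 'in', 'and', '!') are skipped because the state is only "inside" between an opening and a closing hash.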

Upvotes: 2

U13-Forward

Reputation: 71610

You could try itertools.groupby in a single line:

from itertools import groupby
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']
# groupby is imported directly, so call it without the itertools. prefix
print([list(y) for x, y in groupby(tokens, key=lambda x: x.isalpha()) if x])

Output:

[['a', 'b'], ['c'], ['g', 'h', 'g']]

I group by the consecutive groups where the value is alphabetic.
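To make the grouping visible, this sketch prints every run that `groupby` produces with an `isalpha` key, then applies the same one-liner to an assumed tokenization of the sentence from the question's addition. The names `runs`, `sentence` and `naive` are illustrative only:

```python
from itertools import groupby

tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']

# Each run is (key, items): consecutive tokens sharing the same isalpha() value.
runs = [(is_alpha, list(group)) for is_alpha, group in groupby(tokens, key=str.isalpha)]
for is_alpha, group in runs:
    print(is_alpha, group)

# Assumed tokenization of "I live in #United States# and #New York#!"
sentence = ['I', 'live', 'in', '#', 'United', 'States', '#',
            'and', '#', 'New', 'York', '#', '!']
naive = [list(g) for k, g in groupby(sentence, key=str.isalpha) if k]
print(naive)
```

On the sample list the alphabetic runs happen to coincide with the hash-delimited groups. On the sentence, alphabetic words outside the hashes ('I', 'live', 'in', 'and') form runs of their own, so this approach only fits data where everything outside the hash pairs is non-alphabetic.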

If you want to use a for loop you could try:

l = [[]]
for i in tokens:
    if i.isalpha():
        l[-1].append(i)
    else:
        if l[-1]:
            l.append([])
print(l[:-1])

Output:

[['a', 'b'], ['c'], ['g', 'h', 'g']]

Upvotes: 1
