Reputation: 7703
For example, from the 'tokens' list below, I want to extract the pair_list:
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']
pair_list = [['a', 'b'], ['c'], ['g', 'h', 'g']]
I tried something like the code below, but it hasn't worked:
hashToken_begin_found = True
hashToken_end_found = False
previous_token = None
pair_list = []
for token in tokens:
    if hashToken_begin_found and not hashToken_end_found and previous_token and previous_token == '#':
        hashToken_begin_found = False
    elif not hashToken_begin_found:
        if token == '#':
            hashToken_begin_found = True
            hashToken_end_found = True
        else:
            ...
ADDITION:
My actual problem is more complicated. What's inside each pair of # symbols are words from social media, like hashtagged phrases on Twitter, but they are not English. I simplified the problem to illustrate it. The logic would be something like what I wrote: find the 'start' and 'end' of each # pair and extract what's between them. In my data, anything inside a pair of hash signs is a phrase, e.g. "I live in #United States# and #New York#!". I need to get United States and New York. No regex. These words are already in a list.
Upvotes: 0
Views: 99
Reputation: 10465
Another way:
it = iter(tokens)
pair_list = []
while '#' in it:
    pair_list.append(list(iter(it.__next__, '#')))
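The loop above leans on two iterator behaviours that are easy to miss: a membership test on an iterator consumes items up to and including the first match, and the two-argument form iter(callable, sentinel) keeps calling the callable until it returns the sentinel. A minimal sketch of one pass through that loop, spelled out step by step on a shortened token list chosen here for illustration:

```python
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#']

it = iter(tokens)

# `'#' in it` advances the iterator past the opening '#'
# (here it consumes '0' and the first '#').
found_open = '#' in it

# iter(callable, sentinel) yields values until the callable returns '#',
# so this collects everything up to, but not including, the closing '#'.
first_group = list(iter(it.__next__, '#'))

# The iterator now sits just after the closing '#'.
rest = list(it)

print(found_open)   # True
print(first_group)  # ['a', 'b']
print(rest)         # ['#', 'c', '#']
```

Because both steps share the same iterator, each pass of the while loop picks up exactly where the previous one stopped.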
Yet another:
pair_list = []
try:
    i = 0
    while True:
        i = tokens.index('#', i)
        j = tokens.index('#', i + 1)
        pair_list.append(tokens[i+1 : j])
        i = j + 1
except ValueError:
    pass
Upvotes: 0
Reputation: 114488
I think you're overcomplicating the issue here. Think of the parser as a very simple state machine. You're either in a sublist or not. Every time you hit a hash, you toggle the state.
When entering a sublist, make a new list. When inside a sublist, append to the current list. That's about it. Here's a sample:
pair_list = []
in_pair = False
for token in tokens:
    if in_pair:
        if token == '#':
            in_pair = False
        else:
            pair_list[-1].append(token)
    elif token == '#':
        pair_list.append([])
        in_pair = True
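To try this against the phrase example from the question, the same loop can be wrapped in a function. This is only a sketch: the name extract_pairs, the marker parameter, and the way the sentence is split into tokens are assumptions made here, not something given in the question.

```python
def extract_pairs(tokens, marker='#'):
    """Collect the tokens between each opening and closing marker."""
    pair_list = []
    in_pair = False
    for token in tokens:
        if in_pair:
            if token == marker:
                in_pair = False              # closing marker: leave the sublist
            else:
                pair_list[-1].append(token)  # inside a pair: keep the token
        elif token == marker:
            pair_list.append([])             # opening marker: start a new sublist
            in_pair = True
    return pair_list


# Hypothetical tokenisation of "I live in #United States# and #New York#!"
words = ['I', 'live', 'in', '#', 'United', 'States', '#',
         'and', '#', 'New', 'York', '#', '!']

phrases = [' '.join(group) for group in extract_pairs(words)]
print(phrases)  # ['United States', 'New York']
```

If the input ends while still inside a pair, the last sublist is kept as it is, so an unmatched trailing marker is worth checking for in real data.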
Upvotes: 2
Reputation: 71610
You could try itertools.groupby in a single line:
from itertools import groupby
tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']
print([list(y) for x, y in groupby(tokens, key=lambda x: x.isalpha()) if x])
Output:
[['a', 'b'], ['c'], ['g', 'h', 'g']]
This groups consecutive tokens by whether they are alphabetic and keeps only the alphabetic runs.
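To see what groupby produces before the filter is applied, here is a small sketch that prints each key with its group. It also shows a limitation worth checking against real data: the grouping looks at whether a token is alphabetic, not at the '#' markers, so an alphabetic token outside any pair is picked up too. The second token list is made up here for illustration.

```python
from itertools import groupby

tokens = ['0', '#', 'a', 'b', '#', '#', 'c', '#', '#', 'g', 'h', 'g', '#']

# Each run of consecutive tokens with the same key value becomes one group.
runs = [(is_alpha, list(group))
        for is_alpha, group in groupby(tokens, key=str.isalpha)]
for is_alpha, group in runs:
    print(is_alpha, group)
# False ['0', '#']
# True ['a', 'b']
# False ['#', '#']
# True ['c']
# False ['#', '#']
# True ['g', 'h', 'g']
# False ['#']

# Made-up input where a word sits outside the '#' pair.
outside = ['x', '#', 'a', '#']
grouped = [list(group)
           for is_alpha, group in groupby(outside, key=str.isalpha)
           if is_alpha]
print(grouped)  # [['x'], ['a']]
```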
If you want to use a for loop, you could try:
l = [[]]
for i in tokens:
    if i.isalpha():
        l[-1].append(i)
    else:
        if l[-1]:
            l.append([])
print(l[:-1])
Output:
[['a', 'b'], ['c'], ['g', 'h', 'g']]
Upvotes: 1