Reputation: 23
I'm trying to write an algorithm that goes through a list of strings, joins strings together if they meet a certain criterion, then skips ahead by the number of strings it joined, so that sections of the same joined string are not counted twice.
I understand that i = i + x or i += x doesn't change how far each iteration of a for loop advances, so I'm looking for another way to skip a variable number of iterations.
Background: I'm trying to create a named entity recognition algorithm for use on news articles. I tokenise the text ('Prime Minister Jacinda Ardern is from New Zealand')
into ('Prime','Minister','Jacinda','Ardern','is'...)
and run the NLTK POS tagger over it, giving: ...(('Jacinda','NNP'),('Ardern','NNP'),('is','VBZ')...
then combine words when the words that follow are also 'NNP' (proper nouns).
The goal is to count 'Prime Minister Jacinda Ardern' as 1 string rather than 4, then skip ahead in the loop by that many words so that the next strings are not 'Minister Jacinda Ardern' and then 'Jacinda Ardern'.
Context:
'text' is a list of (word, tag) tuples created by tokenising and then POS tagging my article, in the format: [...('She', 'PRP'), ('said', 'VBD'), ('the', 'DT'), ('roughly', 'RB'), ('25-minute', 'JJ'), ('meeting', 'NN')...]
'NNP' = proper noun or the names of places/people/organisations etc.
for (i) in range(len(text)):
    print(i)
    #initialising wordcounter as a variable
    wordcounter = 0
    # if text[i] is a Proper Noun, make namedEnt = the word.
    # then increase wordcounter by 1
    if text[i][1] == 'NNP':
        namedEnt = text[i][0]
        wordcounter +=1
        # while the next word in text is also a Proper Noun,
        # increase wordcounter by 1. Initialise J as = 1
        while text[i + wordcounter][1] == 'NNP':
            wordcounter +=1
            j = 1
            # While J is less than wordcounter, join text[i+j] to
            # namedEnt. Increase J by 1. When that is no longer
            # the case append namedEnt to a namedEntity list
            while j < wordcounter:
                namedEnt = ' '.join([namedEnt,text[i+j][0]])
                j += 1
        InitialNamedEntity.append(namedEnt)
    i += wordcounter
If I print(i) at the start of the loop, it goes up by 1 each time. When I print a Counter of the InitialNamedEntity list made up of the namedEnts, I get results like the following:
(...'New Zealand': 7, 'Zealand': 7, 'United': 4, 'Prime Minister Minister Jacinda Minister Jacinda Ardern': 3...)
So I'm not only getting double counts, as in 'New Zealand' and 'Zealand', but I'm also getting wacky results like 'Prime Minister Minister Jacinda Minister Jacinda Ardern'.
The results I would like would be ('New Zealand':7, 'United States':4,'Prime Minister Jacinda Ardern':3)
Any help would be greatly appreciated. Cheers
Upvotes: 1
Views: 75
Reputation: 23
Thanks for the help everyone. I used the while loop shown by Barmar:
i = 0
while i < len(text):
    i += wordcounter
and at the end used an if else statement:
if wordcounter > 0: i += wordcounter
else: i += 1
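Put together, the loop could look something like the sketch below. The function name group_proper_nouns and the sample list are made up for illustration, and the bounds check in the inner while is an addition that was not in my original code (without it the scan can run past the end of the list when the text ends on a proper noun). Treat it as one possible arrangement rather than the exact code I ended up with.

```python
def group_proper_nouns(tagged):
    """Join runs of consecutive 'NNP' tokens into single strings."""
    entities = []
    i = 0
    while i < len(tagged):
        wordcounter = 0
        # Count how many consecutive proper nouns start at position i,
        # checking the bound so the scan cannot run past the end of the list.
        while i + wordcounter < len(tagged) and tagged[i + wordcounter][1] == 'NNP':
            wordcounter += 1
        if wordcounter > 0:
            words = [word for word, tag in tagged[i:i + wordcounter]]
            entities.append(' '.join(words))
            i += wordcounter  # skip the whole run that was just joined
        else:
            i += 1  # not a proper noun, move on by one
    return entities


# Hypothetical hand-tagged sample, standing in for the NLTK output.
sample = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Jacinda', 'NNP'),
          ('Ardern', 'NNP'), ('is', 'VBZ'), ('from', 'IN'),
          ('New', 'NNP'), ('Zealand', 'NNP')]
print(group_proper_nouns(sample))
# ['Prime Minister Jacinda Ardern', 'New Zealand']
```

Passing the returned list to collections.Counter should then give one count per full name instead of counts for every suffix.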
Upvotes: 0
Reputation: 117
range() creates an iterable object. The for...in construct gets an iterator from it and calls next on that iterator, and each call returns the next value in the sequence. So your i variable is not an index that drives the loop, it's just the latest value produced by the iterator. Modifying i has no lasting effect; it is simply overwritten when the next value is retrieved.
This is very different from a loop like for (int i = 0; i < 5; i++) {} in C, where there is no concept of a sequence; that loop just checks whether i is less than five before executing the block.
Compare it to this:
for i in [2, -1, -4]:
    print(i)
    i = i + 2
Perhaps here it is more obvious that setting i will have no effect.
But you can write that C-like construct in Python too, as follows:
i = 0
while i < 6:
    print(i)
    if i == 2:
        i = i + 2
    else:
        i = i + 1
This prints
0 1 2 4 5
See how it didn't output 3? When it got to i == 2, it added 2 so it skipped over 3. You can do something similar in your code.
(these examples were Python 3)
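To make the "overwritten on every pass" point concrete, here is a rough hand-written equivalent of a for loop using iter() and next(). The function name for_loop_by_hand is made up for this illustration, and it is only an approximation of what the interpreter does, not its actual implementation.

```python
def for_loop_by_hand(iterable):
    """Roughly what `for i in iterable:` does, spelled out with iter() and next()."""
    seen = []
    iterator = iter(iterable)
    while True:
        try:
            i = next(iterator)  # i is rebound here on every pass
        except StopIteration:
            break               # the iterator is exhausted, so the loop ends
        seen.append(i)
        i = i + 2               # rebinds the name only; the iterator is unaffected
    return seen


print(for_loop_by_hand(range(5)))
# [0, 1, 2, 3, 4]
```

Every value still shows up, because the assignment to i never touches the iterator that decides what comes next.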
Upvotes: 1
Reputation: 781380
Don't use a for loop if you need to adjust how i is incremented, as it always sets it to the next value in the range. Use a while loop:
i = 0
while i < len(text):
    ...
    i += wordcounter
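If you would rather not manage the index at all, a different technique from the while loop above is itertools.groupby, which groups adjacent items that share a key. The sketch below is only an illustration under that swapped-in approach; the function name join_nnp_runs and the sample data are hypothetical.

```python
from itertools import groupby


def join_nnp_runs(tagged):
    """Join each run of adjacent 'NNP' tokens into one string."""
    entities = []
    # groupby yields one group per run of adjacent items with the same key:
    # True for 'NNP' tokens, False for everything else.
    for is_nnp, run in groupby(tagged, key=lambda pair: pair[1] == 'NNP'):
        if is_nnp:
            entities.append(' '.join(word for word, tag in run))
    return entities


# Hypothetical hand-tagged sample in the same (word, tag) format as text.
tagged_sample = [('Prime', 'NNP'), ('Minister', 'NNP'), ('Jacinda', 'NNP'),
                 ('Ardern', 'NNP'), ('is', 'VBZ'), ('from', 'IN'),
                 ('New', 'NNP'), ('Zealand', 'NNP')]
print(join_nnp_runs(tagged_sample))
# ['Prime Minister Jacinda Ardern', 'New Zealand']
```

Because each token lands in exactly one group, there is nothing to skip and no index that can run past the end of the list.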
Upvotes: 1