Reputation: 51
Say I have a double list in python [[],[]]
:
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"],
["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
I want to count how many times doublelist[0][0] & doublelist[1][0] = all, the
appear in the dual list. With the second [] being the index.
For example you see one count of it at doublelist[0][0] doublelist[1][0]
and another at doublelist[0][6] doublelist[1][6]
.
What code would I use in Python 3 to iterate through doublelist[i][i]
grab each value set ex. [["all"],["the"]]
and also an integer value for how many times that value set exists in the list.
Ideally I'd like to output it to a triple list triplelist[[i],[i],[i]]
that contains the [i][i]
value and the integer in the third [i]
.
Example code:
for i in triplelist[0]:
print(triplelist[0][i])
print(triplelist[1][i])
print(triplelist[2][i])
Output:
>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1
etc...
Also it would preferably skip duplicates so there wouldn't be 2 indexes in the list where [i][i][i] = [[all],[the],[2]]
since there are 2 instances in the original list ([0][0] [1][0] & [0][6] [1][6]). I just want all unique dual sets of words and the amount of times they appear in the original text.
The purpose of the code is to see how often one word follows another word in a given text. It's for building essentially a smart Markov Chain Generator that weights word values. I already have the code to break the text into a dual list that contains the word in the first list and the following word in the second list for this purpose.
Here is my current code for reference (the problem is after I initialize wordlisttriple, I don't know how to make it do what I described above after that):
#import
import re #for regex expression below
#main
with open("text.txt") as rawdata: #open text file and create a datastream
rawtext = rawdata.read() #read through the stream and create a string containing the text
rawdata.close() #close the datastream
rawtext = rawtext.replace('\n', ' ') #remove newline characters from text
rawtext = rawtext.replace('\r', ' ') #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ') #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M) #regex pattern for grabbing everthing before a sentence ending punctuation
sentencelist = [] #initialize list for sentences in text
sentencelist = pat.findall(rawtext) #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = [] #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist): #enumerate through the sentence list
sentenceindex = int(index) #get the index for below operation
firstword = sentencelist[sentenceindex].split(' ')[0] #use split to only grab the first word in each sentence
firstwordlist.append(firstword) #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ') #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ') #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ') #break up punctuation so they are not considered part of words
sentencelistforwords = [] #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext) #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = [] #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelist): #enumerate through sentence list
sentenceindex = int(index) #grab the index for below operation
words = sentencelist[sentenceindex].split(' ') #split up the words in each sentence so we have a nested lists that contain each word in each sentence
wordsinsentencelist.append(words) #append above described to the list
wordlist = [] #initialize list of all words
wordlist = rawtext.split(' ') #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist)) #use filter to get rid of empty strings in the list
wordlistdouble = [[], []] #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist): #enumerate through word list
if(int(index) < int(len(wordlist))-1): #only go to 1 before the end of list so we don't get an index out of bounds error
wordlistindex1 = int(index) #grab index for first word
wordlistindex2 = int(index)+1 #grab index for following word
wordlistdouble[0].append(wordlist[wordlistindex1]) #append first word to first list of word list double
wordlistdouble[1].append(wordlist[wordlistindex2]) #append following word to second list of word list double
wordlisttriple = [[], [], []] #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]): #enumerate through word list double
word1 = wordlistdouble[0][index] #grab word at first list of word list double at the current index
word2 = wordlistdouble[1][index] #grab word at second list of word list double at the current index
count = 0 #initialize word double data set counter
wordlisttriple[0].append(word1) #these need to be encapsulated in some kind of loop/if/for idk
wordlisttriple[1].append(word2) #these need to be encapsulated in some kind of loop/if/for idk
wordlisttriple[2].append(count) #these need to be encapsulated in some kind of loop/if/for idk
#for index, unit1 in enumerate(wordlistdouble[0]):
#if(wordlistdouble[0][int(index)] == word1 && wordlistdouble[1][int(index)+1] == word2):
#count++
#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them
Any advice would be greatly appreciated. If I'm going about this the wrong way and there is an easier method to accomplish the same thing, that would also be amazing. Thank you!
Upvotes: 2
Views: 407
Reputation: 96349
So, originally I was going to go with a straightforward approach to generating ngrams:
>>> from collections import Counter
>>> from itertools import chain, islice
>>> from pprint import pprint
>>> def ngram_generator(token_sequence, order):
... for i in range(len(token_sequence) + 1 - order):
... yield tuple(token_sequence[i: i + order])
...
>>> counts = Counter(chain.from_iterable(ngram_generator(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
('the', 'big'): 2,
('chicken', 'all'): 2,
('eat', 'paste'): 2,
('the', 'small'): 2,
('kids', 'eat'): 2,
('dogs', 'eat'): 2,
('eat', 'chicken'): 2,
('small', 'kids'): 2,
('big', 'dogs'): 2,
('paste', 'lumps'): 1})
But I got inspired by niemmi to write what seems like a more efficient approach, than is again, generalizable to higher order ngrams:
>>> def efficient_ngrams(tokens_sequence, n):
... iterators = []
... for i in range(n):
... it = iter(tokens_sequence)
... tuple(islice(it, 0, i))
... iterators.append(it)
... yield from zip(*iterators)
...
So, observe:
>>> pprint(list(efficient_ngrams(doublelist[0], 1)))
[('all',),
('the',),
('big',),
('dogs',),
('eat',),
('chicken',),
('all',),
('the',),
('small',),
('kids',),
('eat',),
('paste',)]
>>> pprint(list(efficient_ngrams(doublelist[0], 2)))
[('all', 'the'),
('the', 'big'),
('big', 'dogs'),
('dogs', 'eat'),
('eat', 'chicken'),
('chicken', 'all'),
('all', 'the'),
('the', 'small'),
('small', 'kids'),
('kids', 'eat'),
('eat', 'paste')]
>>> pprint(list(efficient_ngrams(doublelist[0], 3)))
[('all', 'the', 'big'),
('the', 'big', 'dogs'),
('big', 'dogs', 'eat'),
('dogs', 'eat', 'chicken'),
('eat', 'chicken', 'all'),
('chicken', 'all', 'the'),
('all', 'the', 'small'),
('the', 'small', 'kids'),
('small', 'kids', 'eat'),
('kids', 'eat', 'paste')]
>>>
And of course, it still works for what you want to accomplish:
>>> counts = Counter(chain.from_iterable(efficient_ngrams(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
('the', 'big'): 2,
('chicken', 'all'): 2,
('eat', 'paste'): 2,
('the', 'small'): 2,
('kids', 'eat'): 2,
('dogs', 'eat'): 2,
('eat', 'chicken'): 2,
('small', 'kids'): 2,
('big', 'dogs'): 2,
('paste', 'lumps'): 1})
>>>
Upvotes: 1
Reputation: 17273
Assuming you already have the text parsed to list of words you can just create iterator that starts from second word, zip
it with words and run it through Counter
:
from collections import Counter
words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)
print(*Counter(zip(words, nxt)).items(), sep='\n')
Output:
(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)
In above nxt
is an iterator that iterates over the word list. Since we want it to start from second word we pull one word out with next
before using it:
>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']
Then we pass the original list and iterator to zip
that will return iterable of tuples where each tuple has one item from both:
>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]
Finally the output from zip
is passed to Counter
that counts each pair and returns dict
like object where keys are pairs and values are counts:
>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
Upvotes: 5
Reputation: 6782
If you are looking for only all and the word ,this could be helpful to you.
Code :
from collections import Counter
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
for i in range(len(doublelist)):
count = Counter(doublelist[i])
print "List {} - all = {},the = {}".format(i,count['all'],count['the'])
Output :
List 0 - all = 2,the = 2
List 1 - all = 1,the = 2
Upvotes: 0