Ali

Reputation: 51

Counting the number of times a unique data double appears in a double list (Python 3)

Say I have a double list in Python ([[], []]):

doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], 
              ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]

I want to count how many times the pair doublelist[0][i] & doublelist[1][i] = "all", "the" appears in the double list, with the second [] being the index.

For example, one occurrence is at doublelist[0][0] / doublelist[1][0] and another at doublelist[0][6] / doublelist[1][6].

What code would I use in Python 3 to iterate through the pairs doublelist[0][i], doublelist[1][i], grab each value set (e.g. [["all"], ["the"]]), and also get an integer value for how many times that value set exists in the list?

Ideally I'd like to output it to a triple list triplelist = [[], [], []] that contains the word pair in the first two lists and the integer count in the third.

Example code:

for i in range(len(triplelist[0])):
    print(triplelist[0][i])
    print(triplelist[1][i])
    print(triplelist[2][i])

Output:

>"all"
>"the"
>2
>"the"
>"big"
>1
>"big"
>"dogs"
>1

etc...

Also, it would preferably skip duplicates, so there wouldn't be two indexes in the list where [i][i][i] = [["all"], ["the"], [2]], since there are two instances of that pair in the original list ([0][0] [1][0] & [0][6] [1][6]). I just want all unique pairs of words and the number of times they appear in the original text.

The purpose of the code is to see how often one word follows another in a given text; essentially, it's for building a smart Markov chain generator that weights word values. I already have the code to break the text into a dual list that contains each word in the first list and the word that follows it in the second list.
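Conceptually, I imagine something like this sketch with collections.Counter might be the shape I'm after (the tiny wordlistdouble here is just a made-up example, and I'm not sure Counter is even the right tool):

```python
from collections import Counter

#made-up miniature version of my dual list: words and the words that follow them
wordlistdouble = [["all", "the", "all"], ["the", "big", "the"]]

#count each (word, following word) pair
paircounts = Counter(zip(wordlistdouble[0], wordlistdouble[1]))

#unpack into the triple-list shape described above
triplelist = [[], [], []]
for (word1, word2), count in paircounts.items():
    triplelist[0].append(word1)
    triplelist[1].append(word2)
    triplelist[2].append(count)
```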

Here is my current code for reference (the problem is that after I initialize wordlisttriple, I don't know how to make it do what I described above):

#import
import re #for regex expression below

#main
with open("text.txt") as rawdata:    #open text file and create a datastream
    rawtext = rawdata.read()    #read through the stream and create a string containing the text
#the with statement closes the file automatically, so no explicit close() is needed
rawtext = rawtext.replace('\n', ' ')    #remove newline characters from text
rawtext = rawtext.replace('\r', ' ')    #remove newline characters from text
rawtext = rawtext.replace('--', ' -- ')    #break up blah--blah words so it can read 2 separate words blah -- blah
pat = re.compile(r'([A-Z][^\.!?]*[\.!?])', re.M)    #regex pattern for grabbing everything before a sentence ending punctuation
sentencelist = []    #initialize list for sentences in text
sentencelist = pat.findall(rawtext)    #apply regex pattern to string to create a list of all the sentences in the text
firstwordlist = []    #initialize the list for the first word in each sentence
for index, firstword in enumerate(sentencelist):    #enumerate through the sentence list
    sentenceindex = int(index)    #get the index for below operation
    firstword = sentencelist[sentenceindex].split(' ')[0]    #use split to only grab the first word in each sentence
    firstwordlist.append(firstword)    #append each sentence starting word to first word list
rawtext = rawtext.replace(', ', ' , ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('. ', ' . ')    #break up punctuation so they are not considered part of words
rawtext = rawtext.replace('"', ' " ')    #break up punctuation so they are not considered part of words
sentencelistforwords = []    #initialize sentence list for parsing words
sentencelistforwords = pat.findall(rawtext)    #run the regex pattern again this time with the punctuation broken up by spaces
wordsinsentencelist = []    #initialize list for all of the words that appear in each sentence
for index, words in enumerate(sentencelist):    #enumerate through sentence list
    sentenceindex = int(index)    #grab the index for below operation
    words = sentencelist[sentenceindex].split(' ')    #split up the words in each sentence so we have a nested lists that contain each word in each sentence
    wordsinsentencelist.append(words)    #append above described to the list
wordlist = []    #initialize list of all words
wordlist = rawtext.split(' ')    #create list of all words by splitting the entire text by spaces
wordlist = list(filter(None, wordlist))    #use filter to get rid of empty strings in the list
wordlistdouble = [[], []]    #initialize the word list double to contain words and the words that follow them in sentences
for index, word in enumerate(wordlist):    #enumerate through word list
    if(int(index) < int(len(wordlist))-1):    #only go to 1 before the end of list so we don't get an index out of bounds error
        wordlistindex1 = int(index)    #grab index for first word
        wordlistindex2 = int(index)+1    #grab index for following word
        wordlistdouble[0].append(wordlist[wordlistindex1])    #append first word to first list of word list double
        wordlistdouble[1].append(wordlist[wordlistindex2])    #append following word to second list of word list double
wordlisttriple = [[], [], []]    #initialize word list triple
for index, unit in enumerate(wordlistdouble[0]):    #enumerate through word list double
    word1 = wordlistdouble[0][index]    #grab word at first list of word list double at the current index
    word2 = wordlistdouble[1][index]    #grab word at second list of word list double at the current index
    count = 0    #initialize word double data set counter
    wordlisttriple[0].append(word1)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[1].append(word2)    #these need to be encapsulated in some kind of loop/if/for idk
    wordlisttriple[2].append(count)    #these need to be encapsulated in some kind of loop/if/for idk
    #for index2, unit1 in enumerate(wordlistdouble[0]):
        #if wordlistdouble[0][index2] == word1 and wordlistdouble[1][index2] == word2:
            #count += 1

#sentencelist = list of all sentences
#firstwordlist = list of words that start sentencelist
#sentencelistforwords = list of all sentences mutated for ease of extracting words
#wordsinsentencelist = list of lists containing all of the words in each sentence
#wordlist = list of all words
#wordlistdouble = dual list of all words plus the words that follow them

Any advice would be greatly appreciated. If I'm going about this the wrong way and there is an easier method to accomplish the same thing, that would also be amazing. Thank you!

Upvotes: 2

Views: 407

Answers (3)

juanpa.arrivillaga

Reputation: 96349

So, originally I was going to go with a straightforward approach to generating ngrams:

>>> from collections import Counter
>>> from itertools import chain, islice
>>> from pprint import pprint
>>> def ngram_generator(token_sequence, order):
...     for i in range(len(token_sequence) + 1 - order):
...         yield tuple(token_sequence[i: i + order])
...
>>> counts = Counter(chain.from_iterable(ngram_generator(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})

But I got inspired by niemmi to write what seems like a more efficient approach that is, again, generalizable to higher-order ngrams:

>>> def efficient_ngrams(tokens_sequence, n):
...     iterators = []
...     for i in range(n):
...         it = iter(tokens_sequence)
...         tuple(islice(it, 0, i))
...         iterators.append(it)
...     yield from zip(*iterators)
...

So, observe:

>>> pprint(list(efficient_ngrams(doublelist[0], 1)))
[('all',),
 ('the',),
 ('big',),
 ('dogs',),
 ('eat',),
 ('chicken',),
 ('all',),
 ('the',),
 ('small',),
 ('kids',),
 ('eat',),
 ('paste',)]
>>> pprint(list(efficient_ngrams(doublelist[0], 2)))
[('all', 'the'),
 ('the', 'big'),
 ('big', 'dogs'),
 ('dogs', 'eat'),
 ('eat', 'chicken'),
 ('chicken', 'all'),
 ('all', 'the'),
 ('the', 'small'),
 ('small', 'kids'),
 ('kids', 'eat'),
 ('eat', 'paste')]
>>> pprint(list(efficient_ngrams(doublelist[0], 3)))
[('all', 'the', 'big'),
 ('the', 'big', 'dogs'),
 ('big', 'dogs', 'eat'),
 ('dogs', 'eat', 'chicken'),
 ('eat', 'chicken', 'all'),
 ('chicken', 'all', 'the'),
 ('all', 'the', 'small'),
 ('the', 'small', 'kids'),
 ('small', 'kids', 'eat'),
 ('kids', 'eat', 'paste')]
>>>

And of course, it still works for what you want to accomplish:

>>> counts = Counter(chain.from_iterable(efficient_ngrams(sub, 2) for sub in doublelist))
>>> pprint(counts)
Counter({('all', 'the'): 3,
         ('the', 'big'): 2,
         ('chicken', 'all'): 2,
         ('eat', 'paste'): 2,
         ('the', 'small'): 2,
         ('kids', 'eat'): 2,
         ('dogs', 'eat'): 2,
         ('eat', 'chicken'): 2,
         ('small', 'kids'): 2,
         ('big', 'dogs'): 2,
         ('paste', 'lumps'): 1})
>>>

Upvotes: 1

niemmi

Reputation: 17273

Assuming you already have the text parsed to a list of words, you can just create an iterator that starts from the second word, zip it with the words, and run it through Counter:

from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)

print(*Counter(zip(words, nxt)).items(), sep='\n')

Output:

(('big', 'dogs'), 1)
(('kids', 'eat'), 1)
(('small', 'kids'), 1)
(('the', 'big'), 1)
(('dogs', 'eat'), 1)
(('eat', 'paste'), 1)
(('all', 'the'), 2)
(('chicken', 'all'), 1)
(('paste', 'lumps'), 1)
(('eat', 'chicken'), 1)
(('the', 'small'), 1)

In the above, nxt is an iterator over the word list. Since we want it to start from the second word, we pull one word out with next before using it:

>>> nxt = iter(words)
>>> next(nxt)
'all'
>>> list(nxt)
['the', 'big', 'dogs', 'eat', 'chicken', 'all', 'the', 'small', 'kids', 'eat', 'paste', 'lumps']

Then we pass the original list and the iterator to zip, which returns an iterable of tuples where each tuple has one item from each:

>>> list(zip(words, nxt))
[('all', 'the'), ('the', 'big'), ('big', 'dogs'), ('dogs', 'eat'), ('eat', 'chicken'), ('chicken', 'all'), ('all', 'the'), ('the', 'small'), ('small', 'kids'), ('kids', 'eat'), ('eat', 'paste'), ('paste', 'lumps')]

Finally, the output from zip is passed to Counter, which counts each pair and returns a dict-like object where the keys are pairs and the values are counts:

>>> Counter(zip(words, nxt))
Counter({('all', 'the'): 2, ('eat', 'chicken'): 1, ('big', 'dogs'): 1, ('small', 'kids'): 1, ('kids', 'eat'): 1, ('paste', 'lumps'): 1, ('chicken', 'all'): 1, ('dogs', 'eat'): 1, ('the', 'big'): 1, ('the', 'small'): 1, ('eat', 'paste'): 1})
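
If you then need the result in the triple-list shape from the question, the Counter can be unpacked into it (a sketch, using the same words list as above):

```python
from collections import Counter

words = ["all", "the", "big", "dogs", "eat", "chicken", "all",
         "the", "small", "kids", "eat", "paste", "lumps"]
nxt = iter(words)
next(nxt, None)    #advance the iterator so it starts at the second word
counts = Counter(zip(words, nxt))

#unpack the Counter into [first words, following words, counts]
triplelist = [[], [], []]
for (word1, word2), count in counts.items():
    triplelist[0].append(word1)
    triplelist[1].append(word2)
    triplelist[2].append(count)
```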

Upvotes: 5

Shivkumar kondi

Reputation: 6782

If you are looking for counts of only the words all and the, this could be helpful to you.

Code :

from collections import Counter
doublelist = [["all", "the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste"], ["the", "big", "dogs", "eat", "chicken", "all", "the", "small", "kids", "eat", "paste", "lumps"]]
for i in range(len(doublelist)):
    count = Counter(doublelist[i])
    print("List {} - all = {},the = {}".format(i, count['all'], count['the']))

Output :

List 0 - all = 2,the = 2
List 1 - all = 1,the = 2

Upvotes: 0
