Creating profiles and using count for dictionary

Question

This is going to be a hard one to explain, but I will try my best.

I have a text file that contains a paragraph. I have converted the paragraph so that it contains only unique words (stop words removed). An example is shown here:

['mississippi worth reading about', ' commonplace river contrary ways remarkable', ' considering missouri main branch longest river world--four miles', ' seems safe crookedest river world part journey uses cover ground crow fly six seventy-five', ' discharges water st', ' lawrence twenty-five rhine thirty-eight thames', ' river vast drainage-basin draws water supply twenty-eight states territories delaware atlantic seaboard country idaho pacific slope spread forty-five degrees longitude', ' mississippi receives carries gulf water fifty-four subordinate rivers navigable steamboats hundreds navigable flats keels', ' area drainage-basin combined areas england wales scotland ireland france spain portugal germany austria italy turkey almost wide region fertile mississippi valley proper exceptionally so']

What I have done here is split the paragraph into sentences, remove the punctuation, and put the sentences into a list.

For example, the list is called temp, and print(temp[0]) outputs this:

'mississippi worth reading about'

Fantastic. However, I am stuck on my next step: I'm trying to create a mini thesaurus using the cosine similarity equation, which a few of you may be familiar with.

First, though, I want to create some profiles. I'll use 'river' as an example profile. In the temp list each element is a sentence. For every sentence that contains the word river, I want to create a count of every other word in that sentence.

So if I had 'commonplace river contrary ways remarkable', which is temp[1], the start of the dictionary using the count method would be:

{'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1,}

A first look at the output would be:

river 1 (profile word)
   commonplace: 1
   contrary: 1
   remarkable: 1
   ways: 1

So, taken over every sentence that contains river, this should be the final output:

river 4 (profile)
    atlantic: 1
    branch: 1
    commonplace: 1
    considering: 1
    contrary: 1
    country: 1
    cover: 1
    crookedest: 1
    crow: 1
    degrees: 1
    delaware: 1
    drainage-basin: 1
    draws: 1
    fly: 1
    forty-five: 1
    ground: 1
    idaho: 1
    journey: 1
    longest: 1
    longitude: 1
    main: 1
    missouri: 1
    pacific: 1
    part: 1
    remarkable: 1
    safe: 1
    seaboard: 1
    seems: 1
    seventy-five: 1
    six: 1
    slope: 1
    spread: 1
    states: 1
    supply: 1
    territories: 1
    twenty-eight: 1
    uses: 1
    vast: 1
    water: 1
    ways: 1

I'm not sure whether it would be better to have one flat set of unique words instead of a list in which each element is a sentence of unique words. For example, this is a set of the words from the list above:

{'austria', 'fortyfive', 'fiftyfour', 'longest', 'vast', 'almost', 'states', 'region', 'commonplace', 'wide', 'flats', 'main', 'longitude', 'part', 'gulf', 'st', 'contrary', 'missouri', 'pacific', 'hundreds', 'area', 'areas', 'turkey', 'discharges', 'twentyeight', 'fly', 'worth', 'thirtyeight', 'valley', 'seaboard', 'wales', 'ireland', 'ways', 'uses', 'scotland', 'ground', 'river', 'steamboats', 'seventyfive', 'territories', 'safe', 'degrees', 'twentyfive', 'england', 'thames', 'subordinate', 'drainagebasin', 'water', 'considering', 'fertile', 'rivers', 'spread', 'reading', 'combined', 'seems', 'france', 'crookedest', 'drainagebasin:', 'supply', 'rhine', 'portugal', 'six', 'slopea', 'draws', 'exceptionally', 'mississippi', 'idaho', 'worldfour', 'atlantic', 'italy', 'spain', 'receives', 'cover', 'remarkable', 'germany', 'crow', 'delaware', 'country', 'branch', 'carries', 'proper', 'lawrence', 'journey', 'keels', 'navigable'}

I'm sorry if this is a bad explanation, but it's hard for me to explain. This is the hurdle that is preventing me from using the cosine similarity equation.

Thanks,

EDIT:

unique words only set:

{'remarkable', 'six', 'part', 'navigable', 'england', 'areas', 'worth', 'ways', 'longest', 'lawrence', 'journey', 'longitude', 'austria', 'rivers', 'st', 'crow', 'pacific', 'thirty-eight', 'gulf', 'ireland', 'drainage-basin', 'delaware', 'spread', 'proper', 'subordinate', 'territories', 'germany', 'cover', 'fifty-four', 'slope--a', 'fertile', 'degrees', 'wales', 'seems', 'exceptionally', 'water', 'italy', 'fly', 'missouri', 'turkey', 'atlantic', 'flats', 'hundreds', 'world--four', 'branch', 'twenty-eight', 'main', 'spain', 'receives', 'keels', 'states', 'portugal', 'draws', 'almost', 'contrary', 'seaboard', 'safe', 'mississippi', 'idaho', 'scotland', 'steamboats', 'france', 'valley', 'twenty-five', 'carries', 'wide', 'crookedest', 'area', 'reading', 'rhine', 'discharges', 'uses', 'commonplace', 'combined', 'considering', 'seventy-five', 'river', 'region', 'forty-five', 'ground', 'country', 'vast', 'thames', 'supply'}

My attempt:

for i in unique:
    kw = i
    count_word = [i for i in temp for j in i.split() if j == kw]
    count_dict = {j: i.count(j) for i in count_word for j in i.split() if j != kw}
    print(kw)
    for a, c in sorted(count_dict.items(), key=lambda x: x[0]):
        print('{}: {}'.format(a, c))
    print()

vash_the_stampede · Accepted Answer

For this we can designate kw (keyword) as 'river' and use a list comprehension to grab every item that contains kw as a whole word. Note that some sentences contain 'rivers', so a plain kw in i test will not work; instead we compare each word from i.split() against kw.

From here we can construct a dictionary with a dictionary comprehension: j represents each word in i.split(), and i.count(j) represents the count of that word in the item. We also add if j != kw so that river itself is not included in the result.

Finally, we can print using for k, v in dicta.items(), and sort the items if we want the results in alphabetical order.

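kw = 'river'
# keep every sentence that contains kw as a whole word
lista = [i for i in temp for j in i.split() if j == kw]
# count each co-occurring word, skipping kw itself
dicta = {j: i.count(j) for i in lista for j in i.split() if j != kw}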

for k, v in sorted(dicta.items(), key=lambda x: x[0]):
    print('{}: {}'.format(k, v))

atlantic: 1
branch: 1
commonplace: 1
considering: 1
contrary: 1
country: 1
...
twenty-eight: 1
uses: 1
vast: 1
water: 1
ways: 1
world: 1
world--four: 1

Expanded loops:

lista = []
for i in temp:
    for j in i.split():
        if j == kw:
            lista.append(i)

dicta = {}
for i in lista:
    for j in i.split():
        if j != kw:  # same filter as in the comprehension
            dicta[j] = i.count(j)
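
One caveat with the loops above: i.count(j) counts substring occurrences in the sentence string, and assigning dicta[j] keeps only the value from the last sentence seen, so counts are not summed across sentences. A minimal sketch of an alternative, assuming collections.Counter is acceptable here. The names build_profile and sample are hypothetical and only for illustration; they are not part of the original answer.

```python
from collections import Counter


def build_profile(sentences, keyword):
    """Return (number of matching sentences, Counter of co-occurring words)."""
    profile = Counter()
    matches = 0
    for sentence in sentences:
        words = sentence.split()
        # Whole-word membership test, so 'rivers' does not match 'river'.
        if keyword in words:
            matches += 1
            # Counter.update adds to existing counts instead of overwriting.
            profile.update(w for w in words if w != keyword)
    return matches, profile


# Hypothetical sample, shortened from the sentences in the question.
sample = [
    'commonplace river contrary ways remarkable',
    ' seems safe crookedest river world part journey',
    ' mississippi receives carries gulf water fifty-four subordinate rivers',
    ' river vast drainage-basin draws water supply',
]

matches, profile = build_profile(sample, 'river')
print('river', matches)
for word, count in sorted(profile.items()):
    print('    {}: {}'.format(word, count))
```

The printed layout mirrors the "river 4 (profile)" format from the question; whether the sentence count or something else belongs on the first line is a guess.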

Additional Request:

Read the entire file into one variable as a string:

all_words = 'some string'
all_words = all_words.split()
unique = set(all_words)

profiles = []
for kw in unique:
    # temp is the list of sentences to check against
    lista = [i for i in temp for j in i.split() if j == kw]
    dicta = {j: i.count(j) for i in lista for j in i.split() if j != kw}
    # maybe, instead of printing, append each dictionary to a list
    profiles.append((kw, dicta))
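
Once there is one profile per word, the cosine similarity step from the question can be tried out. The sketch below is only one possible way to do it. It assumes the usual definition (dot product divided by the product of the vector lengths), since the question does not show the exact equation it has in mind. build_profiles, cosine_similarity and sample are hypothetical names for illustration.

```python
import math
from collections import Counter


def build_profiles(sentences):
    """Map each word to a Counter of the words that share a sentence with it."""
    profiles = {}
    for sentence in sentences:
        words = sentence.split()
        for kw in set(words):  # handle each distinct word once per sentence
            profile = profiles.setdefault(kw, Counter())
            profile.update(w for w in words if w != kw)
    return profiles


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse count vectors (dict-like)."""
    # Only words present in both profiles contribute to the dot product.
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile is treated as similar to nothing
    return dot / (norm_a * norm_b)


# Hypothetical sample, shortened from the sentences in the question.
sample = [
    'commonplace river contrary ways remarkable',
    'river vast drainage-basin draws water supply',
    'mississippi receives carries gulf water',
]

profiles = build_profiles(sample)
similarity = cosine_similarity(profiles['river'], profiles['water'])
print('river vs water: {:.3f}'.format(similarity))
```

Keeping the profiles in a dictionary keyed by word, rather than in a list, makes it easy to look up the two profiles being compared.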
