Reputation: 89
This is going to be a hard one to explain, but I will try my best.
So I have a text file, which is a paragraph. I have recently converted the paragraph to contain only unique words (no stop words). An example is shown here:
['mississippi worth reading about', ' commonplace river contrary ways remarkable', ' considering missouri main branch longest river world--four miles', ' seems safe crookedest river world part journey uses cover ground crow fly six seventy-five', ' discharges water st', ' lawrence twenty-five rhine thirty-eight thames', ' river vast drainage-basin draws water supply twenty-eight states territories delaware atlantic seaboard country idaho pacific slope spread forty-five degrees longitude', ' mississippi receives carries gulf water fifty-four subordinate rivers navigable steamboats hundreds navigable flats keels', ' area drainage-basin combined areas england wales scotland ireland france spain portugal germany austria italy turkey almost wide region fertile mississippi valley proper exceptionally so']
What I have done here is split the paragraph into sentences and remove any punctuation. I then put the sentences into a list.
So, for example, if the list is called temp and I run print(temp[0]), it will output this:
'mississippi worth reading about'
Fantastic. However, my next step, which I am stuck on, is creating a mini thesaurus using the cosine similarity equation, which maybe a few of you are familiar with.
However, first I want to create some profiles. I'll give one profile example, 'River'. In the temp list each element is a sentence. What I want to achieve is, for every sentence that contains the word river, a count of every other word in that sentence.
So if I had 'commonplace river contrary ways remarkable',
which is temp[1], the start of the dictionary using the count method would be:
{'commonplace': 1, 'river': 1, 'contrary': 1, 'ways': 1, 'remarkable': 1,}
The first look at the output would be:
river 1 (profile word)
commonplace: 1
contrary: 1
remarkable: 1
ways: 1
So, taking every sentence that contains river, this should be the final output:
river 4 (profile)
atlantic: 1
branch: 1
commonplace: 1
considering: 1
contrary: 1
country: 1
cover: 1
crookedest: 1
crow: 1
degrees: 1
delaware: 1
drainage-basin: 1
draws: 1
fly: 1
forty-five: 1
ground: 1
idaho: 1
journey: 1
longest: 1
longitude: 1
main: 1
missouri: 1
pacific: 1
part: 1
remarkable: 1
safe: 1
seaboard: 1
seems: 1
seventy-five: 1
six: 1
slope: 1
spread: 1
states: 1
supply: 1
territories: 1
twenty-eight: 1
uses: 1
vast: 1
water: 1
ways: 1
I'm not sure if it's better to just have one flat collection of unique words instead of a list where each element is a sentence of unique words. For example, this is a set of the words from the first list:
{'austria', 'fortyfive', 'fiftyfour', 'longest', 'vast', 'almost', 'states', 'region', 'commonplace', 'wide', 'flats', 'main', 'longitude', 'part', 'gulf', 'st', 'contrary', 'missouri', 'pacific', 'hundreds', 'area', 'areas', 'turkey', 'discharges', 'twentyeight', 'fly', 'worth', 'thirtyeight', 'valley', 'seaboard', 'wales', 'ireland', 'ways', 'uses', 'scotland', 'ground', 'river', 'steamboats', 'seventyfive', 'territories', 'safe', 'degrees', 'twentyfive', 'england', 'thames', 'subordinate', 'drainagebasin', 'water', 'considering', 'fertile', 'rivers', 'spread', 'reading', 'combined', 'seems', 'france', 'crookedest', 'drainagebasin:', 'supply', 'rhine', 'portugal', 'six', 'slopea', 'draws', 'exceptionally', 'mississippi', 'idaho', 'worldfour', 'atlantic', 'italy', 'spain', 'receives', 'cover', 'remarkable', 'germany', 'crow', 'delaware', 'country', 'branch', 'carries', 'proper', 'lawrence', 'journey', 'keels', 'navigable'}
I'm sorry if this is a bad explanation, but it's hard for me to explain. This is the hurdle that is preventing me from using the cosine similarity equation.
Thanks,
EDIT:
unique words only set:
{'remarkable', 'six', 'part', 'navigable', 'england', 'areas', 'worth', 'ways', 'longest', 'lawrence', 'journey', 'longitude', 'austria', 'rivers', 'st', 'crow', 'pacific', 'thirty-eight', 'gulf', 'ireland', 'drainage-basin', 'delaware', 'spread', 'proper', 'subordinate', 'territories', 'germany', 'cover', 'fifty-four', 'slope--a', 'fertile', 'degrees', 'wales', 'seems', 'exceptionally', 'water', 'italy', 'fly', 'missouri', 'turkey', 'atlantic', 'flats', 'hundreds', 'world--four', 'branch', 'twenty-eight', 'main', 'spain', 'receives', 'keels', 'states', 'portugal', 'draws', 'almost', 'contrary', 'seaboard', 'safe', 'mississippi', 'idaho', 'scotland', 'steamboats', 'france', 'valley', 'twenty-five', 'carries', 'wide', 'crookedest', 'area', 'reading', 'rhine', 'discharges', 'uses', 'commonplace', 'combined', 'considering', 'seventy-five', 'river', 'region', 'forty-five', 'ground', 'country', 'vast', 'thames', 'supply'}
My attempt:
for i in unique:
    kw = i
    count_word = [i for i in temp for j in i.split() if j == kw]
    count_dict = {j: i.count(j) for i in count_word for j in i.split() if j != kw}
    print(kw)
    for a, c in sorted(count_dict.items(), key=lambda x: x[0]):
        print('{}: {}'.format(a, c))
    print()
Upvotes: 1
Views: 70
Reputation: 4606
For this we could designate kw (keyword) as river. Then we can use a list comprehension to grab all of the items that contain this kw. Note that some sentences contain rivers, so a substring test using kw in will not work, because it would also match rivers; instead we compare kw against each word from i.split(). From here we can construct a dictionary using a dictionary comprehension: j represents each word in i.split(), and i.count(j) represents the count of each word in each item. We also throw in if j != kw so we don't include river in our dictionary. Finally, we can print using for k, v in dicta.items(), and if we want we can add sorting to get our results in alphabetical order.
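kw = 'river'
# Keep every sentence that contains kw as a whole word
lista = [i for i in temp for j in i.split() if j == kw]
# Map each other word in those sentences to its count within the sentence
dicta = {j: i.count(j) for i in lista for j in i.split() if j != kw}
# Print the profile in alphabetical order
for k, v in sorted(dicta.items(), key=lambda x: x[0]):
    print('{}: {}'.format(k, v))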
atlantic: 1
branch: 1
commonplace: 1
considering: 1
contrary: 1
country: 1
...
twenty-eight: 1
uses: 1
vast: 1
water: 1
ways: 1
world: 1
world--four: 1
Expanded loops:
lista = []
for i in temp:
    for j in i.split():
        if j == kw:
            lista.append(i)
dicta = {}
for i in lista:
    for j in i.split():
        if j != kw:  # skip the keyword itself, as in the comprehension
            dicta[j] = i.count(j)
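One caveat with the code above: i.count(j) counts substring matches inside the sentence string (so 'st' would also be counted inside 'steamboats'), and each sentence overwrites the previous count for a word rather than adding to it. If you want totals across all matching sentences, here is a minimal sketch using collections.Counter. The helper name build_profile is just something I made up for illustration, and the temp list is a shortened sample from the question. It assumes a "profile" means the summed counts of every other word that shares a sentence with kw.

```python
from collections import Counter


def build_profile(sentences, kw):
    """Hypothetical helper: count the words that share a sentence with kw."""
    profile = Counter()
    hits = 0  # number of sentences that contain kw as a whole word
    for sentence in sentences:
        words = sentence.split()
        if kw in words:  # list membership, so 'rivers' does not match 'river'
            hits += 1
            # Add this sentence's words to the running totals, skipping kw itself
            profile.update(w for w in words if w != kw)
    return hits, profile


# Shortened sample taken from the question's list
temp = [
    'mississippi worth reading about',
    ' commonplace river contrary ways remarkable',
    ' considering missouri main branch longest river world--four miles',
    ' mississippi receives carries gulf water fifty-four subordinate rivers'
    ' navigable steamboats hundreds navigable flats keels',
]

hits, profile = build_profile(temp, 'river')
print('river', hits)
for word, count in sorted(profile.items()):
    print('{}: {}'.format(word, count))
```

Because the counts come from the split word list rather than the raw string, a word that appears twice in a sentence (like navigable in the last sample sentence) is counted twice, and partial matches are ignored.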
Additional Request:
# Read the entire file into one variable as a string
all_words = 'some string'
all_words = all_words.split()
unique = set(all_words)
for i in unique:
    kw = i
    # temp = list of sentences to check against
    # rest of existing code
    # maybe instead of printing the final statement, append the dictionaries created to a list
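To tie this back to the cosine similarity goal in the question, here is a rough sketch of how the per-word profiles could be collected into one dictionary and then compared. This is only an assumption about what you want: build_profiles and cosine_similarity are hypothetical helper names, the profiles are summed co-occurrence counts, and the similarity is the standard dot product divided by the product of the vector lengths.

```python
import math
from collections import Counter


def build_profiles(sentences):
    """Hypothetical helper: map every word to a Counter of its co-occurring words."""
    profiles = {}
    for sentence in sentences:
        words = sentence.split()
        for kw in set(words):  # visit each distinct word of the sentence once
            profiles.setdefault(kw, Counter()).update(
                w for w in words if w != kw
            )
    return profiles


def cosine_similarity(a, b):
    """Cosine of the angle between two count profiles (dict-like objects)."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty profile has no direction, so treat it as dissimilar
    return dot / (norm_a * norm_b)


# Shortened sample taken from the question's list
temp = [
    ' commonplace river contrary ways remarkable',
    ' considering missouri main branch longest river world--four miles',
]

profiles = build_profiles(temp)
score = cosine_similarity(profiles['commonplace'], profiles['contrary'])
print('commonplace vs contrary: {:.2f}'.format(score))
```

Keeping all the profiles in a dictionary keyed by word means you only pass over the sentences once, and any two words can then be compared by looking up their profiles.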
Upvotes: 1