user1992989

Reputation: 85

Python dictionary not matching keys as desired

I have a dictionary, for example:

dict = { "wd":"well done", "lol":"laugh out loud"}

The problem is that if the text contains something like "lol?", it is not expanded at all. Below is the code I am using to replace the dictionary keys:

def contractions(text, contractions_dict=dict):
    for word in text.split():
        if word.lower() in contractions_dict:
            text = text.replace(word, contractions_dict[word.lower()])
    return text

The problem is caused by the missing space between 'lol' and '?'. How do I resolve this?

Update: after applying the suggestions, my code is as follows:

 dict1 = {
          "wd":"well done",
          "lol":"laugh out loud"
         }

 def contractions(text, contractions_dict=dict1):
     for key in contractions_dict:
         text = text.replace(key, contractions_dict[key])
     return text

 text = "lol?"
 text=contractions(text)
 print(text)

This works for the example above, but on long text it makes undesired replacements.

Example: lwhyear olduckwhyeahhnt lookingiaandteam effortato representhinking of whyear oldwhyear oldugh lwhyear olduckwhyeahhahandal seato

This is part of the result I am getting on my actual data. I need help.

Upvotes: 2

Views: 1639

Answers (5)

Eduardo Morales

Reputation: 823

Instead of checking whether each word of the text is in the dictionary, iterate through the dictionary and check whether each key occurs in the word. This is not recommended, though, because it uses nested loops.

def contractions(text, contractions_dict=dict):
    for word in text.split():
        for key in contractions_dict:
            if key in word:
                text = text.replace(word, contractions_dict[key])
    return text

Instead, you might want to just replace every occurrence of every key automatically using the replace method. str.replace finds and replaces every occurrence by itself, so there is no need to iterate over the text yourself.

def contractions(text, contractions_dict=dict):
    for key in contractions_dict:
        text = text.replace(key, contractions_dict[key])
    return text
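
One caveat with replacing every key across the whole text: str.replace matches substrings, so a key is also replaced when it sits inside a longer word, which may be what produces output like the one in the question's update. Below is a minimal sketch of a whole-word variant. It assumes the two-entry table from the question; the names contractions_dict and expand_whole_words are made up for illustration.

```python
import re

# Hypothetical lookup table; the name avoids shadowing the built-in dict.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}


def expand_whole_words(text, mapping=contractions_dict):
    """Replace each key only where it appears as a standalone word."""
    for key, value in mapping.items():
        # \b is a word boundary, so "lol" matches in "lol?" but not in "lollipop".
        pattern = r"\b" + re.escape(key) + r"\b"
        # A callable replacement keeps the value literal (no backslash escapes).
        text = re.sub(pattern, lambda match, value=value: value, text)
    return text


# Plain substring replacement also rewrites the start of "lollipop".
print("lollipop lol?".replace("lol", "laugh out loud"))
# The whole-word version leaves "lollipop" alone.
print(expand_whole_words("lollipop lol?"))
```

Note that this still loops over the keys, so text inserted by one replacement can be matched by a later key if the values themselves contain keys.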

Upvotes: 1

Mehrdad Pedramfar

Reputation: 11073

There is a better solution if you look at it the other way around: for each key, replace it in the whole text with the value of that key:

def contractions(text, contractions_dict=dict):
    for k, v in contractions_dict.items():
        text = text.replace(k,v)
    return text

Also, note that:

Do NOT use dict as a variable name. It is a built-in name in Python, and assigning to it hides the built-in type.

The sample input and output:

In [42]: contractions('this is wd and lol?')
Out[42]: 'this is well done and laugh out loud?'
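
To illustrate the naming point, here is a small sketch of what happens once dict is rebound to a dictionary instance. The function name demo_shadowing is hypothetical and exists only to keep the shadowing local.

```python
def demo_shadowing():
    """Show that rebinding the name dict hides the built-in type."""
    dict = {"wd": "well done"}  # local name now refers to an instance
    try:
        dict(a=1)  # would normally build {'a': 1}
    except TypeError as exc:
        return str(exc)
    return None


# The error message reports that a dict object is not callable.
print(demo_shadowing())

# Outside the function the built-in is untouched and still works.
print(dict(a=1))
```

Choosing a descriptive name such as contractions_dict avoids the problem entirely.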

Upvotes: 1

adnanmuttaleb

Reputation: 3604

You can solve your problem by using a text tokenizer. The NLTK library provides many of them, such as WordPunctTokenizer, which you can use as follows:

from nltk.tokenize import WordPunctTokenizer
text = "Good muffins cost $3.88\nin New York.  Please buy me\ntwo of them.\n\nThanks."
print(WordPunctTokenizer().tokenize(text))

This will output:

    ['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']

As you can see, it can tokenize fairly complex sentences.
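
To connect tokenizing back to the question, here is a rough sketch that looks up each token in the table and then rebuilds the text. It does not import NLTK; instead it assumes a regular expression that splits words and punctuation the same way and additionally keeps whitespace runs as tokens. The names TOKEN_RE, contractions_dict and expand_tokens are illustrative only.

```python
import re

# Hypothetical lookup table for illustration.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}

# Word runs, punctuation runs, and whitespace runs. The first two
# alternatives are assumed to mirror how WordPunctTokenizer splits text.
TOKEN_RE = re.compile(r"\w+|[^\w\s]+|\s+")


def expand_tokens(text, mapping=contractions_dict):
    """Expand tokens found in the mapping and leave everything else as is."""
    tokens = TOKEN_RE.findall(text)
    # Whitespace is kept as tokens, so joining restores the original spacing.
    return "".join(mapping.get(token.lower(), token) for token in tokens)


print(expand_tokens("This is wd, LOL?"))
```

Because the lookup uses token.lower(), "LOL" is expanded as well, matching the case-insensitive check in the question's original code.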

Upvotes: 0

Daweo

Reputation: 36390

As already noted, .split() splits only at whitespace. If you wish to extract words and numbers from a string, you might use the re module in the following way:

import re
a = 'This, is. (example) for :testing: 123!'
words = re.findall(r'\w+',a)
print(words) #['This', 'is', 'example', 'for', 'testing', '123']

As you can see, it discards spaces, dots, commas, colons and so on, while keeping sequences consisting of letters, digits and underscores (_).
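
Building on the \w+ pattern, one possible way to apply it to the original problem is re.sub with a callback, which replaces matching words in place and keeps the punctuation around them. This is a sketch under the assumption that keys consist only of word characters; contractions_dict and expand_with_sub are hypothetical names.

```python
import re

# Hypothetical lookup table for illustration.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}


def expand_with_sub(text, mapping=contractions_dict):
    """Expand whole words in a single pass over the text."""

    def lookup(match):
        word = match.group(0)
        # Fall back to the original word when it is not a known key.
        return mapping.get(word.lower(), word)

    return re.sub(r"\w+", lookup, text)


print(expand_with_sub("lol? Lollipop, wd!"))
```

Since the text is scanned once, words introduced by a replacement are not scanned again.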

Upvotes: 0

João Almeida

Reputation: 5067

Your mistake comes from the way you split your text. By default, str.split() in Python splits on whitespace, which means that "lol?" is not split.

As you can see in the documentation, str.split() accepts only a single separator string, not a list of separating characters. To split on several different characters, you can use re.split() from the re module with a character class.

You could solve this specific problem by using:

re.split(r'[ ?]', text)

But most probably you will want many more characters to act as separators.
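
If many more separators are needed, one option is to build the separator set from string.punctuation rather than listing characters by hand. The sketch below assumes ASCII punctuation is enough for the data; SEPARATORS and split_words are illustrative names.

```python
import re
import string

# Character class of whitespace plus every ASCII punctuation character.
SEPARATORS = re.compile(r"[\s" + re.escape(string.punctuation) + r"]+")


def split_words(text):
    """Split on runs of whitespace and punctuation, dropping empty pieces."""
    # Empty strings appear when the text starts or ends with a separator.
    return [part for part in SEPARATORS.split(text) if part]


print(split_words("wd, lol? ok"))
```

Note that the underscore is part of string.punctuation, so it also acts as a separator here.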

Upvotes: 1
