Reputation: 85
I have a dictionary, for example:
dict = { "wd":"well done", "lol":"laugh out loud"}
The problem is that if the text contains something like "lol?", it is not expanded at all. Below is the code I am using to replace the dictionary keys:
def contractions(text, contractions_dict=dict):
    for word in text.split():
        if word.lower() in contractions_dict:
            text = text.replace(word, contractions_dict[word.lower()])
    return text
The problem is due to the missing space between 'lol' and '?'. How do I resolve this?
The updated code, following the suggestions, is as follows:
dict1 = {
    "wd": "well done",
    "lol": "laugh out loud"
}

def contractions(text, contractions_dict=dict1):
    for key in contractions_dict:
        text = text.replace(key, contractions_dict[key])
    return text

text = "lol?"
text = contractions(text)
print(text)
This works for the example above, but on long text the code makes undesired replacements.
Example: lwhyear olduckwhyeahhnt lookingiaandteam effortato representhinking of whyear oldwhyear oldugh lwhyear olduckwhyeahhahandal seato
This is part of the result that I am getting on my actual data. I need help.
Upvotes: 2
Views: 1639
Reputation: 823
Instead of checking whether each word of the text is in the dictionary, iterate through the dictionary and check whether the key is in the word. This is not recommended, though, because it uses nested loops.
def contractions(text, contractions_dict=dict):
    for word in text.split():
        for key in contractions_dict:
            # Substring test, so "lol" is found inside "lol?"
            if key in word:
                text = text.replace(word, contractions_dict[key])
    return text
Instead, you might want to simply replace every occurrence of every key using the replace method. replace finds and replaces the key on its own, so there is no need to iterate over the text yourself.
def contractions(text, contractions_dict=dict):
    for key in contractions_dict:
        text = text.replace(key, contractions_dict[key])
    return text
Upvotes: 1
Reputation: 11073
There is a better solution if you look at it the other way around: for each key, replace it in the whole text with the value of that key:
def contractions(text, contractions_dict=dict):
    for k, v in contractions_dict.items():
        text = text.replace(k, v)
    return text
Also, note: do NOT use dict as a variable name. It is the name of a built-in in Python, and assigning to it shadows the built-in.
The sample input and output:
In [42]: contractions('this is wd and lol?')
Out[42]: 'this is well done and laugh out loud?'
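One caveat with plain str.replace is that it also matches keys inside longer words, which is the kind of garbled output shown in the question. Below is a minimal sketch of a word-boundary variant, assuming the keys are plain word-like abbreviations. The function name expand_contractions and the parameter name mapping are illustrative, not from the original code.

```python
import re

# Illustrative dictionary with the same entries as in the question.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}

def expand_contractions(text, mapping=contractions_dict):
    # One alternation of all escaped keys, wrapped in \b word boundaries,
    # so "lol" matches in "lol?" but not inside "lollipop".
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(k) for k in mapping) + r")\b",
        flags=re.IGNORECASE,
    )
    # Look up the lowercased match so "LOL" is expanded as well.
    return pattern.sub(lambda m: mapping[m.group(1).lower()], text)

print(expand_contractions("lol? that was wd, not a lollipop"))
```

With the two sample keys this leaves "lollipop" untouched, whereas str.replace would rewrite the "lol" inside it.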
Upvotes: 1
Reputation: 3604
You can solve your problem by using a text tokenizer. The NLTK library provides many of them, such as WordPunctTokenizer, which you can use as follows:
from nltk.tokenize import WordPunctTokenizer
text = "Good muffins cost $3.88\nin New York. Please buy me\ntwo of them.\n\nThanks."
print(WordPunctTokenizer().tokenize(text))
This will output:
['Good', 'muffins', 'cost', '$', '3', '.', '88', 'in', 'New', 'York',
'.', 'Please', 'buy', 'me', 'two', 'of', 'them', '.', 'Thanks', '.']
As you can see, it separates punctuation from words, so a token such as "lol?" would be split into "lol" and "?".
Upvotes: 0
Reputation: 36390
As already noted, .split() splits only at whitespace. If you wish to extract words and numbers from a string, you might use the re module for that task in the following way:
import re
a = 'This, is. (example) for :testing: 123!'
words = re.findall(r'\w+',a)
print(words) #['This', 'is', 'example', 'for', 'testing', '123']
As you can see, it discards spaces, dots, commas, colons and so on, while keeping sequences consisting of letters, digits and underscores (_).
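To apply this to the contraction dictionary from the question, one option is to pass a callback to re.sub so that each \w+ match is looked up as a whole word. This is only a sketch under that assumption. The names expand_words, lookup and mapping are made up for illustration.

```python
import re

# Illustrative dictionary with the same entries as in the question.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}

def expand_words(text, mapping=contractions_dict):
    def lookup(match):
        word = match.group(0)
        # Replace only when the whole word is a key; otherwise keep it as is.
        return mapping.get(word.lower(), word)
    # Punctuation and spaces are never matched by \w+, so they stay in place.
    return re.sub(r"\w+", lookup, text)

print(expand_words("Lol? wd! What a crowd."))
```

Because the lookup is done per whole word, "crowd" is not changed even though it contains "wd".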
Upvotes: 0
Reputation: 5067
Your mistake comes from the way you split your text. By default, str.split() in Python splits on whitespace, which means that "wtf?" is not split.
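As you can see in the documentation, str.split() accepts only a single separator string, not a list of separating characters, so passing a list raises a TypeError.
You could solve this specific problem by using re.split() from the re module with a character class:
re.split(r"[ ?]", text)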
But most probably, you want many more characters to be used as separation points.
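Since str.split() takes a single separator string, one way to split on many characters at once is re.split() with a character class. The sketch below is an assumption about how that could be used for the expansion task, not the original poster's code. The separator set and the name expand_by_splitting are illustrative.

```python
import re

# Illustrative dictionary with the same entries as in the question.
contractions_dict = {"wd": "well done", "lol": "laugh out loud"}

def expand_by_splitting(text, mapping=contractions_dict):
    # The capturing group makes re.split keep the separators in the result,
    # so the text can be rebuilt exactly after the lookup.
    pieces = re.split(r"([\s?!.,;:]+)", text)
    return "".join(mapping.get(p.lower(), p) for p in pieces)

print(expand_by_splitting("lol? ok, wd."))
```

Extend the character class with whatever other punctuation appears in your data.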
Upvotes: 1