Reputation: 1239
There are so many guides on how to tokenize a sentence, but i didn't find any on how to do the opposite.
import nltk
words = nltk.word_tokenize("I've found a medicine for my disease.")
result I get is: ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
Is there any function than reverts the tokenized sentence to the original state. The function tokenize.untokenize()
for some reason doesn't work.
Edit:
I know that I can do for example this and this probably solves the problem but I am curious is there an integrated function for this:
result = ' '.join(sentence).replace(' , ',',').replace(' .','.').replace(' !','!')
result = result.replace(' ?','?').replace(' : ',': ').replace(' \'', '\'')
Upvotes: 46
Views: 57110
Reputation: 474001
You can use "treebank detokenizer" - TreebankWordDetokenizer
:
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'
There is also MosesDetokenizer
which was in nltk
but got removed because of the licensing issues, but it is available as a Sacremoses
standalone package.
Upvotes: 76
Reputation: 26976
from nltk.tokenize.treebank import TreebankWordDetokenizer
TreebankWordDetokenizer().detokenize(['the', 'quick', 'brown'])
# 'The quick brown'
Upvotes: 4
Reputation: 683
The reason there is no simple answer is you actually need the span locations of the original tokens in the string. If you don't have that, and you aren't reverse engineering your original tokenization, your reassembled string is based on guesses about the tokenization rules that were used. If your tokenizer didn't give you spans, you can still do this if you have three things:
1) The original string
2) The original tokens
3) The modified tokens (I'm assuming you have changed the tokens in some way, because that is the only application for this I can think of if you already have #1)
Use the original token set to identify spans (wouldn't it be nice if the tokenizer did that?) and modify the string from back to front so the spans don't change as you go.
Here I'm using TweetTokenizer but it shouldn't matter as long as the tokenizer you use doesn't change the values of your tokens so that they aren't actually in the original string.
tokenizer=nltk.tokenize.casual.TweetTokenizer()
string="One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a horrible vermin."
tokens=tokenizer.tokenize(string)
replacement_tokens=list(tokens)
replacement_tokens[-3]="cute"
def detokenize(string,tokens,replacement_tokens):
spans=[]
cursor=0
for token in tokens:
while not string[cursor:cursor+len(token)]==token and cursor<len(string):
cursor+=1
if cursor==len(string):break
newcursor=cursor+len(token)
spans.append((cursor,newcursor))
cursor=newcursor
i=len(tokens)-1
for start,end in spans[::-1]:
string=string[:start]+replacement_tokens[i]+string[end:]
i-=1
return string
>>> detokenize(string,tokens,replacement_tokens)
'One morning, when Gregor Samsa woke from troubled dreams, he found himself transformed in his bed into a cute vermin.'
Upvotes: 1
Reputation: 11
For me, it worked when I installed python nltk 3.2.5,
pip install -U nltk
then,
import nltk
nltk.download('perluniprops')
from nltk.tokenize.moses import MosesDetokenizer
If you are using insides pandas dataframe, then
df['detoken']=df['token_column'].apply(lambda x: detokenizer.detokenize(x, return_str=True))
Upvotes: 1
Reputation: 3042
I am using following code without any major library function for detokeization purpose. I am using detokenization for some specific tokens
_SPLITTER_ = r"([-.,/:!?\";)(])"
def basic_detokenizer(sentence):
""" This is the basic detokenizer helps us to resolves the issues we created by our tokenizer"""
detokenize_sentence =[]
words = sentence.split(' ')
pos = 0
while( pos < len(words)):
if words[pos] in '-/.' and pos > 0 and pos < len(words) - 1:
left = detokenize_sentence.pop()
detokenize_sentence.append(left +''.join(words[pos:pos + 2]))
pos +=1
elif words[pos] in '[(' and pos < len(words) - 1:
detokenize_sentence.append(''.join(words[pos:pos + 2]))
pos +=1
elif words[pos] in ']).,:!?;' and pos > 0:
left = detokenize_sentence.pop()
detokenize_sentence.append(left + ''.join(words[pos:pos + 1]))
else:
detokenize_sentence.append(words[pos])
pos +=1
return ' '.join(detokenize_sentence)
Upvotes: 0
Reputation: 8652
I propose to keep offsets in tokenization: (token, offset). I think, this information is useful for processing over the original sentence.
import re
from nltk.tokenize import word_tokenize
def offset_tokenize(text):
tail = text
accum = 0
tokens = self.tokenize(text)
info_tokens = []
for tok in tokens:
scaped_tok = re.escape(tok)
m = re.search(scaped_tok, tail)
start, end = m.span()
# global offsets
gs = accum + start
ge = accum + end
accum += end
# keep searching in the rest
tail = tail[end:]
info_tokens.append((tok, (gs, ge)))
return info_token
sent = '''I've found a medicine for my disease.
This is line:3.'''
toks_offsets = offset_tokenize(sent)
for t in toks_offsets:
(tok, offset) = t
print (tok == sent[offset[0]:offset[1]]), tok, sent[offset[0]:offset[1]]
Gives:
True I I
True 've 've
True found found
True a a
True medicine medicine
True for for
True my my
True disease disease
True . .
True This This
True is is
True line:3 line:3
True . .
Upvotes: 1
Reputation: 981
use token_utils.untokenize
from here
import re
def untokenize(words):
"""
Untokenizing a text undoes the tokenizing operation, restoring
punctuation and spaces to the places that people expect them to be.
Ideally, `untokenize(tokenize(text))` should be identical to `text`,
except for line breaks.
"""
text = ' '.join(words)
step1 = text.replace("`` ", '"').replace(" ''", '"').replace('. . .', '...')
step2 = step1.replace(" ( ", " (").replace(" ) ", ") ")
step3 = re.sub(r' ([.,:;?!%]+)([ \'"`])', r"\1\2", step2)
step4 = re.sub(r' ([.,:;?!%]+)$', r"\1", step3)
step5 = step4.replace(" '", "'").replace(" n't", "n't").replace(
"can not", "cannot")
step6 = step5.replace(" ` ", " '")
return step6.strip()
tokenized = ['I', "'ve", 'found', 'a', 'medicine', 'for', 'my','disease', '.']
untokenize(tokenized)
"I've found a medicine for my disease."
Upvotes: 6
Reputation: 122142
To reverse word_tokenize
from nltk
, i suggest looking in http://www.nltk.org/_modules/nltk/tokenize/punkt.html#PunktLanguageVars.word_tokenize and do some reverse engineering.
Short of doing crazy hacks on nltk, you can try this:
>>> import nltk
>>> import string
>>> nltk.word_tokenize("I've found a medicine for my disease.")
['I', "'ve", 'found', 'a', 'medicine', 'for', 'my', 'disease', '.']
>>> tokens = nltk.word_tokenize("I've found a medicine for my disease.")
>>> "".join([" "+i if not i.startswith("'") and i not in string.punctuation else i for i in tokens]).strip()
"I've found a medicine for my disease."
Upvotes: 13
Reputation: 2503
The reason tokenize.untokenize
does not work is because it needs more information than just the words. Here is an example program using tokenize.untokenize
:
from StringIO import StringIO
import tokenize
sentence = "I've found a medicine for my disease.\n"
tokens = tokenize.generate_tokens(StringIO(sentence).readline)
print tokenize.untokenize(tokens)
Additional Help:
Tokenize - Python Docs |
Potential Problem
Upvotes: 0
Reputation: 12092
Use the join function:
You could just do a ' '.join(words)
to get back the original string.
Upvotes: -3