How to reconstruct original text from spaCy tokens, even in cases with complicated whitespacing and punctuation

Question

' '.join(token_list) does not reconstruct the original text when it contains runs of whitespace or several punctuation characters in a row.

For example:

from spacy.tokenizer import Tokenizer
from spacy.lang.en import English

nlp = English()
# Create a blank Tokenizer with just the English vocab
tokenizerSpaCy = Tokenizer(nlp.vocab)

context_text = 'this    is a     test \n \n \t\t test for    \n testing  -  ./l \t'

contextSpaCyToksSpaCyObj = tokenizerSpaCy(context_text)
spaCy_toks = [i.text for i in contextSpaCyToksSpaCyObj]

reconstruct = ' '.join(spaCy_toks)
reconstruct == context_text

>False

Is there an established way of reconstructing original text from spaCy tokens?

An established answer should also work with this edge-case text:

" UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016 RELEASE IN PART B5, B6 From: H Sent: Monday, July 23, 2012 7:26 AM To: 'millscd @state.gov' Cc: 'DanielJJ@state.gov.; 'hanleymr@state.gov' Subject Re: S speech this morning Waiting to hear if Monica can come by and pick up at 8 to take to Josh. If I don't hear from her, can you send B5 someone else? Original Message ---- From: Mills, Cheryl D [MillsCD@state.gov] Sent: Monday, July 23, 2012 07:23 AM To: H Cc: Daniel, Joshua J Subject: FW: S speech this morning See below B5 cdm Original Message From: Shah, Rajiv (AID/A) B6 Sent: Monday, July 23, 2012 7:19 AM To: Mills, Cheryl D Cc: Daniel, Joshua.' Subject: S speech this morning Hi cheryl, I look fwd to attending the speech this morning. I had one last minute request - I understand that in the final version there is no reference to the child survival call to action, but their is a reference to family planning efforts. Could you and josh try to make sure there is some specific reference to the call to action? Also, in terms of acknowledgements it would be good to note torn friedan's leadership as everyone is sensitive to our ghi transition and we want to continue to send the usaid-pepfar-cdc working together public message. I don't know if he is there, but wanted to flag. Look forward to it. Raj UNCLASSIFIED U.S. Department of State Case No. F-2014-20439 Doc No. C05795279 Date: 01/07/2016 \x0c"

Sofie VL · Accepted Answer

You can very easily accomplish this by changing two lines in your code:

spaCy_toks = [i.text + i.whitespace_ for i in contextSpaCyToksSpaCyObj]
reconstruct = ''.join(spaCy_toks)

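Each token in spaCy knows whether it is followed by a space: token.whitespace_ is either a single space or an empty string, and any additional whitespace in the input is kept as tokens of its own. So you append token.whitespace_ to each token's text and join on the empty string, instead of assuming a single space between every pair of tokens.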
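As a minimal sketch of the fix applied to the example from the question (assuming spaCy is installed; the variable names here are illustrative and not from the original post), the snippet below rebuilds the text from token.text plus token.whitespace_. It also shows two built-in equivalents that, as far as I know, give the same result: token.text_with_ws and doc.text.

```python
from spacy.lang.en import English
from spacy.tokenizer import Tokenizer

nlp = English()
# Blank tokenizer with just the English vocab, as in the question
tokenizer = Tokenizer(nlp.vocab)

context_text = 'this    is a     test \n \n \t\t test for    \n testing  -  ./l \t'
doc = tokenizer(context_text)

# Token text plus its trailing space (' ' or ''), joined without a separator
from_whitespace = ''.join(tok.text + tok.whitespace_ for tok in doc)

# text_with_ws is the token text with the trailing space already appended
from_text_with_ws = ''.join(tok.text_with_ws for tok in doc)

print(from_whitespace == context_text)
print(from_text_with_ws == context_text)
print(doc.text == context_text)
```

If you still have the Doc object, doc.text is the shortest route. The per-token join is useful when you only keep the tokens, or when you want to rebuild a sub-span of the text.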
