Reputation: 11
After preprocessing tweets, I get what looks like an empty string as one of the most common tokens. I already tried re.sub with both " " (space) and "" (empty string) as the replacement, but I can't get rid of it.
I thought it might occur when there are three spaces in a row, so I tried re.sub(r'(?<= ) (?= )', '', ...),
but that didn't work.
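For reference, a simpler one-pass way to collapse whitespace runs would be a sketch like this (text is a placeholder string, not a variable from my pipeline):

import re

text = "wer   hält  sie   auf"
collapsed = re.sub(r'\s+', ' ', text).strip()  # any whitespace run -> single space

Though, as it turned out, the problem wasn't whitespace at all.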
I also tried getting the index of the empty string with tokens.index('') and got ValueError: '' is not in list. However, when I copy-paste the "empty" string from the output, the lookup does return an index.
Any ideas what's going on here?
import re
import string
import nltk

# Join all tweets into one string (all_tweets is a list of tweet texts)
tweets_joined = " ".join(all_tweets)
# Remove URLs (t.co short links)
no_links = re.sub(r'https://t\.co/\w{10}', ' ', tweets_joined)
# Remove emojis
emojis = re.compile("(["
    u"\U0001F600-\U0001F64F"  # emoticons
    u"\U000025B0-\U000025BF"  # geometric shapes
    u"\U00002190-\U000021FF"  # arrows
    u"\U000027A0-\U000027AF"  # arrows
    u"\U0001F300-\U0001F5FF"  # symbols & pictographs
    u"\U0001F680-\U0001F6FF"  # transport & map symbols
    u"\U0001F1E0-\U0001F1FF"  # flags (iOS)
    "])", flags=re.UNICODE)
no_emojis = emojis.sub(' ', no_links)
# Join big numbers written with German thousands separators (e.g. 1.000.000 -> 1000000)
safe_numbers = re.sub(r'(?<=\d)\.(?=\d)', '', no_emojis)
# Remove newlines and ampersands
no_newlines = re.sub(r'\n', ' ', safe_numbers)
no_amp = re.sub(r'&', ' ', no_newlines)
# Remove punctuation, but keep # and @ for hashtags and mentions
interpunkt = string.punctuation + "„“–»«´’"
interpunkt = interpunkt.replace("#", "")
interpunkt = interpunkt.replace("@", "")
no_punct_text = no_amp
for punct in interpunkt:  # iterate over the punctuation characters
    no_punct_text = no_punct_text.replace(punct, ' ')  # replace each with a space
# Remove empty strings (this attempt failed)
no_empty_string = re.sub(r'(?<= ) (?= )', '', no_punct_text)
# Casefold
text_lower = no_empty_string.casefold()
# Tokenize
tokens = nltk.tokenize.word_tokenize(text_lower)
Among the most common tokens I then get are:
'#berlin',
'#bundestag',
'#brandner',
'️',
'wer',
'hält',
'sie',
'auf'
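For anyone else debugging this, printing the repr() of the most frequent tokens makes invisible characters show up; a minimal sketch using the tokens list from above:

from collections import Counter

# repr() shows escape sequences instead of invisible glyphs
for tok, count in Counter(tokens).most_common(10):
    print(repr(tok), count)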
Upvotes: 0
Views: 1013
Reputation: 11
So, I checked the length of the "empty" string, which was 1. Then I encoded it to a bytes object to see the underlying code point. It turned out to be a variation selector, an invisible character that is used to select variants of symbols, emojis, etc.
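A minimal sketch of that check (the literal '\ufe0f' stands in for the token copied from my output; in my case it was the emoji variation selector), plus one way to strip the whole variation-selector block before tokenizing:

import re
import unicodedata

tok = '\ufe0f'                       # paste the "empty" token here
print(len(tok))                      # 1
print(tok.encode('unicode_escape'))  # b'\\ufe0f'
print(unicodedata.name(tok))         # VARIATION SELECTOR-16

# Removing the variation-selector block early fixes the stray token
no_selectors = re.sub(r'[\ufe00-\ufe0f]', '', 'wer hält sie auf \ufe0f')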
Upvotes: 0
Reputation: 50220
Your workflow is... complicated. But sometimes the simplest or best regexes for tokenization will just generate empty tokens along with the good stuff. Instead of jumping through hoops to avoid empty tokens, just get rid of them by post-processing:
clean_tokens = [tok for tok in tokens if tok]
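Since the stray token here turned out to be a variation selector (which is not an empty string), a slightly stronger filter in the same spirit also drops tokens that consist only of marks or format/control characters. This is a sketch, not tested against your data:

import unicodedata

def has_visible_chars(tok):
    # Keep a token only if at least one character is not a mark (M*) or other (C*)
    return any(unicodedata.category(ch)[0] not in ('M', 'C') for ch in tok)

clean_tokens = [tok for tok in tokens if tok and has_visible_chars(tok)]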
Upvotes: 1