utengr

Reputation: 3355

Remove words that consist of numbers and punctuation from text

Is there a better or more efficient way to remove words that consist only of digits, only of punctuation characters, or of a combination of the two from some text?

example 1:

input_text = 'This is a test number +223/34 and this a real number 2333.'
expected_output = 'This is a test number and this a real number .'

example 2:

input_text = 'This is a test-number +223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com which sells 3-D products'
expected_output = 'This is a test-number and this a real number . The email is [email protected] and the website is www.test.com which sells 3-D products'

Currently, I have something like this.

import string

def is_valid_word(word):
    # Strip all punctuation; the word is invalid if only digits remain.
    return not word.translate(str.maketrans('', '', string.punctuation)).isdigit()

clean_text = " ".join([word for word in input_text.split() if is_valid_word(word)])
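For comparison, here is a minimal sketch of a token filter that also drops tokens made up of punctuation alone (the snippet above keeps a lone "-", because an empty string is not a digit string). The names is_noise and drop_noise_tokens and the sample sentence are hypothetical, introduced only for illustration. Note that it removes "2333." as a whole, so the trailing period from the expected output is not preserved.

```python
import string

# Characters that, on their own, never make a token worth keeping.
_DIGITS_AND_PUNCT = frozenset(string.digits + string.punctuation)

def is_noise(token):
    """Return True if the token consists only of digits and/or punctuation."""
    return all(ch in _DIGITS_AND_PUNCT for ch in token)

def drop_noise_tokens(text):
    """Split on whitespace and keep tokens that contain at least one other character."""
    return " ".join(tok for tok in text.split() if not is_noise(tok))

sample = "This is a test-number +223/34 and this a real number 2333. It sells 3-D products - cheap"
print(drop_noise_tokens(sample))
# This is a test-number and this a real number It sells 3-D products cheap
```

Tokens such as test-number and 3-D survive because they contain at least one letter.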

Upvotes: 0

Views: 353

Answers (2)

Wiktor Stribiżew

Reputation: 627488

You can simply check if a token is alphanumeric:

clean_text  = " ".join([word for word in input_text.split() if word.isalnum()])

See a Python demo:

input_text = 'This is a test number +223/34 and this a real number 2333.'
print( " ".join([word for word in input_text.split() if word.isalnum()]) )
# => This is a test number and this a real number
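One caveat worth checking before relying on isalnum(): it keeps purely numeric tokens and rejects any token that contains punctuation. A short sketch, using a hypothetical list of tokens taken from the question's examples:

```python
# Tokens borrowed from the question's two examples (chosen for illustration).
tokens = ["2333", "2333.", "test-number", "www.test.com", "3-D", "number"]

# str.isalnum() is True only when every character is a letter or a digit.
kept = [t for t in tokens if t.isalnum()]
print(kept)
# ['2333', 'number']
```

So a bare 2333 would stay in the text, while test-number, www.test.com and 3-D would be removed, which may or may not be what you want.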

If you have specific patterns in mind, you can write a dedicated regex to find the matching strings and delete them with re.sub. For example, if you want to remove standalone numbers that may contain math operators, dots, or commas between the digits, you can use the following:

import re
input_text = 'This is a test number +223/34 and this a real number 2333. The email is [email protected] and the website is www.test.com.'
print( re.sub(r'[-+]?\b\d+(?:[.,+/*-]\d+)*\b', '', input_text) )

This yields the following (note the double space left where +223/34 was removed):

This is a test number  and this a real number . The email is [email protected] and the website is www.test.com.

See the Python demo. The regex means:

  • [-+]? - an optional + or -
  • \b - a word boundary (the digit cannot be glued to a word)
  • \d+ - one or more digits
  • (?:[.,+/*-]\d+)* - zero or more repetitions of one of the characters . , + / * - followed by one or more digits
  • \b - a word boundary (the digit cannot be glued to a word).
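As a rough sketch of how the pattern described above could be wrapped up, the following compiles it once and collapses the extra spaces left behind by the removal. The names NUMBER_RE and strip_numbers are hypothetical, and the space-collapsing step is an addition not present in the answer.

```python
import re

# Same pattern as in the answer, compiled once for reuse.
NUMBER_RE = re.compile(r'[-+]?\b\d+(?:[.,+/*-]\d+)*\b')

def strip_numbers(text):
    """Remove standalone numbers, then collapse runs of spaces left behind."""
    without_numbers = NUMBER_RE.sub('', text)
    return re.sub(r' {2,}', ' ', without_numbers)

print(strip_numbers('This is a test number +223/34 and this a real number 2333.'))
# This is a test number and this a real number .
```

A limitation to be aware of: because \b also matches between a digit and a hyphen, a token such as 3-D from the question's second example loses its leading digit and becomes -D, so the pattern would need tightening if such tokens must be kept.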

Upvotes: 1

Liutprand

Reputation: 557

Your method seems OK, but if you want to use a regex (as the tag suggests), you can use this pattern to match every character that is not a lowercase or uppercase letter or a space:

[^a-zA-Z ]*

Then you can replace with an empty string.

import re

input_text = "This is a test number +223/34 and this a real number 2333."
clean_text = re.sub("[^a-zA-Z ]*", "", input_text)
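To see what this does in practice, here is a small sketch that applies the same pattern to the first example and to a fragment resembling the second one. The helper name letters_only and the second test string are hypothetical.

```python
import re

def letters_only(text):
    """Delete every character that is not an ASCII letter or a space."""
    return re.sub(r"[^a-zA-Z ]*", "", text)

print(repr(letters_only("This is a test number +223/34 and this a real number 2333.")))
# 'This is a test number  and this a real number '

print(repr(letters_only("www.test.com sells 3-D")))
# 'wwwtestcom sells D'
```

Because the pattern works per character rather than per word, it also strips punctuation and digits inside words (www.test.com becomes wwwtestcom, 3-D becomes D) and removes the final period, so it fits the first example better than the second.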

Upvotes: 1
