user11749375

Preprocessing: removing dashes, but not hyphens, from sentences

What I would like to do

I would like to remove dashes, but keep hyphens, in sentences as part of NLP preprocessing.

Input

samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

Expected Output

#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']

The sample sentences above are taken from two articles about hyphens and dashes.

Problem

  1. The first step, which was meant to remove the symbol '-', failed, and I do not understand why the second and third sentences were merged into a single string.
#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
  2. I have no idea how to write code that distinguishes a hyphen from a dash.

Current Code

samples = [
    'A former employee of the accused company, — — —, offered a statement off the record.', #dash
    'He is afraid of two things—spiders and senior prom.' #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0 
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)

    text = ' '.join(ret)
    samples[i] = text
print(samples)

#output
['A former employee of the accused company, — — —, offered a statement off the record.', 
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']

for i in range (len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)

#output
[['A former employee of the accused company',
  '— — —',
  'offered a statement off the record.'],
 ['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
  'fifty-six bottles of pop.']]

Development Environment

Python 3.7.0

Upvotes: 0

Views: 486

Answers (3)

Puneet Singh

Reputation: 334

First of all, the 2nd and 3rd sentences were combined because there is no comma separating the two strings. In Python, adjacent string literals are concatenated, so tmp = 'a' 'b' is equivalent to tmp = 'ab'. That is why samples contains only 2 strings (the 2nd and 3rd were merged).
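
To see the implicit concatenation in isolation, here is a minimal sketch. The list contents and the names merged and fixed are made up for illustration:

```python
# Adjacent string literals are joined into one string, so a missing
# comma silently merges two list items into a single item.
merged = [
    'first sentence.',
    'second sentence.'   # <- no comma here
    'third sentence.',
]

fixed = [
    'first sentence.',
    'second sentence.',  # comma restored
    'third sentence.',
]

print(len(merged))  # 2
print(merged[1])    # second sentence.third sentence.
print(len(fixed))   # 3
```

Note that the merged item has no space between the two sentences, which matches the 'prom.Fifty-six' seen in the question's last output.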

On to your question: the function remove_dash_preserve_hyphen below removes every dash listed in dash_signatures from the str_sentence parameter and returns the cleaned string. It is then applied to every string in the samples list, producing samples_without_dash.

samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.',  # dash (note the comma added at the end of this line)
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]

def remove_dash_preserve_hyphen(str_sentence, dash_signatures=('—',)):  # tuple avoids a mutable default argument
    for dash_sig in dash_signatures:
        str_sentence = str_sentence.replace(dash_sig, '')
    return str_sentence

samples_without_dash = [remove_dash_preserve_hyphen(sentence) for sentence in samples]

The dash in question is the em dash, Unicode U+2014. Your samples may contain other kinds of dashes that you also want to remove (for example the en dash, U+2013). You need to identify them in your data and pass all unwanted dash characters in the dash_signatures parameter when calling remove_dash_preserve_hyphen.
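
One possible way to track down the dash characters in a dataset is to look at each character's Unicode general category. This is only a sketch: find_dashes and its keep parameter are hypothetical names, and it assumes that every character in category 'Pd' (Punctuation, dash) other than the ASCII hyphen-minus is unwanted. The en dash in the second sample was swapped in for illustration.

```python
import unicodedata

def find_dashes(sentences, keep=('-',)):
    """Return the set of dash-like characters found in sentences.

    A character counts as dash-like when its Unicode general category
    is 'Pd' (Punctuation, dash). Characters listed in keep are skipped,
    so the ASCII hyphen-minus is ignored by default.
    """
    found = set()
    for sentence in sentences:
        for ch in sentence:
            if ch not in keep and unicodedata.category(ch) == 'Pd':
                found.add(ch)
    return found

samples = [
    'A former employee of the accused company, \u2014\u2014\u2014, offered a statement off the record.',
    'He is afraid of two things \u2013 spiders and senior prom.',  # en dash, for illustration
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

dashes = find_dashes(samples)
for ch in sorted(dashes):
    # Print the code point and official name of each dash found.
    print('U+%04X' % ord(ch), unicodedata.name(ch))
```

The resulting set could then be passed as dash_signatures. Be aware that category 'Pd' also contains U+2010 HYPHEN, so add it to keep if your text uses that character for hyphenated words.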

Upvotes: 0

Sayandip Dutta

Reputation: 15872

If you are looking for a non-regex solution: the Unicode code point of the em dash is 8212 (U+2014), so you can replace each em dash with ',', split on ',', and keep only the pieces that are not empty or whitespace:

>>> samples = [
    'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
    'He is afraid of two things — spiders and senior prom.', #dash
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
>>> output = [[
               sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',') 
               if sentence.strip()
              ] for elem in samples]
>>> output
[['A former employee of the accused company',
  'offered a statement off the record.'],
 ['He is afraid of two things', 'spiders and senior prom.'],
 ['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]

Upvotes: 1

dzimney

Reputation: 552

Try using a regex (regular expression) split with re.split. Python's str.split() is too limited for this, because it splits on only one separator at a time. You'll then need to pass the Unicode escape for the dash character (the em dash is U+2014), not the hyphen (U+002D).

Something like:

import re
re.split('[\u2014]', text)
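
A fuller sketch of the re.split idea, assuming the goal is the expected output from the question: split on commas and on the dash characters U+2012 to U+2015, while leaving the ASCII hyphen-minus (U+002D) untouched. The names SPLIT_PATTERN and split_sentence are made up for illustration.

```python
import re

# Character class: a comma, or any of U+2012 (figure dash), U+2013 (en dash),
# U+2014 (em dash), U+2015 (horizontal bar). The ASCII hyphen-minus U+002D
# is not in the class, so hyphenated words such as 'Fifty-six' stay intact.
SPLIT_PATTERN = re.compile('[,\u2012-\u2015]')

def split_sentence(text):
    """Split text on commas and dashes, dropping empty pieces."""
    parts = (part.strip() for part in SPLIT_PATTERN.split(text))
    return [part for part in parts if part]

samples = [
    'A former employee of the accused company, ———, offered a statement off the record.',
    'He is afraid of two things — spiders and senior prom.',
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

output = [split_sentence(sentence) for sentence in samples]
for pieces in output:
    print(pieces)
```

Compiling the pattern once keeps the loop cheap, and widening the character class is enough if other dash variants turn up in the data.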

Upvotes: 0
