I would like to remove dashes, but not hyphens, from sentences as part of NLP preprocessing.
Input
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
Expected Output
#output
['A former employee of the accused company','offered a statement off the record.']
['He is afraid of two things', 'spiders and senior prom.']
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']
The above sentences are from the following two articles about hyphen and dash.
samples = [
'A former employee of the accused company, — — —, offered a statement off the record.', #dash
'He is afraid of two things—spiders and senior prom.' #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
ignore_symbol = ['-']
for i in range(len(samples)):
    text = samples[i]
    ret = []
    for word in text.split(' '):
        ignore = len(word) <= 0
        for iw in ignore_symbol:
            if word == iw:
                ignore = True
                break
        if not ignore:
            ret.append(word)
    text = ' '.join(ret)
    samples[i] = text
print(samples)
#output
['A former employee of the accused company, — — —, offered a statement off the record.',
'He is afraid of two things—spiders and senior prom.
Fifty-six bottles of pop on the wall, fifty-six bottles of pop.']
for i in range(len(samples)):
    list_temp = []
    text = samples[i]
    list_temp.extend([x.strip() for x in text.split(',') if not x.strip() == ''])
    samples[i] = list_temp
print(samples)
#output
[['A former employee of the accused company',
'— — —',
'offered a statement off the record.'],
['He is afraid of two things—spiders and senior prom.Fifty-six bottles of pop on the wall',
'fifty-six bottles of pop.']]
Python 3.7.0
Upvotes: 0
Views: 486
Reputation: 334
First of all, the 2nd and 3rd sentences were combined because there is no comma separating the two strings. In Python, writing tmp = 'a''b' is equivalent to tmp = 'ab', which is why there are only 2 strings in samples (the 2nd and 3rd got merged).
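A minimal sketch of that effect, using shortened stand-in strings rather than the original sentences (the names merged and fixed are only for illustration):

```python
# Adjacent string literals are concatenated by the parser,
# so a missing comma silently merges two list elements.
merged = [
    'first sentence.',
    'second sentence.'   # no comma here
    'third sentence.'
]
fixed = [
    'first sentence.',
    'second sentence.',
    'third sentence.'
]

print(len(merged))  # 2
print(merged[1])    # second sentence.third sentence.
print(len(fixed))   # 3
```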
On to your question:
The function remove_dash_preserve_hyphen below removes all dashes from its str_sentence parameter and returns the cleaned string.
It is then applied to every string in the samples list, producing the cleaned list samples_without_dash.
samples = [
'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
'He is afraid of two things — spiders and senior prom.',#**(COMMA HERE)** #dash
'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
]
def remove_dash_preserve_hyphen(str_sentence, dash_signatures=('—',)):
    # Remove every dash variant listed in dash_signatures; hyphens ('-') are left untouched.
    for dash_sig in dash_signatures:
        str_sentence = str_sentence.replace(dash_sig, '')
    return str_sentence
samples_without_dash = [remove_dash_preserve_hyphen(sentence) for sentence in samples]
The exact dash in question is the em dash, Unicode code point U+2014.
There may be other kinds of dashes in the samples that you also don't want. You need to track them down sample by sample and pass the list of all unwanted dash types as the dash_signatures parameter when calling remove_dash_preserve_hyphen.
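One possible way to track down the dash variants is sketched below. It assumes that "dash" means any character in the Unicode category 'Pd' (dash punctuation) other than the ASCII hyphen-minus, which may be broader or narrower than what you need. The helper names find_dashes and remove_dashes and the texts list are hypothetical, not part of the code above.

```python
import unicodedata

def find_dashes(sentences):
    """Return the set of dash-punctuation characters (category 'Pd') other than '-'."""
    return {
        ch
        for sentence in sentences
        for ch in sentence
        if unicodedata.category(ch) == 'Pd' and ch != '-'
    }

def remove_dashes(sentence, dash_signatures):
    """Delete every character listed in dash_signatures from the sentence."""
    for dash in dash_signatures:
        sentence = sentence.replace(dash, '')
    return sentence

# Illustrative inputs: an en dash (U+2013), an em dash (U+2014), and a plain hyphen.
texts = ['pages 3\u20135', 'two things \u2014 spiders', 'Fifty-six bottles']

found = find_dashes(texts)
cleaned = [remove_dashes(t, found) for t in texts]
print(sorted(found))
print(cleaned)
```

Note that deleting a dash outright can leave a double space or join two words; replacing it with a space or a comma instead may suit some pipelines better.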
Upvotes: 0
Reputation: 15872
If you are looking for a non-regex solution: the Unicode code point of the em dash is 8212 (U+2014), so you can replace each one with ',', then split on ',' and keep only the non-whitespace pieces:
>>> samples = [
...     'A former employee of the accused company, ———, offered a statement off the record.', #three dashes
...     'He is afraid of two things — spiders and senior prom.', #dash
...     'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.' #hyphen
... ]
>>> output = [[
...     sentence.strip() for sentence in elem.replace(chr(8212), ',').split(',')
...     if sentence.strip()
... ] for elem in samples]
>>> output
[['A former employee of the accused company',
'offered a statement off the record.'],
['He is afraid of two things', 'spiders and senior prom.'],
['Fifty-six bottles of pop on the wall', 'fifty-six bottles of pop.']]
Upvotes: 1
Reputation: 552
Try a regex (regular expression) split with re.split; Python's str.split() is too limited for this. You'll need to pass the Unicode escape for the dash character (U+2014, the em dash), not the hyphen-minus (U+002D), which should be preserved.
Something like:
re.split('[\u2014]', text)
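A slightly fuller sketch of the re.split approach, assuming the goal is to split on commas and em dashes (U+2014) while leaving hyphens alone. SPLIT_PATTERN and split_sentence are hypothetical names, and the character class would need extending if other dash variants occur in the data.

```python
import re

# Split on commas or em dashes (U+2014); the hyphen-minus (U+002D) is not in the class.
SPLIT_PATTERN = re.compile('[,\u2014]')

def split_sentence(text):
    """Split text on commas/em dashes and drop empty or whitespace-only pieces."""
    parts = (part.strip() for part in SPLIT_PATTERN.split(text))
    return [part for part in parts if part]

samples = [
    'A former employee of the accused company, \u2014\u2014\u2014, offered a statement off the record.',
    'He is afraid of two things \u2014 spiders and senior prom.',
    'Fifty-six bottles of pop on the wall, fifty-six bottles of pop.',
]

for sample in samples:
    print(split_sentence(sample))
```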
Upvotes: 0