TheHumanSpider

Reputation: 73

How to preserve only the final occurrence of a character in a sentence?

I am trying to split the sentence into two columns (Review and Sentiment).

Let us assume that we have a sentence:

Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive

Here, "Positive" is the Sentiment and the text before it is the Review.

i) I cannot use \s as the delimiter to split the sentence into two columns (Review, Sentiment), because the review itself contains spaces.
ii) If I use '.' as the delimiter, the sentence contains multiple occurrences of '.'.

I have written code to remove the repeated occurrences of '.', shown below:

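import re

def clean(sentence):
  # Strip selected punctuation characters
  clear = re.sub(r"[,|\"|\"|\'|\'|\-|!|?|\/|*|:|\\|\(|\)|;|$]",'', sentence)
  # Collapse runs of a repeated non-word character (e.g. "...") into one space
  clear1 = re.sub(r'(\W)\1+',' ', clear)
  return ' '.join(clear1.split())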

which is able to remove "..." after the word "hi" but fails for "i.e" and "$7.50".

My desired result is:

Review: Hi I earn 7 dot 50 per hour i e 7 50 hr
Sentiment: Positive

My output is:

Hi I earn 7 dot 50 per hour i.e 7.50 hr.

PS: I am using pandas to load it as a DataFrame with two columns.

Edit1: In my case, the sentiment is always either "Positive" or "Negative".
Edit2: I am storing this output as a CSV file and reading it with pandas (read_csv()).

Upvotes: 1

Views: 83

Answers (5)

Casimir et Hippolyte

Reputation: 89557

Find all groups of word characters and slice the resulting list:

>>> import re
>>> s = 'Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive'
>>> l = re.findall(r'\w+', s)
>>> ' '.join(l[:-1])
'Hi I earn 7 dot 50 per hour i e 7 50 hr'
>>> l[-1]
'Positive'

Upvotes: 1

Florian. C

Reputation: 108

In your case, since you know that the sentiment will always be "Positive" or "Negative", you can get your two columns like this:

import re

# `sentence` is the raw input line, e.g. the example from the question
m = re.match(r"(?P<review>.*)\. (?P<sentiment>Positive|Negative)$", sentence)
m.group('review')
m.group('sentiment')
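
One thing to watch for: re.match returns None when a line does not end in ". Positive" or ". Negative", so calling m.group(...) directly would raise an AttributeError on such lines. A minimal sketch of a guard, where the helper name split_review and the constant PATTERN are made up for illustration:

```python
import re

# Same pattern as above, compiled once so it can be reused for every row.
PATTERN = re.compile(r"(?P<review>.*)\. (?P<sentiment>Positive|Negative)$")

def split_review(sentence):
    """Return (review, sentiment), or None if the line has no trailing label."""
    m = PATTERN.match(sentence)
    if m is None:
        return None
    return m.group('review'), m.group('sentiment')

result = split_review('Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive')
print(result)
```

Note that the review still contains its punctuation here; only the final ". " before the label is consumed by the pattern.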

Upvotes: 0

Toto

Reputation: 91430

How about re.split?

This splits on whitespace only when it is followed by Positive or Negative:

import re

sentence = 'Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive'
res = re.split(r'\s+(?=Positive|Negative)', sentence)
print(res)

Output:

['Hi... I earn 7 dot 50 per hour i.e $7.50/hr.', 'Positive']
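
The split above keeps the punctuation in the review, and the lookahead would also fire if "Positive" or "Negative" appeared in the middle of the review text. A possible variation, not part of the original answer, anchors the label to the end of the string and then replaces every run of non-word characters with a single space to reach the output asked for in the question. The names LABEL_SPLIT and split_and_clean are hypothetical:

```python
import re

# Split only on the whitespace directly before a label at the very end.
LABEL_SPLIT = re.compile(r'\s+(?=(?:Positive|Negative)$)')

def split_and_clean(sentence):
    """Return (cleaned_review, sentiment), or None if no trailing label."""
    parts = LABEL_SPLIT.split(sentence.strip())
    if len(parts) != 2:
        return None
    review, sentiment = parts
    # Collapse punctuation and whitespace runs into single spaces.
    review = re.sub(r'\W+', ' ', review).strip()
    return review, sentiment

print(split_and_clean('Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive'))
# ('Hi I earn 7 dot 50 per hour i e 7 50 hr', 'Positive')
```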

Upvotes: 0

deoomen

Reputation: 199

If you just need to match the last occurrence of the dot, you can use this regex:

\.(?!.*\.)

Example: https://regex101.com/r/OYkupF/2
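
In Python, that pattern could be used as the delimiter for a single split, assuming each input is a single line with no embedded newlines (since . does not match a newline by default). This is only a sketch; the names LAST_DOT and split_on_last_dot are invented for the example, and the \w+ cleanup step is borrowed from another answer in this thread:

```python
import re

# A dot that is not followed by any other dot later in the line.
LAST_DOT = re.compile(r'\.(?!.*\.)')

def split_on_last_dot(sentence):
    """Split at the final '.', then keep only word characters in the review."""
    parts = LAST_DOT.split(sentence, maxsplit=1)
    if len(parts) != 2:
        return None  # the line contains no dot at all
    review, sentiment = parts
    review = ' '.join(re.findall(r'\w+', review))
    return review, sentiment.strip()

print(split_on_last_dot('Hi... I earn 7 dot 50 per hour i.e $7.50/hr. Positive'))
# ('Hi I earn 7 dot 50 per hour i e 7 50 hr', 'Positive')
```

This relies on the review always ending with a dot right before the label; a review without a final dot would be split in the wrong place.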

Upvotes: 1

Raunaq Jain

Reputation: 917

If the sentiment is always the single word 'Positive' or 'Negative', you can split on whitespace and take the last token:

def clean(sentence):
    tokens = sentence.split()
    return " ".join(tokens[:-1]), tokens[-1]

which returns a tuple:

('Hi... I earn 7 dot 50 per hour i.e $7.50/hr.', 'Positive')

Upvotes: 0
