zihan meng

Reputation: 125

How to remove characters with special strings using regular expression in Python?

I am trying to clean up a log file and want to remove some special strings from the start of each line.

Example:

%/h >  %/h Current value over threshold value
Pg/S >  Pg/S Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value

I have tried a few patterns, but none of them work. For example:

re.sub(r'\w\w\/\s>\s\w','',text)

Is there a good way to remove this pattern?

I want to remove the .../... > .../... part at the start of each line.

I expect my output to contain only the useful words:

   Current value over threshold value
   No. of pages paged in exceeds threshold
   min. avg. value over threshold value

Thank you for any idea!

Upvotes: 1

Views: 150

Answers (3)

Ibrahim

Reputation: 6098

This is a relatively long regex, but it gets the job done.

[%\w][\/\w]\/?[\/\s\w]\s?\<?\>?\s\s[\w%]\/?[a-zA-Z%]\/?[\w]?\s\s?\s?

Demo: https://regex101.com/r/ayh19b/4

Or you can do something like:

^[\s\S]*?(?=\w\w(?:\w|\.))

Demo: https://regex101.com/r/ayh19b/6
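
In Python, the second pattern could be applied roughly as follows. This is only a sketch: it runs the regex on one line at a time, and the helper name strip_prefix is made up for illustration. Keep in mind that the pattern stops at the first run of two word characters followed by a third word character or a dot, so a unit of three or more letters at the start of a line would end the match too early.

```python
import re

# Second pattern from the answer above: lazily consume everything from the
# start of the string up to the first "word-like" run.
PREFIX = re.compile(r'^[\s\S]*?(?=\w\w(?:\w|\.))')


def strip_prefix(line):
    """Hypothetical helper: drop the leading unit/comparator part of one line."""
    # Lines in which the lookahead never succeeds are returned unchanged.
    return PREFIX.sub('', line, count=1)


lines = [
    "%/h >  %/h Current value over threshold value",
    "Pg/S >  Pg/S  No. of pages paged in exceeds threshold",
    "MB <  MB   min. avg. value over threshold value",
]

cleaned = [strip_prefix(line) for line in lines]
for line in cleaned:
    print(line)
```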

Upvotes: 1

ekhumoro

Reputation: 120768

Assuming the structure of the file is:

[special-string] [< or >] [special-string] [message]

then this should work:

>>> import re
>>> rgx = re.compile(r'^[^<>]+[<>] +\S+ +', re.M)
>>>
>>> s = """
... %/h >  %/h Current value over threshold value
... Pg/S >  Pg/S Current value over threshold value
... Pg/S >  Pg/S  No. of pages paged in exceeds threshold
... MB <  MB   min. avg. value over threshold value
... """
>>>
>>> print(rgx.sub('', s))
Current value over threshold value
Current value over threshold value
No. of pages paged in exceeds threshold
min. avg. value over threshold value
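
One thing to watch: [^<>]+ also matches newlines, so a line that contains no < or > can be swallowed together with the prefix of the next line. If the log may contain such lines, a variant that excludes \n from the character class keeps each match on a single line. The sketch below is an assumption about what the log might look like, not part of the original data, and the function name strip_comparator_prefix is invented for the example.

```python
import re

# Same idea as the pattern above, but "\n" is excluded from the first
# character class so a match can never cross a line boundary.
PREFIX_RE = re.compile(r'^[^<>\n]+[<>] +\S+ +', re.M)


def strip_comparator_prefix(text):
    """Hypothetical helper: remove '<unit> <|> <unit> ' from each line."""
    return PREFIX_RE.sub('', text)


log = (
    "%/h >  %/h Current value over threshold value\n"
    "plain line without a comparator\n"
    "MB <  MB   min. avg. value over threshold value"
)

print(strip_comparator_prefix(log))
```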

Upvotes: 3

idjaw

Reputation: 26600

Based on the pattern you are trying to match, it seems you always know where the unwanted part is positioned. You can do this without a regex: use split and slicing to get the section of interest, then use join to turn it back into a string for your final result.

The code below does the following:

s.split() - splits on whitespace, creating a list in which each word is one entry

[3:] - slices the list, keeping everything from the fourth element onward (index 3, since indexing starts at 0)

' '.join() - converts the list back to a string, placing a single space between elements

Demo:

s = "%/h >  %/h Current value over threshold value"
res = ' '.join(s.split()[3:])
print(res)

Output:

Current value over threshold value
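
For a whole log rather than a single string, the same idea could be applied line by line. This is a minimal sketch that assumes every non-blank line starts with exactly three whitespace-separated tokens (unit, comparator, unit); the function name clean_log is made up for the example. Note that split() followed by ' '.join() also collapses any runs of spaces inside the message.

```python
def clean_log(text):
    """Hypothetical helper: drop the first three tokens of every non-blank line."""
    cleaned = []
    for line in text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # Tokens 0-2 are assumed to be: unit, comparator, unit.
        cleaned.append(' '.join(line.split()[3:]))
    return cleaned


log = """
%/h >  %/h Current value over threshold value
Pg/S >  Pg/S  No. of pages paged in exceeds threshold
MB <  MB   min. avg. value over threshold value
"""

for message in clean_log(log):
    print(message)
```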

Upvotes: 3
