Reputation: 51
I have a string in the following format.
Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.
How can I delete Test 2, Test 3 and so on, so that the string looks like this?
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
I have tried:
test1 = re.compile(r'^Test \d ')
test2 = re.compile(r'^Test \d\d ')
text = re.sub(test1, '', text)
text = re.sub(test2, '', text)
But it didn't work.
Upvotes: 3
Views: 109
Reputation: 133428
Based on the samples you showed, please try the following. It also works when a line starts with more than one occurrence of Test followed by digits.
import re
var="""Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit."""
print(re.sub(r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*', '', var, flags=re.M))
Explanation: This uses the re.sub function from Python's re library. The regex passed to it matches the unwanted prefix, and each match in var (the variable) is replaced with an empty string.
Explanation of regex:
^(Test\s+\d+) ##From the start of each line (because of re.M), match Test followed by 1 or more whitespace characters followed by 1 or more digits.
(\s+Test\s+\d+)* ##Match 1 or more whitespace characters followed by Test, 1 or more whitespace characters, and 1 or more digits. This whole group may occur 0 or more times.
\s* ##Match 0 or more trailing whitespace characters.
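As a rough illustration of the explanation above, the following sketch runs the same pattern on a small made-up sample (the sample string and the names PREFIX, sample and cleaned are assumptions for illustration, not part of the question). It shows that a repeated prefix on one line is removed as a whole and that lines without a prefix are left alone.

```python
import re

# Pattern from this answer: one "Test <digits>" prefix, optionally followed
# by further such prefixes, then any trailing whitespace.
PREFIX = re.compile(r'^(Test\s+\d+)(\s+Test\s+\d+)*\s*', flags=re.M)

# Hypothetical sample: a single prefix, a repeated prefix, and no prefix.
sample = "Test 2 Lorem ipsum\nTest 3 Test 14 dolor sit\nno prefix here"

cleaned = PREFIX.sub('', sample)
print(cleaned)
```

Note that \s also matches newlines, so a line consisting only of the prefix would be joined with the following line. If that matters for your data, replacing \s with [ \t] in the pattern is one way to avoid it.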
Upvotes: 3
Reputation: 23142
Assuming that you have a single multi-line string, then
test1 = re.compile(r'^Test \d ')
text = re.sub(test1, '', text)
does in fact remove Test 2 from the first line of the string, but does not change any of the other lines, because ^ matches the beginning of the whole string, not the beginning of each line.
You can change that by using the re.M flag. From the documentation:
When specified, the pattern character '^' matches at the beginning of the string and at the beginning of each line
>>> test1 = re.compile(r'^Test \d ', flags=re.M)
>>> text = '''\
... Test 2 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 3 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 4 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... Test 5 Lorem ipsum dolor sit amet consectetur adipisicing elit.
... '''
>>> print(re.sub(test1, '', text))
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Lorem ipsum dolor sit amet consectetur adipisicing elit.
Alternatively, split the string into lines and apply your original pattern without re.M to each line separately:
>>> test1 = re.compile(r'^Test \d ')
>>> [re.sub(test1, '', line) for line in text.splitlines()]
['Lorem ipsum dolor sit amet consectetur adipisicing elit.',
'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
'Lorem ipsum dolor sit amet consectetur adipisicing elit.',
'Lorem ipsum dolor sit amet consectetur adipisicing elit.']
Depending on whether you want to continue processing the text as a whole, or each line separately (or maybe you already have each line separately as input to your program), one or the other option may be more practical.
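If you process line by line but still need a single string at the end, the pieces can be joined again. The sketch below, using a shortened made-up sample (the names whole and per_line are illustrative assumptions), suggests that both options give the same result for input without a trailing newline.

```python
import re

# Hypothetical shortened sample with no trailing newline.
text = "Test 2 Lorem ipsum.\nTest 3 dolor sit.\nTest 4 amet."

# Option 1: process the whole string, with re.M so '^' matches every line start.
whole = re.sub(r'^Test \d ', '', text, flags=re.M)

# Option 2: process each line separately without re.M, then join the lines again.
per_line = "\n".join(re.sub(r'^Test \d ', '', line) for line in text.splitlines())

print(whole == per_line)
```

Keep in mind that splitlines() drops a trailing newline, so for input that ends with one the two results would differ by that final newline.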
The test1 pattern works only for single-digit numbers after 'Test ' and the test2 pattern works only for two-digit numbers. To make it work for any number of digits, change \d or \d\d to \d+.
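Putting both suggestions together, a minimal sketch could look like this. The sample text with one-, two- and three-digit numbers is made up for illustration, as are the names pattern and result.

```python
import re

# re.M makes '^' match at the start of every line;
# \d+ matches numbers with any number of digits.
pattern = re.compile(r'^Test \d+ ', flags=re.M)

# Hypothetical sample with numbers of different lengths.
text = (
    "Test 2 Lorem ipsum dolor sit amet.\n"
    "Test 10 Lorem ipsum dolor sit amet.\n"
    "Test 123 Lorem ipsum dolor sit amet.\n"
)

result = pattern.sub('', text)
print(result)
```

With this single pattern the separate test1 and test2 patterns from the question should no longer be needed.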
Upvotes: 3