Reputation: 33233
I have a bunch of long strings, so I am looking for an efficient way to do this operation. Suppose I have a string like
"< stuff to remove> get this stuff <stuff to remove>"
and I am trying to extract "get this stuff".
So far I have written something like this:
strt_pos = 0
end_pos = 0
while True:
    strt_idx = string.find(start_point, strt_pos)  # start_point = "<" in our example
    end_idx = string.find(end_point, end_pos)  # end_point = ">" in our example
    chunk_to_remove = string[strt_idx:end_idx]
    # Now how do I chop this part off from the string?
    strt_pos = strt_pos + 1
    end_pos = end_pos + 1
    if strt_pos >= len(string):  # or maybe end_pos >= len(string)
        break
What is a better way to implement this?
Upvotes: 1
Views: 3425
Reputation: 208465
Regular expressions would be a simple way to do this (although not necessarily faster as shown by jedwards' answer):
import re
s = '< stuff to remove> get this stuff <stuff to remove>'
s = re.sub(r'<[^>]*>', '', s)
After this, s would be the string ' get this stuff '.
Upvotes: 2
Reputation: 30210
If you have the starting and ending index of the string, you could do something like:
substring = string[s_ind:e_ind]
where s_ind is the index of the first character you want to include in the string and e_ind is the index of the first character you don't want in the string.
For example:
string = "Long string of which I only want a small part"
#         012345678901234567890123456789012345678901234
#         0         1         2         3
substring = string[21:32]
print(substring)
prints I only want
You could find the indices in the same manner you are now.
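As a minimal sketch of finding those indices with str.find() and then slicing, the helper below (the name extract_between and its parameters are hypothetical, not from the question) also guards against the -1 that find() returns when a delimiter is missing:

```python
def extract_between(text, open_end=">", next_open="<"):
    """Return the text between the first `open_end` and the next `next_open`.

    Returns None when either delimiter is missing, because str.find()
    reports "not found" as -1, which would otherwise slice silently.
    """
    s_ind = text.find(open_end)
    if s_ind == -1:
        return None
    s_ind += len(open_end)               # index of the first character to keep
    e_ind = text.find(next_open, s_ind)  # index of the first character to drop
    if e_ind == -1:
        return None
    return text[s_ind:e_ind]


print(repr(extract_between("<stuff to remove> get this stuff <stuff to remove>")))
# -> ' get this stuff '
```

Without the -1 checks, a missing delimiter would not raise an error; the slice would just return the wrong part of the string.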
Edit: Regarding efficiency, this type of solution is actually more efficient than the regex solution. The reason is there is a lot of overhead involved in regular expressions that you don't necessarily need.
I encourage you to test these things for yourself instead of blindly going on what people claim is most efficient.
Consider the following test program:
#!/usr/bin/env python
import re
import time

def inner_regex(s):
    # Remove every <...> chunk with an (uncompiled) regular expression.
    return re.sub(r'<[^>]*>', '', s)

def inner_substr(s):
    # Keep what lies between the first '>' and the next '<'.
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]

s = '<stuff to remove> get this stuff <stuff to remove>'

# Time 100000 calls of the regex version.
tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()
print("Regex: %f" % (tr2 - tr1))

# Time 100000 calls of the find-and-slice version.
ts1 = time.time()
for i in range(100000):
    s2 = inner_substr(s)
ts2 = time.time()
print("Substring: %f" % (ts2 - ts1))
The output is:
Regex: 0.511443
Substring: 0.148062
In other words, the regex approach is more than 3x slower than your original approach once that approach is corrected.
Edit: Regarding the comment about compiled regex, it is faster than uncompiled regex, but still slower than the explicit substring:
#!/usr/bin/env python
import re
import time

def inner_regex(s):
    # Uncompiled: the pattern is looked up by re.sub on every call.
    return re.sub(r'<[^>]*>', '', s)

def inner_regex_compiled(s, r):
    # Compiled: r is a pattern object created once by re.compile.
    return r.sub('', s)

def inner_substr(s):
    # Keep what lies between the first '>' and the next '<'.
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]

s = '<stuff to remove> get this stuff <stuff to remove>'

tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()

# The one-time compile step is included in the timed region.
tc1 = time.time()
r = re.compile(r'<[^>]*>')
for i in range(100000):
    s2 = inner_regex_compiled(s, r)
tc2 = time.time()

ts1 = time.time()
for i in range(100000):
    s3 = inner_substr(s)
ts2 = time.time()

print("Regex: %f" % (tr2 - tr1))
print("Regex Compiled: %f" % (tc2 - tc1))
print("Substring: %f" % (ts2 - ts1))
Returns:
Regex: 0.512799 # >3 times slower
Regex Compiled: 0.297863 # ~2 times slower
Substring: 0.144910
Moral of the story: While regular expressions are a helpful tool to have in the toolbox, they're simply not as efficient as more straightforward ways when available.
And don't take people's word for things that you can easily test yourself.
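If you want to repeat the comparison yourself, the standard library's timeit module is one way to do it. The following is only a sketch: the helper names (with_regex, with_find, best_time) and the call counts are my own choices, and the numbers it prints will differ by machine and Python version.

```python
import re
import timeit

SAMPLE = "<stuff to remove> get this stuff <stuff to remove>"
TAG = re.compile(r"<[^>]*>")


def with_regex(s):
    # Remove every <...> chunk with a precompiled pattern.
    return TAG.sub("", s)


def with_find(s):
    # Keep what lies between the first '>' and the next '<'.
    s_ind = s.find(">") + 1
    e_ind = s.find("<", s_ind)
    return s[s_ind:e_ind]


def best_time(func, number=10000, repeat=3):
    """Best-of-`repeat` total seconds for `number` calls of func(SAMPLE)."""
    return min(timeit.repeat(lambda: func(SAMPLE), number=number, repeat=repeat))


# Both approaches should agree on the sample before they are timed.
assert with_regex(SAMPLE) == with_find(SAMPLE)

for name, func in (("regex (compiled)", with_regex), ("find + slice", with_find)):
    print("%-18s %.4f s" % (name, best_time(func)))
```

Taking the minimum of several repeats reduces noise from other processes, which a single time.time() measurement cannot do.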
Upvotes: 0
Reputation: 4868
I'm not sure whether the search operation you're doing is part of the question. If you're just saying that you have a start index and an end index and you want to remove those characters from a string, you don't need a special function for that. Python lets you use numeric indices for the characters in strings.
>>> x = "abcdefg"
>>> x[1:3]
'bc'
The operation you want to perform would be something like x[:strt_idx] + x[end_idx:]. (If you omit the first index of a slice it means "start from the beginning", and if you omit the second one it means "continue to the end".)
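To remove every delimited chunk rather than just one, the same slicing idea can be put in a loop. This is a sketch under my own assumptions: the function name remove_delimited is hypothetical, the parameter names are borrowed from the question, and the slice resumes one position past the closing delimiter so that the ">" itself is dropped as well.

```python
def remove_delimited(text, start_point="<", end_point=">"):
    """Return `text` with every start_point...end_point chunk removed."""
    pieces = []
    pos = 0
    while True:
        strt_idx = text.find(start_point, pos)
        if strt_idx == -1:
            break  # no more opening delimiters
        end_idx = text.find(end_point, strt_idx + len(start_point))
        if end_idx == -1:
            break  # unmatched opener: leave the rest of the text unchanged
        pieces.append(text[pos:strt_idx])  # keep the text before the chunk
        pos = end_idx + len(end_point)     # resume after the closing delimiter
    pieces.append(text[pos:])              # keep whatever follows the last chunk
    return "".join(pieces)


print(repr(remove_delimited("< stuff to remove> get this stuff <stuff to remove>")))
# -> ' get this stuff '
```

Collecting the kept pieces in a list and joining them once avoids rebuilding the whole string on every iteration, which matters for long strings.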
Upvotes: 2
Reputation: 52681
Use a regular expression:
>>> s = "< stuff to remove> get this stuff <stuff to remove>"
>>> import re
>>> re.sub(r'<[^<>]*>', '', s)
' get this stuff '
The expression <[^<>]*> matches strings that start with <, end with >, and have neither < nor > in between. The sub function then replaces each match with the empty string, thus deleting it.
You can then call .strip() on the result to remove the leading and trailing spaces if you want.
Of course, this will fail when you have, for example, nested tags, but it will work for your example.
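A short sketch of both points: .strip() on the flat example, and what a single pass leaves behind on a nested input. The nested string and the strip_nested helper are made up for illustration; repeating the substitution until nothing matches is one possible workaround, not something the pattern does by itself.

```python
import re

TAG = re.compile(r"<[^<>]*>")

flat = "< stuff to remove> get this stuff <stuff to remove>"
print(repr(TAG.sub("", flat).strip()))
# -> 'get this stuff'

# A single pass only removes the innermost tag of a nested pair.
nested = "<outer <inner> outer> get this stuff"
print(repr(TAG.sub("", nested)))
# -> '<outer  outer> get this stuff'


def strip_nested(text):
    """Repeat the substitution until no innermost tag is left."""
    while True:
        text, count = TAG.subn("", text)
        if count == 0:
            return text


print(repr(strip_nested(nested)))
# -> ' get this stuff'
```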
Upvotes: 2