Reputation: 33233

removing string based on start index and end index

I have a bunch of long strings, so I am looking for an efficient way to do this operation. Suppose I have a string like:

 "< stuff to remove> get this stuff <stuff to remove>"

So, I am trying to extract "get this stuff"

So I am writing something like this.

 strt_pos = 0
 end_pos = 0
 while True:
     strt_idx = string.find(start_point, strt_pos)  # start_point = "<" in our example
     end_idx = string.find(end_point, end_pos)      # end_point = ">" in our example
     chunk_to_remove = string[strt_idx:end_idx]
     # Now how do I chop this part off from the string??
     strt_pos = strt_pos + 1
     end_pos = end_pos + 1
     if strt_pos >= len(string):  # or maybe end_pos >= len(string)
         break

What is a better way to implement this?

Upvotes: 1

Answers (4)

Andrew Clark

Reputation: 208465

Regular expressions would be a simple way to do this (although not necessarily faster as shown by jedwards' answer):

import re
s = '< stuff to remove> get this stuff <stuff to remove>'
s = re.sub(r'<[^>]*>', '', s)

After this s would be the string ' get this stuff '.

Upvotes: 2

jedwards

Reputation: 30210

If you have the starting and ending index of the string, you could do something like:

substring = string[s_ind:e_ind]

Where s_ind is the index of the first character you want to include in the string and e_ind is the index of the first character you don't want in the string.

For example

string = "Long string of which I only want a small part"
#         012345678901234567890123456789012345678901234
#         0         1         2         3         4
substring = string[21:32]
print(substring)

prints I only want

You could find the indices in the same manner you are now.
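As a minimal sketch of that idea, the helper below finds both indices with str.find and slices between them. The name extract_between and its parameters are hypothetical, not from the question. It assumes the wanted text sits between the first ">" and the next "<". Because find returns -1 when nothing is found, and a slice ending at -1 would silently drop the last character, the sketch returns None if either delimiter is missing.

```python
def extract_between(s, close_char=">", open_char="<"):
    """Return the text between the first close_char and the next open_char.

    Returns None if either delimiter is missing (hypothetical convention).
    """
    s_ind = s.find(close_char)
    if s_ind == -1:
        return None
    s_ind += 1  # first character we want to keep
    e_ind = s.find(open_char, s_ind)  # first character we do not want
    if e_ind == -1:
        return None
    return s[s_ind:e_ind]


print(repr(extract_between("< stuff to remove> get this stuff <stuff to remove>")))
```

Calling .strip() on the result removes the surrounding spaces if they are not wanted.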

Edit: Regarding efficiency, this type of solution is actually more efficient than the regex solution. The reason is there is a lot of overhead involved in regular expressions that you don't necessarily need.

I encourage you to test these things for yourself instead of blindly going on what people claim is most efficient.

Consider the following test program:

#!/usr/bin/env python

import re
import time

def inner_regex(s):
    return re.sub(r'<[^>]*>', '', s)

def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]


s = '<stuff to remove> get this stuff <stuff to remove>'

tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()
print("Regex:     %f" % (tr2 - tr1))

ts1 = time.time()
for i in range(100000):
    s2 = inner_substr(s)
ts2 = time.time()
print("Substring: %f" % (ts2 - ts1))

the output is:

Regex:     0.511443
Substring: 0.148062

In other words, the regex approach is more than 3x slower than your original approach once it is corrected.

Edit: Regarding the comment about compiled regex, it is faster than uncompiled regex, but still slower than the explicit substring:

#!/usr/bin/env python

import re
import time

def inner_regex(s):
    return re.sub(r'<[^>]*>', '', s)

def inner_regex_compiled(s,r):
    return r.sub('', s)

def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]


s = '<stuff to remove> get this stuff <stuff to remove>'


tr1 = time.time()
for i in range(100000):
    s1 = inner_regex(s)
tr2 = time.time()


tc1 = time.time()
r = re.compile(r'<[^>]*>')
for i in range(100000):
    s2 = inner_regex_compiled(s,r)
tc2 = time.time()


ts1 = time.time()
for i in range(100000):
    s3 = inner_substr(s)
ts2 = time.time()


print("Regex:          %f" % (tr2 - tr1))
print("Regex Compiled: %f" % (tc2 - tc1))
print("Substring:      %f" % (ts2 - ts1))

Returns:

Regex:          0.512799  # >3 times slower
Regex Compiled: 0.297863  # ~2 times slower
Substring:      0.144910

Moral of the story: While regular expressions are a helpful tool to have in the toolbox, they're simply not as efficient as more straightforward ways when available.

And don't take people's word for things that you can easily test yourself.
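If you want to repeat the measurement yourself, the standard library's timeit module is one option, since it handles the loop and the clock for you. The following is only a sketch under the same assumptions as the test programs in this answer (same sample string, same 100000 iterations). The module-level name TAG is mine, and the numbers you get will depend on your machine and Python version.

```python
import re
import timeit

TAG = re.compile(r'<[^>]*>')  # compiled once, outside the timed code


def inner_regex_compiled(s):
    return TAG.sub('', s)


def inner_substr(s):
    s_ind = s.find('>') + 1
    e_ind = s.find('<', s_ind)
    return s[s_ind:e_ind]


s = '<stuff to remove> get this stuff <stuff to remove>'

# repeat() returns one total time per run; the minimum is usually the least noisy
regex_time = min(timeit.repeat(lambda: inner_regex_compiled(s), number=100000, repeat=3))
substr_time = min(timeit.repeat(lambda: inner_substr(s), number=100000, repeat=3))

print("Regex Compiled: %f" % regex_time)
print("Substring:      %f" % substr_time)
```

Keep in mind that the two functions only agree on inputs shaped like this one: the regex removes every <...> span, while the substring version returns the text between the first ">" and the next "<".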

Upvotes: 0

octern

Reputation: 4868

I'm not sure whether the search operation you're doing is part of the question. If you're just saying that you have a start index and an end index and you want to remove those characters from a string, you don't need a special function for that. Python lets you use numeric indices for the characters in strings.

>>> x = "abcdefg"
>>> x[1:3]
'bc'

The operation you want to perform would be something like x[:strt_idx] + x[end_idx:]. (If you omit the first index it means "start from the beginning", and if you omit the second one it means "continue to the end".)
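Applied to the loop in the question, that slicing idea might look like the sketch below. The function name remove_spans is hypothetical. It assumes the delimiters are not nested, and it leaves an unclosed start_point in place. Since strings are immutable, it collects the pieces to keep and joins them once instead of rebuilding the string on every iteration. Note that the search resumes at end_idx + len(end_point), so the closing delimiter is dropped along with the chunk.

```python
def remove_spans(s, start_point="<", end_point=">"):
    """Remove every start_point...end_point span from s (non-nested only)."""
    kept = []
    pos = 0  # start of the next piece we want to keep
    while True:
        strt_idx = s.find(start_point, pos)
        if strt_idx == -1:
            break  # no more spans to remove
        end_idx = s.find(end_point, strt_idx + len(start_point))
        if end_idx == -1:
            break  # unclosed span: keep the rest unchanged
        kept.append(s[pos:strt_idx])
        pos = end_idx + len(end_point)  # skip past the closing delimiter
    kept.append(s[pos:])
    return "".join(kept)


print(repr(remove_spans("< stuff to remove> get this stuff <stuff to remove>")))
```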

Upvotes: 2

dhg

Reputation: 52681

Use a regular expression:

>>> s = "< stuff to remove> get this stuff <stuff to remove>"
>>> import re
>>> re.sub(r'<[^<>]*>', '', s)
' get this stuff '

The expression <[^<>]*> matches strings that start with <, end with >, and have neither < nor > in between. The sub call then replaces each match with the empty string, thus deleting it.

You can then call .strip() on the result to remove the leading and trailing spaces if you want.

Of course, this will fail when you have, for example, nested tags, but it will work for your example.
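To make the nested case concrete, here is a small sketch. A single pass only removes the innermost tag, because the outer one contains a < and so cannot match. One possible workaround, which is my assumption rather than part of the approach above, is to repeat the substitution until the string stops changing. The names TAG and strip_nested are hypothetical.

```python
import re

TAG = re.compile(r'<[^<>]*>')


def strip_nested(s):
    """Apply the substitution repeatedly until nothing more is removed."""
    while True:
        stripped = TAG.sub('', s)
        if stripped == s:
            return stripped
        s = stripped


nested = "<outer <inner> more> get this stuff"
print(repr(TAG.sub('', nested)))   # single pass: '<outer  more> get this stuff'
print(repr(strip_nested(nested)))  # repeated passes: ' get this stuff'
```

This costs one extra pass per nesting level, so it is only worth it if nested delimiters can actually occur in your data.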

Upvotes: 2
