I have strings that are of the form below:
<p>The is a string.</p>
<em>This is another string.</em>
They are read in from a text file one line at a time. I want to separate these into words. For that I am just splitting the string using split().
Now I have a set of words, but the first word will be <p>The rather than The. The same goes for the other words that have <> next to them. I want to remove the <..> from the words.
I'd like to do this in one line. What I mean is I want to pass as a parameter something of the form <*> like I would on the command line. I was thinking of using the replace() function to try to do this, but I am not sure what the replace() function parameter would look like.
For example, how could I change <..> below so that it matches anything between < and >:
x = x.replace("<..>", "")
Upvotes: 0
Views: 1532
You don't need to (1) split and then (2) replace. Each of the two solutions below does it in a single step.
Match All and Split are Two Sides of the Same Coin, and in this case it is safer to match all:
<[^>]+>|(\w+)
The words will be in Group 1.
Use it like this:
import re

subject = '<p>The is a string.</p><em>This is another string.</em>'
regex = re.compile(r'<[^>]+>|(\w+)')
# Tag matches leave Group 1 empty, so drop the empty strings
matches = [group for group in regex.findall(subject) if group]
print(matches)
Output
['The', 'is', 'a', 'string', 'This', 'is', 'another', 'string']
Discussion
This problem is a classic case of the technique explained in this question to "regex-match a pattern, excluding..."
The left side of the alternation | matches complete <tags>. We will ignore these matches. The right side matches and captures words to Group 1, and we know they are the right ones because they were not matched by the expression on the left.
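To see why the empty groups have to be filtered out, here is a minimal sketch that prints what findall returns before filtering. The sample line comes from the question; the names raw and words are only illustrative.

```python
import re

subject = '<p>The is a string.</p>'

# With one capture group, findall returns Group 1 for every match.
# When the left branch (a tag) matches, Group 1 did not participate,
# so findall reports it as an empty string.
raw = re.findall(r'<[^>]+>|(\w+)', subject)
print(raw)    # ['', 'The', 'is', 'a', 'string', '']

# Dropping the empty strings leaves only the words.
words = [w for w in raw if w]
print(words)  # ['The', 'is', 'a', 'string']
```

The period is matched by neither branch, so it is skipped without producing an entry.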
The second solution splits instead, using this regex:
<[^>]+>|[ .]
On the left side of the |, we use <complete tags> as a split delimiter. On the right side, we use a space character or a period.
Output
This
is
a
string
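A minimal sketch of the split approach, assuming the subject is the single line '<p>This is a string.</p>' (the subject and the names pieces and words are illustrative, not taken from the answer):

```python
import re

# Assumed sample input for illustration
subject = '<p>This is a string.</p>'

# Split on complete tags, spaces, and periods
pieces = re.split(r'<[^>]+>|[ .]', subject)

# Adjacent delimiters (and delimiters at the ends) produce empty
# strings, so filter them out
words = [p for p in pieces if p]

for word in words:
    print(word)
```

Note that re.split leaves empty strings wherever two delimiters touch, which is one reason the match-all version can be the safer choice.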
Upvotes: 2
Unfortunately, str.replace does not support regex patterns. You need to use re.sub for this:
>>> from re import sub
>>> sub("<[^>]*>", "", "<p>The is a string.</p>")
'The is a string.'
>>> sub("<[^>]*>", "", "<em>This is another string.</em>")
'This is another string.'
>>>
[^>]* matches zero or more characters that are not >.
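Combined with split(), this could look roughly like the sketch below. The helper name words_from_line and the hard-coded list of lines are assumptions for illustration; in practice the lines would come from the text file.

```python
import re

# Compile once, since the same pattern is applied to every line
TAG = re.compile(r"<[^>]*>")

def words_from_line(line):
    """Remove anything between < and >, then split on whitespace."""
    return TAG.sub("", line).split()

# Stand-in for lines read from the file
lines = [
    "<p>The is a string.</p>",
    "<em>This is another string.</em>",
]

for line in lines:
    print(words_from_line(line))
```

Because only the tags are removed, trailing punctuation stays attached to the last word (for example 'string.').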
Upvotes: 3