Reputation: 282
I have a file in which every line has the format title - news_source. I want to replace everything after the title with a single space (whitespace).
So far I only have the pattern \s-\s, but I don't know what pattern to write for the news_source part.
Can somebody guide me through writing the regex for news_source? Thanks!
Upvotes: 0
Views: 80
Reputation: 163632
You can match \s-\s.* and replace the match with an empty string (or with a single space, if you want to keep one after the title).
Note that \s can also match a newline. If you want to match whitespace characters without newlines, you can use [^\S\r\n]-[^\S\r\n].* instead.
import re
s = ("title - news_source\n"
     "Airbnb stock has 15% upside after an 'impressive' earning report, says BofA - Business Insider")
# Replace " - news_source" on each line with a single space
result = re.sub(r"\s-\s.*", " ", s)
print(result)
Output
title
Airbnb stock has 15% upside after an 'impressive' earning report, says BofA
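To make the difference between \s and [^\S\r\n] concrete, here is a small sketch. The input string is made up for illustration: it contains a stray - on its own line, so the only whitespace around it is newlines.

```python
import re

# Hypothetical input: the "-" on the second line has only newlines around it
s = "first title\n-\nsecond title - Some Source"

# \s also matches a newline, so the match already starts at the end of line 1
loose = re.sub(r"\s-\s.*", "", s)

# [^\S\r\n] means "whitespace, but not \r or \n", so only the real separator matches
strict = re.sub(r"[^\S\r\n]-[^\S\r\n].*", "", s)

print(repr(loose))
print(repr(strict))
```

With \s the second and third lines are swallowed, while the stricter class only removes the source on the last line.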
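If there should be at least one non-whitespace character (\S) at the start of the string, you can use a capture group and use the group followed by a space in the replacement. If s contains several lines, pass re.M so that ^ matches at the start of every line, not only at the start of the string.
re.sub(r"^(\S.*)[^\S\r\n]-[^\S\r\n].*", r"\1 ", s, flags=re.M)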
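As a rough illustration of what the leading \S buys you, the sketch below compares the anchored pattern with the plain one. The second input line is an invented edge case that starts with whitespace and has no title at all.

```python
import re

# Hypothetical input: line 2 starts with a space and has no title
s = "title - news_source\n - orphan source"

anchored = r"^(\S.*)[^\S\r\n]-[^\S\r\n].*"
plain = r"[^\S\r\n]-[^\S\r\n].*"

# re.M lets ^ match at the start of each line
with_group = re.sub(anchored, r"\1 ", s, flags=re.M)
without_group = re.sub(plain, " ", s)

print(repr(with_group))
print(repr(without_group))
```

The anchored version leaves the title-less line alone, whereas the plain pattern reduces it to a single space.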
Upvotes: 2
Reputation: 134
If you only want to match news_source, you can do the following:
\w+_\w+
This regex matches any string that contains at least one word character (a letter, a digit or an underscore), followed by an underscore, followed by at least one more word character.
I guess, however, that the source will not always contain an underscore. If you want to match whatever follows the '-' but only capture the part after the space, you can create a capture group:
\-\s(\w+)
This matches the -, one whitespace character and the word characters that follow, and captures those word characters (there has to be at least one).
In your case it would match - news_source and capture news_source.
But if it were a more complicated string such as Title - new source _ with : some , very weird "format" and you really want to get everything after the -, then you would use:
\-\s(.+)
which would capture new source _ with : some , very weird "format".
Here the . matches any character except a line break (in Python, that means anything except \n unless re.DOTALL is set).
I'm not sure what exactly you are using to evaluate regular expressions in Python, but you should check how to extract the capture group from a match.
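In case it helps, this is roughly how pulling out a capture group looks with the standard re module. It is a minimal sketch; the variable names are just for illustration.

```python
import re

line = 'Title - new source _ with : some , very weird "format"'

# re.search returns a match object, or None when nothing matches
match = re.search(r"-\s(.+)", line)
source = match.group(1) if match else None

# With (\w+) only the first run of word characters is captured
first_word = re.search(r"-\s(\w+)", line).group(1)

print(source)
print(first_word)
```

group(0) would give the whole match including the -, while group(1) gives only the part inside the parentheses.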
After your reaction, I now see that you simply want to get rid of the source. That's my bad!
In that case:
(.+)\s-
This will capture the title (everything before the -).
I hope the explanations of the expressions above are enough to understand what this one does. In short: it captures everything before the whitespace-and-hyphen pattern \s-.
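A small sketch of using (.+)\s- to keep only the title. The helper name extract_title is made up for this example, and returning the line unchanged when there is no separator is my assumption about what you would want.

```python
import re

def extract_title(line):
    """Hypothetical helper: return the part of the line before the last ' -'."""
    m = re.match(r"(.+)\s-", line)
    # Assumption: lines without a separator are returned unchanged
    return m.group(1) if m else line

print(extract_title("title - news_source"))
print(extract_title("A - B - Some Source"))
print(extract_title("no separator here"))
```

Because .+ is greedy, a title that itself contains " - " is captured up to the last separator, so the second call returns A - B.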
I will leave the rest of the examples in here as well unless people want me to remove them for clarity.
Upvotes: 2