Reputation: 7404
I'm cleaning some text from Reddit. When you include a link in a Reddit self-text, you do so like this: [the text you read](https://website.com/to/go/to). I'd like to use regex to remove the hyperlink (e.g. https://website.com/to/go/to) but keep the text you read.
Here is another example:
[the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts)
I'd like to keep: the podcast list.
How can I do this with Python's re library? What is the appropriate regex?
Upvotes: 6
Views: 2440
Reputation: 394
I have created an initial attempt at your requested regex:
(?<=\[.+\])\(.+\)
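The first part, (?<=...), is a lookbehind: it requires the bracketed text to appear immediately before the match, but does not include it in the match. You can use this regex with a substitution function such as re's sub method. Be aware that Python's built-in re module only accepts fixed-width lookbehinds, so this pattern (which has .+ inside the lookbehind) raises an error there; it needs an engine that supports variable-width lookbehind, such as the third-party regex module. A regex reference lists the meanings of all the symbols.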
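As a quick check, the sketch below tries to compile the lookbehind pattern with Python's built-in re module. The standard library only supports fixed-width lookbehinds, so compilation is expected to fail with re.error. The variable names are illustrative only.

```python
import re

# The lookbehind contains ".+", which can match text of any length.
pattern = r"(?<=\[.+\])\(.+\)"

try:
    re.compile(pattern)
    lookbehind_supported = True
    error_message = ""
except re.error as exc:
    # The built-in re module rejects variable-width lookbehinds.
    lookbehind_supported = False
    error_message = str(exc)

print(lookbehind_supported, error_message)
```

If this pattern is needed as written, an engine with variable-width lookbehind support would have to be used instead of the built-in module.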
You can extend the above regex to match only parentheses that contain a web link, like so:
(?<=\[.+\])\(https?:\/\/.+\)
The problem with this is that if the link does not start with http or https, the pattern will not match it.
After this you will still need to remove the square brackets. Simply removing all square brackets may work fine for you.
Valentino pointed out that re.sub accepts capturing groups, which lets you capture the link text and substitute it back in using the following regex:
\[(.+)\]\(.+\)
You can then substitute the first captured group (in the square brackets) back in using:
re.sub(r"\[(.+)\]\(.+\)", r"\1", original_text)
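One caveat: .+ is greedy, so when a line contains more than one link the pattern above can match from the first [ to the last ). A minimal sketch of one way around this, assuming link text never contains ] and URLs never contain ), is to use negated character classes. The function name strip_markdown_links and the sample string are made up for illustration.

```python
import re

# [^\]]+ stops at the first closing bracket and [^)]+ at the first closing
# parenthesis, so each link is matched on its own.
# Assumption: link text has no "]" and URLs have no ")".
LINK_PATTERN = re.compile(r"\[([^\]]+)\]\([^)]+\)")


def strip_markdown_links(text):
    """Replace every [label](url) with just label."""
    return LINK_PATTERN.sub(r"\1", text)


sample = (
    "See [the podcast list](https://www.reddit.com/r/datascience/wiki/podcasts) "
    "and [the text you read](https://website.com/to/go/to)."
)

cleaned = strip_markdown_links(sample)
# For comparison: the greedy pattern treats both links as one match.
greedy = re.sub(r"\[(.+)\]\(.+\)", r"\1", sample)

print(cleaned)  # See the podcast list and the text you read.
```

URLs that themselves contain a closing parenthesis would be cut short by this pattern, so it is a starting point rather than a complete Markdown parser.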
If you want to look at the regex in more detail (if you're new to regex or want to learn what the symbols mean), I would recommend an online regex interpreter. These tools explain what each symbol does, which makes a pattern much easier to read, especially when it has as many escaped symbols as the ones here.
Upvotes: 8