Reputation: 149
For a project of mine, I am parsing Reddit comments and trying to make them uppercase. This works perfectly well with plain text, but links cause bugs: the characters in the URL are uppercased too, which makes the link unusable. Links on Reddit have a very simple syntax, namely [text](url). I would like to output [TEXT](url), knowing that my comment will most probably contain other content as well.
I tried using regexes. This doesn't work, although I still think I'm pretty close:
import re
def make_upper(comment):
    return re.sub(r"\[([^ ]*)\]\(([^ ]*)\)", r"[\1]".upper() + r"(\2)", comment)
That regex is a bit messy (like all regexes ;) but still readable to me and I wouldn't mind explaining it if need be.
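A quick check on the attempt above, as a minimal sketch, suggests where it goes wrong: .upper() is applied to the replacement template itself, not to the text the group later captures. The names make_upper_attempt and template_unchanged are made up for this illustration.

```python
import re

def make_upper_attempt(comment):
    # Same re.sub call as in the attempt above.
    return re.sub(r"\[([^ ]*)\]\(([^ ]*)\)", r"[\1]".upper() + r"(\2)", comment)

# .upper() acts on the template text "[\1]", which contains no letters,
# so the template is unchanged and the captured text is copied as is.
template_unchanged = r"[\1]".upper() == r"[\1]"

sample = "blablabla (btw, blargh!) [text](url) blabla"
result = make_upper_attempt(sample)
print(template_unchanged)  # True
print(result == sample)    # True: nothing was uppercased
```

So the substitution runs, but it rewrites each link with exactly the same text, and the rest of the comment is never touched.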
Input:
"blablabla (btw, blargh!) [text](url) blabla"
Desired output:
"BLABLABLA (BTW, BLARGH!) [TEXT](url) BLABLA"
Upvotes: 4
Views: 770
Reputation: 13619
The other answers rely on a global search and replace. That approach breaks when the URL of a link also appears elsewhere in the text, because every occurrence gets replaced, not just the one inside the link. A safer way to do it is to construct the new string as you go along.
For example, while this isn't elegant, it does the trick:
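import re

# Groups: (1) text before the link, (2) "[text]", (3) "(url)", (4) the rest.
# re.DOTALL lets "." match newlines, so links in multi-line comments are found too.
MARKDOWN_LINK_REGEX = re.compile(r'(.*?)(\[.*?\])(\(.*?\))(.*)', re.DOTALL)

def uppercase_comment(text):
    result = ""
    while True:
        match = MARKDOWN_LINK_REGEX.match(text)
        if match:
            # Uppercase everything up to and including "[text]"; keep "(url)" as is.
            result += match.group(1).upper() + match.group(2).upper()
            result += match.group(3)
            # Continue with whatever follows the link.
            text = match.group(4)
        else:
            # No more links: uppercase the remainder and stop.
            result += text.upper()
            return result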
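The same idea of building the string piece by piece can also be sketched with match positions instead of repeated matching. This is only an illustration under assumptions: the names LINK and uppercase_outside_urls are hypothetical, and the pattern is the simple non-greedy one, which does not handle nested brackets or parentheses inside a URL.

```python
import re

# Hypothetical pattern: group 1 is the link text, group 2 is the URL.
LINK = re.compile(r"\[(.*?)\]\((.*?)\)")

def uppercase_outside_urls(comment):
    parts = []
    last_end = 0
    for m in LINK.finditer(comment):
        # Plain text between the previous link and this one.
        parts.append(comment[last_end:m.start()].upper())
        # Uppercase the link text, copy the URL untouched.
        parts.append("[" + m.group(1).upper() + "](" + m.group(2) + ")")
        last_end = m.end()
    # Whatever follows the last link (or the whole comment if there is none).
    parts.append(comment[last_end:].upper())
    return "".join(parts)

print(uppercase_outside_urls("see url.com or [text](url.com) here"))
# SEE URL.COM OR [TEXT](url.com) HERE
```

Because each URL is copied from its own position rather than found again by searching, a URL that also shows up in the plain text is uppercased there and preserved only inside the link.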
Upvotes: 2
Reputation: 18467
This is the approach I took after some experimenting. I'm sure people will downvote me and let me know in comments if I've overlooked something just like I did to them.
You were almost right here, but I would make it non-greedy so that it doesn't treat two links separated by non-linked text as one.
import re
MARKDOWN_LINK_REGEX = re.compile(r'\[(.*?)\]\((.*?)\)')
Also, at least for the part in square brackets, don't assume there won't be spaces. People can link as many words as they want, even whole sentences.
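To make those two points concrete, here is a small comparison, offered as a sketch. NO_SPACES is the pattern from the question; GREEDY is a hypothetical variant included only for contrast; NON_GREEDY is the pattern used in this answer. The sample string is made up.

```python
import re

# Pattern from the question: it forbids spaces inside the brackets.
NO_SPACES = re.compile(r"\[([^ ]*)\]\(([^ ]*)\)")
# Greedy variant that allows spaces (hypothetical, for comparison only).
GREEDY = re.compile(r"\[(.*)\]\((.*)\)")
# Non-greedy pattern used in this answer.
NON_GREEDY = re.compile(r"\[(.*?)\]\((.*?)\)")

sample = "[a few](Google.com) links [in](StackOverflow.com) here"

no_spaces_matches = NO_SPACES.findall(sample)
greedy_matches = GREEDY.findall(sample)
non_greedy_matches = NON_GREEDY.findall(sample)

print(no_spaces_matches)   # [('in', 'StackOverflow.com')]
print(greedy_matches)      # [('a few](Google.com) links [in', 'StackOverflow.com')]
print(non_greedy_matches)  # [('a few', 'Google.com'), ('in', 'StackOverflow.com')]
```

The first pattern misses the link whose text contains a space, the greedy one merges both links into a single match, and only the non-greedy one finds the two links separately.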
def uppercase_comment(comment):
    # Uppercase everything first, then restore the original URLs.
    result = comment.upper()
    for match in re.finditer(MARKDOWN_LINK_REGEX, comment):
        upper_match = match.group(0).upper()
        # Swap the uppercased URL for the original, case-preserved one.
        corrected_link = upper_match.replace(match.group(2).upper(),
                                             match.group(2))
        result = result.replace(upper_match, corrected_link)
    return result
First, make the whole comment uppercase, then make corrections. For the corrections, I loop over all the non-overlapping matches the regex above produces (re.finditer), fix each link by restoring the original text in parentheses, then apply that replacement to the uppercased string.
As was pointed out in the comments, simply lowercasing a link will not always work: the original case must be preserved. That is why the function loops over matches in the original comment (and thus the original case) to generate each replacement, which is then applied to the uppercase string created at the beginning.
if __name__ == '__main__':
    print(uppercase_comment('this comment has zero links'))
    print(uppercase_comment('this comment has [a link at the end](Google.com)'))
    print(uppercase_comment('[there is a link](Google.com) at the beginning'))
    print(uppercase_comment('there is [a link](Google.com) in the middle'))
    print(uppercase_comment('there are [a few](Google.com) links [in](StackOverflow.com) this one'))
    print(uppercase_comment("doesn't matter (it shouldn't!) if there are [extra parens](Google.com)"))
Produces
THIS COMMENT HAS ZERO LINKS
THIS COMMENT HAS [A LINK AT THE END](Google.com)
[THERE IS A LINK](Google.com) AT THE BEGINNING
THERE IS [A LINK](Google.com) IN THE MIDDLE
THERE ARE [A FEW](Google.com) LINKS [IN](StackOverflow.com) THIS ONE
DOESN'T MATTER (IT SHOULDN'T!) IF THERE ARE [EXTRA PARENS](Google.com)
Please comment with feedback and fixes. Note mine is in Python 3 but the only 3-specific thing I'm using is the print function.
Upvotes: 2
Reputation: 149
Thanks to all of you. Here's a combined version of all your suggestions that does everything I want it to:
def make_upper(comment):
    native_links = re.findall(r"\[.*?\]\((.*?)\)", comment)
    result = comment.upper()
    for url in native_links:
        result = result.replace(url.upper(), url)
    return result
I'm not sure it's the fastest way, but in the final algorithm this code shouldn't run too often, so that'll do. Thanks again!
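One caveat, raised in another answer, may be worth checking before relying on this version: the replace call restores every occurrence of the URL, not only the one inside the link. A minimal sketch with a made-up input reproduces it; the function body is the combined version, copied here so the example is self-contained.

```python
import re

def make_upper(comment):
    # Combined version from above, reproduced unchanged.
    native_links = re.findall(r"\[.*?\]\((.*?)\)", comment)
    result = comment.upper()
    for url in native_links:
        result = result.replace(url.upper(), url)
    return result

# The URL text also appears outside the link (made-up sample input).
sample = "visit url.com or [text](url.com)"
result = make_upper(sample)
print(result)  # VISIT url.com OR [TEXT](url.com)
```

The plain-text "url.com" stays lowercase even though it is not part of a link. If comments like that are unlikely in practice, this may not matter.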
Upvotes: 2