Toctave

Reputation: 149

Making only specific parts of a string uppercase

For a project of mine, I am parsing Reddit comments and trying to make them uppercase. This works well with plain text, but it breaks links: the characters in the URL become uppercase too, which makes the link unusable. Links on Reddit have a very simple syntax, namely [text](url). I would like to output [TEXT](url), knowing that my comment will most probably contain other content as well.

I tried using a regex. It doesn't work, although I think I'm pretty close:

import re

def make_upper(comment):
    return re.sub(r"\[([^ ]*)\]\(([^ ]*)\)", r"[\1]".upper() + r"(\2)", comment)

That regex is a bit messy (like all regexes ;) but still readable to me and I wouldn't mind explaining it if need be.
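
One likely reason the attempt leaves the text unchanged: `r"[\1]".upper()` is evaluated before `re.sub` ever sees it, and uppercasing `[\1]` changes nothing because it contains no letters. A minimal sketch of the usual workaround is to pass a function as the replacement. The name `upper_link_text` is made up for this illustration, and the sketch only uppercases the link text, not the surrounding comment:

```python
import re

LINK = re.compile(r"\[([^ ]*)\]\(([^ ]*)\)")  # the regex from the attempt above

# The template is uppercased before any substitution happens,
# so the backreferences pass through untouched.
assert r"[\1]".upper() == r"[\1]"

def upper_link_text(comment):
    # A callable replacement receives each match object, so .upper()
    # is applied to the captured text rather than to the template.
    return LINK.sub(
        lambda m: "[" + m.group(1).upper() + "](" + m.group(2) + ")",
        comment,
    )

print(upper_link_text("blablabla (btw, blargh!) [text](url) blabla"))
```

The rest of the comment still has to be uppercased separately, which is what the answers address.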

Input:

"blablabla (btw, blargh!) [text](url) blabla"

Desired output:

"BLABLABLA (BTW, BLARGH!) [TEXT](url) BLABLA"

Upvotes: 4

Views: 770

Answers (3)

Peter Brittain

Reputation: 13619

The previous answers rely on a global search and replace. This is flawed when the URL of a link also appears elsewhere in the text, because that other occurrence gets replaced too. A safer way to do it is to construct the new string as you go along.

For example, while this isn't elegant, it does the trick:

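import re

# Groups: text before the link, [link text], (url), the rest of the comment.
# re.DOTALL lets '.' match newlines, so multi-line comments are not cut short.
MARKDOWN_LINK_REGEX = re.compile(r'(.*?)(\[.*?\])(\(.*?\))(.*)', re.DOTALL)

def uppercase_comment(text):
    result = ""
    while True:
        match = MARKDOWN_LINK_REGEX.match(text)
        if match:
            # Uppercase the leading text and the link text, keep the URL as-is.
            result += match.group(1).upper() + match.group(2).upper()
            result += match.group(3)
            # Continue with whatever follows this link.
            text = match.group(4)
        else:
            # No link left: uppercase the remainder and finish.
            result += text.upper()
            return result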
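
To make the failure case concrete, here is a small sketch comparing the two strategies on a comment where the URL also appears as plain text. Both function names (`upper_global_replace`, `upper_by_spans`) are made up for this illustration, and the second is a variant of the build-as-you-go idea that uses match positions, not the function above:

```python
import re

LINK = re.compile(r'\[(.*?)\]\((.*?)\)')

def upper_global_replace(comment):
    # Uppercase everything, then restore each URL with a global replace.
    result = comment.upper()
    for m in LINK.finditer(comment):
        url = m.group(2)
        result = result.replace(url.upper(), url)
    return result

def upper_by_spans(comment):
    # Walk the matches left to right and leave each URL untouched
    # only at the position where it was actually found.
    parts, last = [], 0
    for m in LINK.finditer(comment):
        parts.append(comment[last:m.start(2)].upper())  # everything up to the URL
        parts.append(m.group(2))                         # the URL, case preserved
        last = m.end(2)
    parts.append(comment[last:].upper())                 # trailing text
    return "".join(parts)

comment = "see google.com or [my link](google.com)"
print(upper_global_replace(comment))  # plain-text mention is lowercased too
print(upper_by_spans(comment))        # only the URL inside the link is kept
```

With this input the global replace also restores the plain-text `google.com`, which was supposed to be uppercased, while the span-based version leaves only the URL inside the link untouched.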

Upvotes: 2

Two-Bit Alchemist

Reputation: 18467

This is the approach I took after some experimenting. I'm sure people will downvote me and let me know in comments if I've overlooked something just like I did to them.

Step 1: The Regex

You were almost right here, but I would make it non-greedy so that it doesn't treat two links separated by non-linked text as one.

import re

MARKDOWN_LINK_REGEX = re.compile(r'\[(.*?)\]\((.*?)\)')

Also, at least for the part in square brackets, don't assume there won't be spaces. People can link as many words as they want, even whole sentences.
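
A quick sketch of both points, comparing a greedy pattern, the non-greedy one, and the no-spaces pattern from the question on the same input. The constant names are made up for this illustration:

```python
import re

GREEDY = re.compile(r'\[(.*)\]\((.*)\)')
NON_GREEDY = re.compile(r'\[(.*?)\]\((.*?)\)')
NO_SPACES = re.compile(r'\[([^ ]*)\]\(([^ ]*)\)')  # pattern from the question

text = '[a few](Google.com) links [in](StackOverflow.com)'

# Greedy: both links are swallowed into a single match.
print(GREEDY.findall(text))

# Non-greedy: each link is matched on its own.
print(NON_GREEDY.findall(text))

# No spaces allowed: the first link is missed because its text has a space.
print(NO_SPACES.findall(text))
```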

Step 2: The Function

def uppercase_comment(comment):
    result = comment.upper()
    for match in re.finditer(MARKDOWN_LINK_REGEX, comment):
        upper_match = match.group(0).upper()
        corrected_link = upper_match.replace(match.group(2).upper(),
                                             match.group(2))
        result = result.replace(upper_match, corrected_link)
    return result

First, make the whole comment uppercase, then make corrections. For the corrections, I loop over all the non-overlapping matches the regex above produces (re.finditer) and restore the part in parentheses. My first version did this by lowercasing the URL but, as was pointed out in the comments, that will not always work: the original case must be preserved. The updated function therefore loops over matches in the original comment (which still has the original case), builds the corrected link from them, and substitutes it into the uppercase string created at the beginning.

Step 3: Testing it Out

if __name__ == '__main__':
    print(uppercase_comment('this comment has zero links'))
    print(uppercase_comment('this comment has [a link at the end](Google.com)'))
    print(uppercase_comment('[there is a link](Google.com) at the beginning'))
    print(uppercase_comment('there is [a link](Google.com) in the middle'))
    print(uppercase_comment('there are [a few](Google.com) links [in](StackOverflow.com) this one'))
    print(uppercase_comment("doesn't matter (it shouldn't!) if there are [extra parens](Google.com)"))

Produces

THIS COMMENT HAS ZERO LINKS
THIS COMMENT HAS [A LINK AT THE END](Google.com)
[THERE IS A LINK](Google.com) AT THE BEGINNING
THERE IS [A LINK](Google.com) IN THE MIDDLE
THERE ARE [A FEW](Google.com) LINKS [IN](StackOverflow.com) THIS ONE
DOESN'T MATTER (IT SHOULDN'T!) IF THERE ARE [EXTRA PARENS](Google.com)

Please comment with feedback and fixes. Note mine is in Python 3 but the only 3-specific thing I'm using is the print function.

Upvotes: 2

Toctave

Reputation: 149

Thanks to all of you. Here's a combined version of your suggestions that does everything I want it to:

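import re

def make_upper(comment):
    # Collect the URLs (the part in parentheses) with their original case.
    native_links = re.findall(r"\[.*?\]\((.*?)\)", comment)
    result = comment.upper()
    # Put each original URL back in place of its uppercased copy.
    for url in native_links:
        result = result.replace(url.upper(), url)
    return result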

I'm not sure it's the fastest way, but this code shouldn't run too often in the final algorithm, so that'll do. Thanks again!

Upvotes: 2
