GuillaumeA

Reputation: 3545

Extract text before two underscores with regex

I feel I am close to a solution, but I am still struggling to extract some text from variables with a regex. The conditions are:

Examples:

test_TEST_TEST_1_TEST_13DAHA bfd     -->   TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd    -->   TEST_TEST_1_TEST
test__TEST_TEST                      -->   TEST_TEST
test_TEST__DHJF                      -->   TEST
test_TEST__Ddsa                      -->   TEST
test__TEST                           -->   TEST

So far, I got

_([0-9A-Z_]+) works for the first one but not the second
_([0-9A-Z_]+)(?:__.*)+ works for the second one but not the first

Upvotes: 0

Views: 59

Answers (2)

Wiktor Stribiżew

Reputation: 626845

You say that you don't want the text after the two underscores once some text has already matched before them. So, after capturing the expected pattern, you can optionally consume (without capturing) the double underscore and all the uppercase letters, digits and underscores that follow it:

_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?

See the regex demo. The optional (?:__[0-9A-Z_]+)? part consumes the text that comes right after the captured value, so the regex engine cannot start a new match there.
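
To illustrate what the optional tail group does, here is a small sketch that compares the pattern above with a variant that drops the tail. The names WITH_TAIL, WITHOUT_TAIL and samples are only illustrative, and the variant without the tail is an assumption made for the comparison, not something taken from the question:

```python
import re

# Pattern from this answer: capture the value, then optionally consume the "__..." tail.
WITH_TAIL = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')
# Hypothetical variant without the tail, for comparison only.
WITHOUT_TAIL = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)')

samples = ["test_TEST__DHJF", "test_TEST_TEST_1_TEST__13DAHA bfd"]

for s in samples:
    # findall returns the Group 1 values of every non-overlapping match.
    print(s, "| without tail:", WITHOUT_TAIL.findall(s), "| with tail:", WITH_TAIL.findall(s))
```

Without the tail, re.findall also picks up the text after the double underscore as a second match (for example ['TEST', 'DHJF']); with the tail, that text is consumed as part of the first match, so only ['TEST'] is returned.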

You need to get Group 1 values rather than the whole match values, hence re.findall suits best here, especially if you expect multiple matches. If you expect a single match per string, use re.search.

See the Python demo:

import re
cases = ["test_TEST bfd",
    "test_TEST_TEST_1_TEST_13DAHA bfd",
    "test_TEST_TEST_1_TEST__13DAHA bfd",
    "test__TEST_TEST",
    "test_TEST__DHJF",
    "test_TEST__Ddsa"
]
pattern = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')
for case in cases:
    matches = pattern.findall(case)
    print(matches)

With re.search:

for case in cases:
    match = pattern.search(case)
    if match:
        print(match.group(1))

Output (re.search version; the re.findall loop prints the same values, each wrapped in a list):

TEST
TEST_TEST_1_TEST_13DAHA
TEST_TEST_1_TEST
TEST_TEST
TEST
TEST

Upvotes: 1

Tim Biegeleisen

Reputation: 521259

Use this pattern:

[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*

Sample script:

import re

inp = ["test_TEST_TEST_1_TEST_13DAHA bfd", "test_TEST_TEST_1_TEST__13DAHA bfd", "test__TEST_TEST", "test_TEST__DHJF"]
for i in inp:
    # The first match is the uppercase run that is followed by an underscore
    matches = re.findall(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*', i)
    print(i + " => " + matches[0])

This prints:

test_TEST_TEST_1_TEST_13DAHA bfd => TEST_TEST_1_TEST_13DAHA
test_TEST_TEST_1_TEST__13DAHA bfd => TEST_TEST_1_TEST
test__TEST_TEST => TEST_TEST
test_TEST__DHJF => TEST
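
One thing to keep in mind: the (?=_) lookahead requires an underscore right after the first uppercase/digit run, so an input such as test__TEST (from the question) produces no match, and matches[0] would then raise an IndexError. A minimal sketch of a guarded lookup follows; first_match, LOOKAHEAD and GROUPED are hypothetical names introduced here, and GROUPED is the pattern from the other answer, included only for comparison:

```python
import re
from typing import Optional

# Pattern from this answer (needs an underscore after the first uppercase run).
LOOKAHEAD = re.compile(r'[A-Z0-9]+(?=_)(?:_[A-Z0-9]+)*')
# Pattern from the other answer (value is in Group 1).
GROUPED = re.compile(r'_([0-9A-Z]+(?:_[0-9A-Z]+)*)(?:__[0-9A-Z_]+)?')

def first_match(pattern: "re.Pattern[str]", text: str) -> Optional[str]:
    """Return the first extracted value, or None when nothing matches."""
    m = pattern.search(text)
    if m is None:
        return None
    # Use Group 1 if the pattern defines a capturing group, else the whole match.
    return m.group(1) if pattern.groups else m.group(0)

for s in ["test_TEST__DHJF", "test__TEST"]:
    print(s, "=>", first_match(LOOKAHEAD, s), "|", first_match(GROUPED, s))
```

For test__TEST the lookahead pattern yields None while the grouped pattern yields TEST, so which one fits depends on whether such inputs can occur in your data.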

Upvotes: 1
