Reputation: 2481

How to extract specific strings using Python Regex

I have some very challenging strings that I have been struggling with.
For example,

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

Each string starts with an integer percentage, optionally followed by "for", and then the name of a pokemon. Next there is a comma (,) or an & sign, followed by another integer percentage and finally another pokemon name. (All names start with a capital letter.)
I want to extract the two pokemon names. For example, the expected results are:

['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']

I could create a list of all pokemon and then use the in operator, but that is not the best way (in case more pokemon are added). Is it possible to extract the names using regex?
Thanks in advance!
EDIT
As requested, I am adding my code,

str_list = [str1, str2, str3, str4, str5]

for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
print(temp_list)

I know this is not a regex solution. The only expression I tried is \d+, to extract the leading integer, but I have no idea how to continue from there.
EDIT2
@b_c pointed out a good edge case, so I am adding it here:

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'

Expected result:

['Pikachu', 'Pika Pika Pikachu']

Upvotes: 3

Answers (4)

Libra

Reputation: 2595

Use a positive lookbehind; this will work regardless of capitalization.

(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+

EDIT: Changed it to work in Python.
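
For reference, here is a minimal sketch of how this pattern might be applied with re.findall. Python's re module requires fixed-width lookbehinds, which is presumably why the pattern has one alternative for two digits and one for a single digit. The variable names are only illustrative, and as written the pattern picks up only names that directly follow "% for ".

```python
import re

# Pattern from the answer above: one fixed-width lookbehind per digit count.
pattern = re.compile(r"(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+")

str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'

with_for = pattern.findall(str1)  # both names follow "% for "
partial = pattern.findall(str2)   # "Arcanine" has no "for" before it

print(with_for)  # ['Pikachu', 'Sandshrew']
print(partial)   # ['Paras']
```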

Upvotes: 0

Matt M

Reputation: 710

An alternative method if you don't want to use regex and you don't want to rely on capitalization:

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList

This won't add words containing special characters. It also skips any word containing "for". Then it removes commas from the words it keeps.

Printing the result for str3 returns ['Diglett', 'Dugtrio']

EDIT: In light of the fact that there are apparently pokemon with two-word names and special characters, I made this slightly more convoluted version of the above code:

def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord:  # two name words in a row belong to the same pokemon
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
                prevWasWord = True
        else:
            prevWasWord = False
    return pokeList

If there's no "three word" pokemon, and the rules OP set remain constant, this should always work. Two name words in a row are joined into a single pokemon entry.

So printing the result for the string '30% for Mr. Mime & 20% for Type: Null' gives ['Mr. Mime', 'Type: Null']

Upvotes: 0

b_c

Reputation: 1212

Hopefully I didn't over-engineer this, but I wanted to cover the edge cases of the slightly more complicated pokemon names, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).

The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).

For a general summary, I'm looking for:

1+ digits followed by a %
A space or the word "for" at least once
(To start the capture) A starting capital letter
At least one of (ending the capture group):
- a word character, a period, the male/female symbols, or an apostrophe
  - Note: If you want to catch additional "weird" pokemon characters, like numbers, colon, etc., add them to this portion (the [\w\.♀♂'] bit).
- OR a space, but only if followed by a capital letter
A comma, space, or ampersand, any number of times
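
The summary above can also be sketched as a commented pattern using re.VERBOSE. This is a simplified restatement under my own assumptions, not the exact pattern from the answer: the outer repeated group is dropped, literal spaces are written as [ ] because verbose mode ignores bare whitespace, and the name POKEMON is just illustrative.

```python
import re

POKEMON = re.compile(r"""
    \d+%                # one or more digits followed by a percent sign
    (?:[ ]|for)+        # a space or the word "for", at least once
    (                   # start of the captured name
      [A-Z]             # a leading capital letter
      (?:
          [\w.♀♂']      # word character, period, gender symbol or apostrophe
        | [ ](?=[A-Z])  # or a space, but only when a capital letter follows
      )+
    )                   # end of the captured name
    [, &]*              # comma, space or ampersand, any number of times
""", re.VERBOSE)

names = POKEMON.findall('100% for Pikachu, 29% Pika Pika Pikachu')
print(names)  # ['Pikachu', 'Pika Pika Pikachu']
```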

Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized them into pairs (I replaced a couple names from your input with the complicated ones):

import re

str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'

pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]

# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])

for pair in pairs:
    print(pair)

This then prints out:

('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
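
If flattening and re-pairing feels fragile (it assumes exactly two names per string), one possible variation is to keep the matches grouped per input string. The sketch below reuses the same pattern; the names strings and names_per_string are only illustrative.

```python
import re

# Same pattern as in the answer above.
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"

strings = [
    '95% for Pikachu, 92% for Mr. Mime',
    '100% for Pikachu, 29% Pika Pika Pikachu',  # edge case from the question's EDIT2
]

# One list of names per input string, so no re-pairing step is needed.
names_per_string = [re.findall(pattern, s) for s in strings]

for names in names_per_string:
    print(names)
# ['Pikachu', 'Mr. Mime']
# ['Pikachu', 'Pika Pika Pikachu']
```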

Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)

Upvotes: 2

LeoE

Reputation: 2083

Since there seems to be no other upper-case letter in your strings, you can simply use [A-Z]\w+ as the regex. See regex101

Code:

import re

str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'

str_list = [str1, str2, str3, str4, str5]
regex = re.compile(r'[A-Z]\w+')  # raw string avoids the invalid-escape warning for \w
pokemon_list = []
for x in str_list:
    pokemon_list.append(re.findall(regex, x))
print(pokemon_list)

Output:

[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]
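
As a quick check of where this simple pattern stops working, here is a small sketch that runs it on the edge case from the question's EDIT2. Because \w does not match spaces, a multi-word name comes back as separate matches; the variable names are only illustrative.

```python
import re

regex = re.compile(r'[A-Z]\w+')

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'
edge_result = regex.findall(edge_str)

# Each capitalized word is matched on its own, so the
# three-word name is split into three entries.
print(edge_result)  # ['Pikachu', 'Pika', 'Pika', 'Pikachu']
```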

Upvotes: 1
