Reputation: 2481
I have some very challenging strings that I have been struggling with.
For example,
str1 = '95% for Pikachu, 92% for Sandshrew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
Each string starts with a percentage (an integer followed by %), optionally followed by for, and then the name of a pokemon. Next there may be a comma (,) or an & sign, followed by a new percentage. Finally there is another pokemon name. (All names start with a capital letter.)
I want to extract the two pokemon from each string, for example:
['Pikachu', 'Sandshrew']
['Paras', 'Arcanine']
['Diglett', 'Dugtrio']
['Squirtle', 'Alakazam']
['Metopod', 'Dewgong']
I could create a list of all pokemon and then use the in syntax, but that is not the best way (in case they add more pokemon). Is it possible to extract them using regex?
Thanks in advance!
EDIT
As requested, I am adding my code,
str_list = [str1, str2, str3, str4, str5]
for x in str_list:
    temp_list = []
    if 'for' in x:
        temp = x.split('% for', 1)[1].strip()
        temp_list.append(temp)
    else:
        temp = x.split(" ", 1)[1]
        temp_list.append(temp)
    print(temp_list)
I know this is not a regular expression. The only expression I have tried is \d+, to extract the leading integer, but I have no idea how to go on from there.
EDIT2
@b_c raised a good edge case, so I am adding it here:
edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'
result
['Pikachu', 'Pika Pika Pikachu']
Upvotes: 3
Views: 164
Reputation: 2595
Use a positive lookbehind; this will work regardless of capitalization.
(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+
EDIT: Changed it to work in Python.
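To see what this lookbehind does and does not pick up, here is a minimal sketch. The pattern is copied from this answer; the helper name extract_after_for is hypothetical and only used for illustration. Because both lookbehind branches require the literal text "% for ", a name that follows the percentage directly (without for) does not appear to be matched.

```python
import re

# Pattern copied from the answer: two fixed-width lookbehinds
# (two digits or one digit, then "% for "), then a run of ASCII letters.
PATTERN = re.compile(r"(?<=\d\d% for )[A-Za-z]+|(?<=\d% for )[A-Za-z]+")

def extract_after_for(text):
    # Hypothetical helper: return every name preceded by "<digits>% for ".
    return PATTERN.findall(text)

with_for = '95% for Pikachu, 92% for Sandshrew'
without_for = '70% for Paras & 100% Arcanine'

print(extract_after_for(with_for))     # ['Pikachu', 'Sandshrew']
print(extract_after_for(without_for))  # ['Paras']
```

Python's re module only accepts fixed-width lookbehinds, which is presumably why the pattern repeats itself for one and two digits instead of using \d+.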
Upvotes: 0
Reputation: 710
An alternate method if you don't want to use regex and you don't want to rely on capitalization:
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    for word in wordList:
        if not set('[~!@#$%^&*()_+{}":;\']+$').intersection(word) and 'for' not in word:
            pokeList.append(word.replace(',', ''))
    return pokeList
This won't add words with special chars. It also won't add any word that contains for. Then it removes commas from the found words.
Printing the result for str3 returns ['Diglett', 'Dugtrio']
EDIT: In light of the fact that there are apparently pokemon with two words and special chars, I made this slightly more convoluted version of the above code:
def pokeFinder(strng):
    wordList = strng.split()
    pokeList = []
    prevWasWord = False
    for word in wordList:
        if not set('%&').intersection(word) and 'for' not in word:
            clnWord = word.replace(',', '')
            if prevWasWord is True:  # 2 poke in a row means same poke
                pokeList[-1] = pokeList[-1] + ' ' + clnWord
            else:
                pokeList.append(clnWord)
            prevWasWord = True
        else:
            prevWasWord = False
    return pokeList
If the rules OP set remain constant, this should always work: two name words in a row are joined onto the previous pokemon (and since each further word is appended the same way, longer names are joined too).
So printing the result for the string '30% for Mr. Mime & 20% for Type: Null' gets
['Mr. Mime', 'Type: Null']
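The "consecutive words belong to the same pokemon" idea can also be sketched with itertools.groupby, which groups adjacent tokens for you. This is an alternative formulation, not the code from this answer: the names is_name_word and poke_finder are hypothetical, and it deliberately compares against the whole word 'for' instead of using a substring test, on the assumption that a name containing the letters "for" should still be kept.

```python
from itertools import groupby

def is_name_word(word):
    # Hypothetical helper: a token is part of a name unless it contains
    # '%' or '&', or is exactly the filler word 'for'.
    return not set('%&').intersection(word) and word != 'for'

def poke_finder(text):
    names = []
    # groupby yields runs of adjacent tokens sharing the same key, so each
    # run of name words becomes a single (possibly multi-word) name.
    for is_name, words in groupby(text.split(), key=is_name_word):
        if is_name:
            names.append(' '.join(w.rstrip(',') for w in words))
    return names

print(poke_finder('30% for Mr. Mime & 20% for Type: Null'))
print(poke_finder('100% for Pikachu, 29% Pika Pika Pikachu'))
```

Under these assumptions the second call covers the edge case from the question, since the three-word run is joined into one name.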
Upvotes: 0
Reputation: 1212
Hopefully I didn't over engineer this, but I wanted to cover the edge cases of the slightly-more-complicated named pokemon, such as "Mr. Mime", "Farfetch'd", and/or "Nidoran♂" (only looking at the first 151).
The pattern I used is (?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*, which looks to be working in my testing (here's the regex101 link for a breakdown).
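For a general summary, I'm looking for: one or more digits followed by %; then any run of spaces and/or for; then a name that starts with a capital letter and continues with word characters, periods, apostrophes, or the ♀/♂ signs (the [\w\.♀♂'] bit), where a space is only allowed inside a name if it is followed by another capital letter; and finally any trailing commas, spaces, or & signs.
Unless it's changed, Python's builtin re module doesn't support repeated capture groups (which I believe I did correctly), so I just used re.findall and organized the matches into pairs (I replaced a couple names from your input with the complicated ones):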
import re
str1 = '95% for Pikachu, 92% for Mr. Mime'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = "10% Squirtle, 100% for Farfetch'd"
str5 = '30% Metopod & 99% Nidoran♂'
pattern = r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"
# Find matches in each string, then unpack each list of
# matches into a flat list
all_matches = [match
               for s in [str1, str2, str3, str4, str5]
               for match in re.findall(pattern, s)]
# Pair up the matches
pairs = zip(all_matches[::2], all_matches[1::2])
for pair in pairs:
    print(pair)
This then prints out:
('Pikachu', 'Mr. Mime')
('Paras', 'Arcanine')
('Diglett', 'Dugtrio')
('Squirtle', "Farfetch'd")
('Metopod', 'Nidoran♂')
Also, as was already mentioned, you do have a few typos in the pokemon names, but a regex isn't the right fix for that unfortunately :)
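As a possible variation, the flatten-then-zip step can be skipped by calling findall once per string, which keeps each string's names together even if a string happens to hold more or fewer than two. The sketch below reuses the pattern from this answer unchanged; the helper name names_per_string is hypothetical, and the second sample is the edge case from the question.

```python
import re

# Pattern copied verbatim from the answer above.
PATTERN = re.compile(
    r"(?:(?:\d+%(?: |for)+([A-Z](?:[\w\.♀♂']|(?: (?=[A-Z])))+))+)[, &]*"
)

def names_per_string(strings):
    # One findall call per input string: with a single capture group,
    # findall returns just the captured names for that string.
    return [PATTERN.findall(s) for s in strings]

samples = [
    '95% for Pikachu, 92% for Mr. Mime',
    '100% for Pikachu, 29% Pika Pika Pikachu',  # edge case from the question
]
for names in names_per_string(samples):
    print(names)
# ['Pikachu', 'Mr. Mime']
# ['Pikachu', 'Pika Pika Pikachu']
```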
Upvotes: 2
Reputation: 2083
Since there seems to be no other upper-case letter in your strings, you can simply use [A-Z]\w+ as the regex.
See regex101
Code:
import re
str1 = '95% for Pikachu, 92% for Sandsherew'
str2 = '70% for Paras & 100% Arcanine'
str3 = '99% Diglett, 40% Dugtrio'
str4 = '10% Squirtle, 100% for Alakazam'
str5 = '30% Metopod & 99% Dewgong'
str_list = [str1, str2, str3, str4, str5]
regex = re.compile(r'[A-Z]\w+')  # raw string so \w is not treated as a string escape
pokemon_list = []
for x in str_list:
    pokemon_list.append(re.findall(regex, x))
print(pokemon_list)
Output:
[['Pikachu', 'Sandsherew'], ['Paras', 'Arcanine'], ['Diglett', 'Dugtrio'], ['Squirtle', 'Alakazam'], ['Metopod', 'Dewgong']]
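For the multi-word edge case added to the question, [A-Z]\w+ on its own splits the name into separate words. A small sketch of one way to extend it follows; the second pattern is an assumption of mine, not part of this answer, and it still would not keep names with punctuation such as "Mr. Mime" in one piece.

```python
import re

simple = re.compile(r'[A-Z]\w+')
# Assumed extension: after the first capitalized word, also accept further
# capitalized words separated by single spaces (non-capturing group, so
# findall returns the whole match).
multi_word = re.compile(r'[A-Z]\w+(?: [A-Z]\w+)*')

edge_str = '100% for Pikachu, 29% Pika Pika Pikachu'

print(simple.findall(edge_str))      # ['Pikachu', 'Pika', 'Pika', 'Pikachu']
print(multi_word.findall(edge_str))  # ['Pikachu', 'Pika Pika Pikachu']
```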
Upvotes: 1