Reputation: 371
I'm trying to extract a string pattern as described below.
What I have = What I need
FII JS REAL CI ER 10 115,06 = FII JS REAL CI
IRBBRASIL RE ONNM 100 19,10 = IRBBRASIL RE ON
MAGAZ LUIZA ONNM 40 38,00 = MAGAZ LUIZA ON
IT NOW IBOV CI 3 103,14 = IT NOW IBOV CI
01/00 IT NOW IBOV CI 40 103,13 = IT NOW IBOV CI
ITAUSA PN EDJ N1 1.600 12,14 = ITAUSA PN
FII JS REAL CI # 10 120,00 = FII JS REAL CI
MAGAZ LUIZA PNNM 40 38,00 = MAGAZ LUIZA PN
01/00 PETROLEO BRA PN 30 14,48 = PETROLEO BRA PN
01/00 PETROLEO BRA PN = PETROLEO BRA PN
AMBEV S/A ON = AMBEV S/A ON
When I try
wordRegex = re.compile(r'(.*)((O|P)N|UNT|CI)')
word = wordRegex.search(cell_value).group(0).strip()
It works in almost every case, except the ones where I have numbers or dates at the beginning.
If I try something like (?<=\/\d{2}\s)(.*)((O|P)N|UNT|CI)
it will (obviously) only work in such cases.
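To make the two behaviours concrete, here is a minimal reproduction using two of the sample rows above. The variable names are only for illustration; the patterns are the ones quoted in the question.

```python
import re

# Original pattern: the greedy (.*) starts at the beginning of the string,
# so a leading date ends up inside the match.
greedy = re.compile(r'(.*)((O|P)N|UNT|CI)')
# Lookbehind variant: can only match right after "/dd ".
after_date = re.compile(r'(?<=\/\d{2}\s)(.*)((O|P)N|UNT|CI)')

plain = "FII JS REAL CI ER 10 115,06"
dated = "01/00 IT NOW IBOV CI 40 103,13"

plain_greedy = greedy.search(plain).group(0).strip()
dated_greedy = greedy.search(dated).group(0).strip()
plain_after_date = after_date.search(plain)  # no date prefix, so no match
dated_after_date = after_date.search(dated).group(0).strip()

print(plain_greedy)       # FII JS REAL CI
print(dated_greedy)       # 01/00 IT NOW IBOV CI  (date is kept)
print(plain_after_date)   # None
print(dated_after_date)   # IT NOW IBOV CI
```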
I need help to figure out a pattern that works every time.
Upvotes: 1
Views: 861
Reputation: 25686
This returns what you've listed
([A-Z][A-Z \/]*(CI|ON|PN|UNT))
See it working on regex101.com.
NOTE: The initial [A-Z] is there to prevent the returned group from starting with a space. If you are OK with trimming the result (strip() in Python), then you can remove it.
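As a rough sketch of how this pattern could be used from Python (the question's language), assuming re.search and a handful of the rows from the question; the names first_pattern, rows and results are made up for the example:

```python
import re

# Pattern from this answer; the leading [A-Z] forces the match to start on a letter.
first_pattern = re.compile(r'([A-Z][A-Z \/]*(CI|ON|PN|UNT))')

rows = [
    "FII JS REAL CI ER 10 115,06",
    "IRBBRASIL RE ONNM 100 19,10",
    "01/00 IT NOW IBOV CI 40 103,13",
    "ITAUSA PN EDJ N1 1.600 12,14",
    "AMBEV S/A ON",
]

# search() scans for the first position where the pattern can match,
# which skips the leading "01/00 " because digits are not in [A-Z].
results = [first_pattern.search(row).group(1) for row in rows]

for name in results:
    print(name)
```

Because the match always starts on an uppercase letter, no strip() call is needed here.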
Update: added Portuguese characters as requested.
([A-ZÀÁÂÃÉÊÍÓÔÕÚÜ \/.]*(CI|ON|PN|UNT))
I got the Portuguese characters from this list. I manually added each one in because it's not all the characters from 192-220. If you are OK with adding in the other characters, you can do another range.
Also note that I removed the initial [A-Z], because it gets messy once you add in all the new characters. As a result, you'll have to trim the result (strip() in Python) to clean up any spaces at the start of the string.
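A small sketch of the trimming step in Python, assuming re.search. The row containing "SÃO MARTINHO" is invented for this example and does not come from the question's data.

```python
import re

# Pattern from the update; without a leading [A-Z] the match may begin with a space.
accented = re.compile(r'([A-ZÀÁÂÃÉÊÍÓÔÕÚÜ \/.]*(CI|ON|PN|UNT))')

# Hypothetical row with an accented name and a leading date.
row = "01/00 SÃO MARTINHO ONNM 100 25,00"

raw = accented.search(row).group(1)
clean = raw.strip()

print(repr(raw))    # ' SÃO MARTINHO ON'
print(repr(clean))  # 'SÃO MARTINHO ON'
```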
Upvotes: 1
Reputation: 163517
You could also start the match by skipping any characters other than A-Z and a-z, then use a capture group that starts with an uppercase char A-Z and ends with one of the alternatives.
^[^A-Za-z]*([A-Z/]+(?:\s+[A-Z/]+)*\s+(?:[OP]N|UNT|CI))
The pattern matches:
^ - Start of string
[^A-Za-z]* - Optionally match any char except a char A-Z a-z
( - Capture group 1
[A-Z/]+ - Match 1+ times a char A-Z or /
(?:\s+[A-Z/]+)* - Optionally repeat matching 1+ whitespace chars and 1+ chars A-Z or /
\s+(?:[OP]N|UNT|CI) - Match 1+ whitespace chars and one of the alternatives
) - Close group 1
See a regex demo or a Python demo.
For example
import re
strings = [
    "FII JS REAL CI ER 10 115,06",
    "IRBBRASIL RE ONNM 100 19,10",
    "MAGAZ LUIZA ONNM 40 38,00",
    "IT NOW IBOV CI 3 103,14",
    "01/00 IT NOW IBOV CI 40 103,13",
    "ITAUSA PN EDJ N1 1.600 12,14",
    "FII JS REAL CI # 10 120,00",
    "MAGAZ LUIZA PNNM 40 38,00",
    "01/00 PETROLEO BRA PN 30 14,48",
    "01/00 PETROLEO BRA PN",
    "AMBEV S/A ON"
]
pattern = r"^[^A-Za-z]*([A-Z/]+(?:\s+[A-Z/]+)*\s+(?:[OP]N|UNT|CI))"
for s in strings:
    m = re.match(pattern, s)
    if m:
        print(m.group(1))
Output
FII JS REAL CI
IRBBRASIL RE ON
MAGAZ LUIZA ON
IT NOW IBOV CI
IT NOW IBOV CI
ITAUSA PN
FII JS REAL CI
MAGAZ LUIZA PN
PETROLEO BRA PN
PETROLEO BRA PN
AMBEV S/A ON
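If the values come from spreadsheet cells, as the cell_value variable in the question suggests, it may help to wrap the pattern in a small helper that returns None instead of raising when a cell does not match. This is only a sketch; extract_name and NAME_RE are hypothetical names, not part of the answer above.

```python
import re

# Same pattern as above, compiled once for reuse.
NAME_RE = re.compile(r"^[^A-Za-z]*([A-Z/]+(?:\s+[A-Z/]+)*\s+(?:[OP]N|UNT|CI))")


def extract_name(cell_value):
    """Return the captured name, or None when the pattern does not match."""
    m = NAME_RE.match(cell_value)
    return m.group(1) if m else None


print(extract_name("01/00 PETROLEO BRA PN 30 14,48"))  # PETROLEO BRA PN
print(extract_name("12345"))                           # None
```

Checking for None avoids the AttributeError that calling .group() directly on a failed search would raise.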
Upvotes: 2