Reputation: 58
OK, so I asked a question not long ago, but I forgot how delicate regex is and showed the string in the wrong format.
The problem is, I receive a huge disorganized text that is all in one line.
In this line I have 2 different "blocks" I need: "Most frequent senders" and "Most frequent recipients".
As I said, it's all in one straight line, kinda like this:
string = """
Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 time(s) in total of: R$10.000,00 More text after this. """
As you can see, this is terribly disorganized but it's how I receive it.
Basically, for each entry I'm trying to get the name of the person, the ID (which can follow one of 2 patterns: xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times and the amount (in BRL, so R$).
I found a way to get the IDs, but that's it, nothing more.
import re
r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
print(r.findall(string))
Any help would be very much appreciated.
Upvotes: 1
Views: 58
Reputation: 12701
Assuming the name of the person is always uppercase and is preceded by a digit (or by a colon for the first entry of each block) followed by whitespace:
r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')
Note: You had unnecessary spaces in your original regex before/after the IDs. You should get more matches with this one.
Also, you'll get more readable output, one match per line, with the following command:
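Since the one-line pattern above is hard to read, here is a sketch of the same pattern written with re.VERBOSE and named groups. The group names (name, id, times, amount) and the sample string are made up for illustration; the matching logic is meant to be the same as in the pattern above, under the same assumptions about the input.

```python
import re

# Verbose restatement of the pattern above. Group names are illustrative.
ENTRY = re.compile(r"""
    (?<=[\d:])\s+                  # an entry follows a digit or a colon, then whitespace
    (?P<name>[A-Z ]*)\ -\          # uppercase name, then the literal " - "
    (?P<id>
        [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}-?[0-9]{2}             # xxx.xxx.xxx-xx
      | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}/?[0-9]{4}-?[0-9]{2}   # xx.xxx.xxx/xxxx-xx
    )
    .*?-\ (?P<times>\d*)\s         # skip the parenthesised text, then the count
    .*?:\ R\$(?P<amount>[\d.,]+)   # the amount that follows ": R$"
""", re.VERBOSE)

# Hypothetical sample in the same shape as the string in the question.
sample = (
    "Most frequent senders: JOHN DOE - 01.234.567/0001-89 (ACME LTDA) - 14 time(s) "
    "in total of: R$10.000,00 JANE ROE - 012.345.678-90 (PERSONAL) - 30 time(s) "
    "in total of: R$2.500,50 More text."
)

# One dict per entry, keyed by group name.
rows = [m.groupdict() for m in ENTRY.finditer(sample)]
for row in rows:
    print(row)
```

Named groups make it easier to see which captured value is which, compared to the positional tuples that findall returns.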
print(*r.findall(string), sep='\n')
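If you also need to know which matches belong to senders and which to recipients, one option is to cut the text at the two headings first and run the same pattern on each part. This is only a sketch: it assumes each heading appears exactly once with senders first, and the split_blocks helper and the sample text are hypothetical.

```python
import re

# The pattern from the answer, unchanged.
r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')


def split_blocks(text):
    """Return (senders, recipients) as lists of (name, id, times, amount) tuples."""
    # Drop everything up to the senders heading; the remainder starts with ":".
    _, _, rest = text.partition("Most frequent senders")
    # Cut the remainder at the recipients heading.
    senders_part, _, recipients_part = rest.partition("Most frequent recipients")
    return r.findall(senders_part), r.findall(recipients_part)


# Hypothetical sample in the same shape as the string in the question.
text = (
    "Huge text Most frequent senders: JOHN DOE - 01.234.567/0001-89 (ACME LTDA) - 14 "
    "time(s) in total of: R$10.000,00 JANE ROE - 012.345.678-90 (PERSONAL) - 30 "
    "time(s) in total of: R$2.500,50 Most frequent recipients: MARY MAJOR - "
    "012.345.678-90 (PERSONAL) - 3 time(s) in total of: R$1.000,00 More text after this."
)

senders, recipients = split_blocks(text)
print("Senders:", *senders, sep="\n")
print("Recipients:", *recipients, sep="\n")
```

The amounts come back as strings in Brazilian format (dot for thousands, comma for decimals), so they would still need converting before doing any arithmetic.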
Upvotes: 1