Gabriel Amarante

Reputation: 58

Using regex to get different groups in pattern

OK, so I asked a question not long ago, but I forgot that regex is very delicate and I showed the string in the wrong format.

The problem is, I receive a huge disorganized text that is all in one line.

In this line I have two different "blocks" that I need: "Most frequent senders" and "Most frequent recipients".

As I said, it's all in one straight line, kinda like this:

 string = """ 
Huge text etc etc etc Most frequent senders: NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 times in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 times in total of: R$10.000,00 Most frequent recipients:     NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 14 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT) - 30 time(s) in total of: R$10.000,00 NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT) - 10 time(s) in total of: R$10.000,00 More text after this.  """

As you can see, this is terribly disorganized but it's how I receive it.

Basically, what I'm trying to do is get the name of the person, the ID (which can follow two patterns: xx.xxx.xxx/0001-xx or xxx.xxx.xxx-xx), the number of times, and the amount (in BRL, so R$).

I found a way to get the IDs, but that is all.

    import re

    r = re.compile(r' [0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2} | [0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2} ')
    print(r.findall(string))

Any help would be very much appreciated.

Upvotes: 1

Views: 58

Answers (1)

Tranbi

Reputation: 12701

Assuming the name of the person is always uppercase and is preceded by a digit (or by ":" for the first entry) followed by whitespace:

    r = re.compile(r'(?<=[\d:])\s+([A-Z ]*) - ([0-9]{3}\.?[0-9]{3}\.?[0-9]{3}\-?[0-9]{2}|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}\/?[0-9]{4}\-?[0-9]{2}).*?- (\d*)\s.*?: R\$([\d\.,]+)')

Note: You had unnecessary spaces before and after the IDs in your original regex. You should get more matches with this one.

You'll also get more readable output by printing one match per line:

    print(*r.findall(string), sep='\n')
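If you also need to know which block each entry came from, one possible approach (just a sketch, not the only way) is to cut the text at the two headings first and then run the same pattern on each part with named groups. The names `text`, `ENTRY` and `parse_blocks` are made up for this example, the sample is a shortened copy of the string from the question, and it assumes both headings appear exactly once and in this order:

```python
import re

# Shortened sample in the same shape as the question's input.
text = (
    "Huge text etc etc etc Most frequent senders: "
    "NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT)"
    " - 14 time(s) in total of: R$10.000,00 "
    "NAME OF THE PERSON - 012.345.678-90 (SOME RANDOM UPPERCASE TEXT)"
    " - 30 times in total of: R$10.000,00 "
    "Most frequent recipients:     "
    "NAME OF THE PERSON - 01.234.567/0001-89 (SOME RANDOM UPPERCASE TEXT)"
    " - 10 time(s) in total of: R$10.000,00 "
    "More text after this."
)

# Same pattern as above, with named groups instead of positional ones.
ENTRY = re.compile(
    r"(?<=[\d:])\s+(?P<name>[A-Z ]*) - "
    r"(?P<id>[0-9]{3}\.?[0-9]{3}\.?[0-9]{3}-?[0-9]{2}"
    r"|[0-9]{2}\.?[0-9]{3}\.?[0-9]{3}/?[0-9]{4}-?[0-9]{2})"
    r".*?- (?P<times>\d*)\s.*?: R\$(?P<amount>[\d.,]+)"
)


def parse_blocks(text):
    """Split the text at the two headings and parse each part separately."""
    # Everything after the senders heading (starts with ": NAME ...").
    _, _, after_senders = text.partition("Most frequent senders")
    # Cut that remainder at the recipients heading.
    senders_part, _, recipients_part = after_senders.partition(
        "Most frequent recipients"
    )
    return {
        "senders": [m.groupdict() for m in ENTRY.finditer(senders_part)],
        "recipients": [m.groupdict() for m in ENTRY.finditer(recipients_part)],
    }


result = parse_blocks(text)
for block, entries in result.items():
    print(block)
    for entry in entries:
        print("  ", entry)
```

Each entry comes back as a dict with the keys name, id, times and amount (all strings), so you can convert times with int() or clean up the amount afterwards if you need numbers.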

Upvotes: 1
