Ajax

Reputation: 179

Extracting numbers till a certain paragraph using multi condition Regex in python

I am new to regex. I am trying to extract data from log files, and each file contains text like this:

crt - 00:00:00 up 200 days, 23:35, 0 users, load average: 0.04, 0.05, 0.02
Tasks: 300 total, 2 running, 298 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.0%us, 2.5%sy, 0.0%ni, 89.2%id, 0.0%hi, 0.1%si, 0.0%st
Mem: 123456K total, 1234567k used, 989991k free, 11156793k buffers
Swap: 456K total, 30897564k used, 785431k free, 23445897k cached

PID User Pr NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND

I want to extract only the numeric values that appear before the word cached. For this I am building a separate pattern for each value and then collecting the matches into a list using finditer. My code so far:

[x.group() for x in re.finditer(r"(\d{2}:\d{2}:\d{2})|(\d+\.\d+?)%id", text)]

This is only a fragment of the regex: I have to spell out a pattern for every value, including its prefix or suffix string. Is there a more efficient way to get this output?

desired_values=[00:00:00, 200, 23:35, 0, 0.04, 0.05, 0.02 , 
               300, 2, 298, 0, 0, 
               12.0, 2.5, 0.0, 89.2, 0.0, 0.1, 0.0, 
               123456, 1234567, 989991, 11156793, 
               456, 30897564, 785431, 23445897]

I then insert these values into a database, which is why they need to be in a list.

Upvotes: 1

Views: 110

Answers (1)

Wiktor Stribiżew

Reputation: 627087

You may use

r'(?s)(?<!\d)(?:\d{2}:\d{2}(?::\d{2})?|\d*\.?\d+)(?!\d)(?=.*\bcached\b)'

See the regex demo

Details

  • (?<!\d) - no digit immediately to the left is allowed
  • (?:\d{2}:\d{2}(?::\d{2})?|\d*\.?\d+) - either of
    • \d{2}:\d{2}(?::\d{2})? - 2 digits, :, 2 digits and then an optional sequence of : and 2 digits
    • | - or
    • \d*\.?\d+ - 0+ digits, an optional . and then 1+ digits
  • (?!\d) - no digit immediately to the right is allowed
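  • (?=.*\bcached\b) - there must be a whole word cached somewhere to the right of the current location. Because of (?s) (or re.S in the Python demo), . also matches line breaks, so the word may be on a later line.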
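
To make the breakdown above easier to follow, here is the same pattern written with re.VERBOSE so that each part carries its own comment. This is only a sketch: the name NUMBER_BEFORE_CACHED and the short sample string are made up for illustration and are not part of the original answer.

```python
import re

# Same pattern as above, laid out with re.VERBOSE (whitespace and # comments are ignored).
# re.DOTALL is the long name of re.S and plays the role of the inline (?s).
NUMBER_BEFORE_CACHED = re.compile(r"""
    (?<!\d)                     # no digit immediately to the left
    (?:
        \d{2}:\d{2}(?::\d{2})?  # HH:MM or HH:MM:SS
      |
        \d*\.?\d+               # integer or decimal, e.g. 12, 0.5 or .5
    )
    (?!\d)                      # no digit immediately to the right
    (?=.*\bcached\b)            # the word cached must still lie ahead
""", re.VERBOSE | re.DOTALL)

# Hypothetical sample: the 7 after "cached" must not be returned.
sample = "load 0.04, 23:35\n10k cached\nPID 7"
matches = NUMBER_BEFORE_CACHED.findall(sample)
print(matches)  # ['0.04', '23:35', '10']
```

The trailing 7 is skipped because no cached follows it, so the lookahead fails at that position.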

Python demo:

import re
text = r"""crt - 00:00:00 up 200 days, 23:35, 0 users, load average: 0.04, 0.05, 0.02
Tasks: 300 total, 2 running, 298 sleeping, 0 stopped, 0 zombie
Cpu(s): 12.0%us, 2.5%sy, 0.0%ni, 89.2%id, 0.0%hi, 0.1%si, 0.0%st
Mem: 123456K total, 1234567k used, 989991k free, 11156793k buffers
Swap: 456K total, 30897564k used, 785431k free, 23445897k cached
 
PID User Pr NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND"""
print( re.findall(r'(?<!\d)(?:\d{2}:\d{2}(?::\d{2})?|\d*\.?\d+)(?!\d)(?=.*\bcached\b)', text, re.S) )

Output:

['00:00:00', '200', '23:35', '0', '0.04', '0.05', '0.02', '300', '2', '298', '0', '0', '12.0', '2.5', '0.0', '89.2', '0.0', '0.1', '0.0', '123456', '1234567', '989991', '11156793', '456', '30897564', '785431', '23445897']
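
As a possible variation (not part of the answer above), the text can be cut at the word cached first and the number pattern applied to that prefix without the lookahead, which avoids rescanning the rest of the text at every candidate match. The helper name numbers_before and its stop_word parameter are hypothetical. Note one assumption: this sketch cuts at the first occurrence of cached, whereas the lookahead version accepts any number that has some cached after it, so the two only agree when the word occurs once.

```python
import re

# Number pattern from the answer, minus the (?=.*\bcached\b) lookahead.
NUMBER = re.compile(r"(?<!\d)(?:\d{2}:\d{2}(?::\d{2})?|\d*\.?\d+)(?!\d)")

def numbers_before(text, stop_word="cached"):
    """Hypothetical helper: return numbers found before the first whole-word stop_word."""
    stop = re.search(r"\b" + re.escape(stop_word) + r"\b", text)
    if stop is None:
        # Mirror the lookahead version: no stop word means no matches.
        return []
    return NUMBER.findall(text[:stop.start()])

# Shortened sample based on the last lines of the log in the question.
sample = "Swap: 456K total, 30897564k used, 785431k free, 23445897k cached\n\nPID 42 User"
matches = numbers_before(sample)
print(matches)  # ['456', '30897564', '785431', '23445897']
```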

Upvotes: 1
