Josh41

Reputation: 51

Python regular expression for multiple split criteria

I'm struggling to split some text in a piece of code that I'm writing. The software scans about 3.5 million lines of text, and the line formats vary throughout the file.

I'm still working my way through the file, but the line below appears to be fairly typical:

EXAMPLE_FILE_TEXT ID="20211111.111111 11111"

I want to split it as follows:

EXAMPLE_FILE_TEXT, ID, 20211111.111111 11111

As much as possible, I'd prefer to avoid hard-coding any specific text to look for, since I'm still parsing the file and trying to determine all the different variables. I've tried the following code:

conditioned_line = re.sub(r'(\w+=)(\w+)', r'\1"\2"', input_line)
output = shlex.split(conditioned_line)

When I run this code, I'm getting this output:

['EXAMPLE_FILE_TEXT', 'ID=20211111.111111 11111']

I've managed to split each element individually, but I haven't managed to split them all in a single pass. I suspect this is manageable with a regular expression, or with a regular expression plus a shlex split, but I could really use some suggestions if anyone has ideas.

As requested, here's another example of some text that's in the file I'm scanning:

EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""

This should separate to the following:

EXAMPLE_TEXT, TAG, AB-123-ABCD_$B, ABCDE_ABCD, ABCD_A, ABCDEF_ABCDE, ABCDEF_ABCDEF_$A, ABCDEFGH

Upvotes: 2

Views: 84

Answers (2)

Wiktor Stribiżew

Reputation: 627083

I suggest a tokenizing approach with regex: create a regex with alternations, starting with the most specific ones, and ending with somewhat generic ones.

In your case, you may try

import re
x = 'EXAMPLE_FILE_TEXT ID="20211111.111111 11111"'
res = re.findall(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)', x)
print( ["".join(r) for r in res] )
# => ['EXAMPLE_FILE_TEXT', 'ID', '20211111.111111 11111']


The regex matches

  • "([^"]*)" - Group 1: a string between two double quotes: the first " matches the opening quote, ([^"]*) captures zero or more chars other than ", and the final " matches the closing quote (NOTE: to match a quoted string with escaped-quote support, use "([^"\\]*(?:\\.[^"\\]*)*)"; add a similar pattern for single quotes if needed)
  • | - or
  • (\d+(?:\.\d+)*) - Group 2: one or more digits and then zero or more sequences of . and one or more digits
  • | - or
  • (\w+) - Group 3: one or more word chars.
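As a rough sketch of how the same alternation behaves on the longer sample line from the question: the helper name `tokenize` and its `keep_empty` flag are made up for illustration. Note that an empty quoted value (`""`) produces an empty token, so this sketch drops empty tokens by default to match the output the question asks for.

```python
import re

# Same alternation as above: quoted string, dotted number, or bare word.
TOKEN_RE = re.compile(r'"([^"]*)"|(\d+(?:\.\d+)*)|(\w+)')

def tokenize(line, keep_empty=False):
    """Hypothetical helper: return the tokens of `line`, quotes stripped."""
    # findall returns one 3-tuple per match; only one group is non-empty,
    # so joining the tuple yields the matched token.
    tokens = ["".join(groups) for groups in TOKEN_RE.findall(line)]
    # An empty quoted value ("") yields an empty token; drop it unless asked.
    return tokens if keep_empty else [t for t in tokens if t]

line = ('EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" '
        'ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""')
print(tokenize(line))
# => ['EXAMPLE_TEXT', 'TAG', 'AB-123-ABCD_$B', 'ABCDE_ABCD', 'ABCD_A',
#     'ABCDEF_ABCDE', 'ABCDEF_ABCDEF_$A', 'ABCDEFGH']
```

Passing `keep_empty=True` keeps the trailing empty string, which may be preferable if you need to tell `ABCDEFGH=""` apart from a bare `ABCDEFGH`.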

Upvotes: 2

Bhargav

Reputation: 4107

What you can try is

import re

text = 'EXAMPLE_FILE_TEXT ID="20211111.111111 11111"'
pattern = r'(\w+)\s+(\w+)="([^"]*)"'
matches = re.findall(pattern, text)

if matches:
    # findall returns one tuple of groups per match; take the first match
    result = list(matches[0])
    print(result)

which results in

['EXAMPLE_FILE_TEXT', 'ID', '20211111.111111 11111']

Explanation

(\w+) - Captures the first word (EXAMPLE_FILE_TEXT)
\s+ - Matches the whitespace that follows
(\w+) - Captures the key (ID)
=" - Matches the equals sign and the opening quote
([^"]*) - Captures everything inside the quotes
" - Matches the closing quote

If you want the output formatted as in your question, join the result:

print(','.join(result))

Results

EXAMPLE_FILE_TEXT,ID,20211111.111111 11111
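The pattern above matches exactly one KEY="value" pair per line. For lines that carry several pairs, like the second sample in the question, one possible variation is to split off the leading word and then collect every pair with `re.findall`. This is only a sketch: `parse_record` is a made-up helper name, and it assumes every value is double-quoted and contains no embedded double quotes.

```python
import re

# One KEY="value" pair; the value may be empty.
PAIR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_record(line):
    """Hypothetical helper: return (record_name, [(key, value), ...])."""
    # Split the leading word from the rest of the line on any whitespace.
    parts = line.split(maxsplit=1)
    name = parts[0] if parts else ""
    rest = parts[1] if len(parts) > 1 else ""
    # findall with two groups returns a list of (key, value) tuples.
    return name, PAIR_RE.findall(rest)

samples = (
    'EXAMPLE_FILE_TEXT ID="20211111.111111 11111"',
    'EXAMPLE_TEXT TAG="AB-123-ABCD_$B" ABCDE_ABCD="ABCD_A" '
    'ABCDEF_ABCDE="ABCDEF_ABCDEF_$A" ABCDEFGH=""',
)
for sample in samples:
    name, pairs = parse_record(sample)
    print(name, pairs)
```

Keeping keys and values paired makes it straightforward to build a dict per line with `dict(pairs)` if the keys within a line are unique.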

Upvotes: 2
