Crusader

Reputation: 333

Extract information from the string and convert to list

I have a string like the one below:

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)

I want to extract the value of "X" and its associated text from each line and put them in a list. Please see the expected output below:

Expected Output:

['X=250.44','DECEMBER 31,']
['X=307.5','respectively. The net decrease in the revenue']
['X=49.5','(US$ in millions)']

How can we approach this in Python?

My approach:

mylist = []
for line in data.split("\n"):
    if line.strip():
        x_coord = re.findall('^(X=.*)\,$', line)
        text = re.findall('^(]\w +)', line)
        mylist.append([x_coord, text])

My approach does not find any value for either x_coord or text.

Upvotes: 3

Views: 74

Answers (3)

jupiterbjy

Reputation: 3550

Try this pattern:

(X=[^,]*)(?:.*])(.*)

In Python:

import re

source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')

pattern = r"(X=[^,]*)(?:.*])(.*)"

for line in source:
    print(re.search(pattern, line).groups())

Output:

('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')

Your expected output keeps the X= prefix, so I left it inside the first capture group. If you only need the number, move X= outside the group, e.g. X=([^,]*).
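To get the list-of-lists shape shown in the question, the same pattern can be wrapped in a small helper. This is only a sketch: the names `PATTERN`, `extract_pairs`, and `sample` are made up for illustration, and it assumes the text after the final `]` never contains another `]` (the greedy `.*` skips to the last one on the line).

```python
import re

# Same pattern as above: X=<value up to the comma>, skip to the last ']', then the text.
PATTERN = re.compile(r"(X=[^,]*)(?:.*])(.*)")


def extract_pairs(data):
    """Return an [x_coord, text] list for every line that matches."""
    pairs = []
    for line in data.splitlines():
        match = PATTERN.search(line)
        if match:  # blank or malformed lines are skipped instead of raising
            pairs.append(list(match.groups()))
    return pairs


# Two lines from the question, separated by a blank line as in the original data.
sample = (
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,\n"
    "\n"
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)"
)

for pair in extract_pairs(sample):
    print(pair)
```

Checking `match` before calling `.groups()` avoids the `AttributeError` that `re.search(...).groups()` raises on a line that does not match, such as the blank lines in the question's data.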

Upvotes: 2

Jan Stránský

Reputation: 1691

re solution:

import re

# Sample lines from the question.
lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    # Raw string so that \d and \] reach the regex engine unchanged.
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()

output = [extract(item) for item in lines]
print(output)

Output (reformatted for readability):

[
    ('X=250.44', 'DECEMBER 31,'),
    ('X=307.5', 'respectively. The net decrease in the revenue'),
    ('X=49.5', '(US$ in millions)'),
]

Explanation:

  • \d ... digit
  • \d+ ... one or more digits
  • (?:...) ... non-capturing ("normal") parentheses
  • \.\d* ... dot followed by zero or more digits
  • (?:\.\d*)? ... optional (zero or one) "decimal part"
  • (X=\d+(?:\.\d*)?) ... first group, X=number
  • .*? ... zero or more of any character (non-greedy)
  • \] ... ] symbol
  • $ ... end of string
  • \](.*?)$ ... second group, anything between ] and end of string
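The breakdown above can also be kept next to the pattern itself by compiling it with `re.VERBOSE`, which ignores whitespace and allows `#` comments inside the regex. A minimal sketch: the name `X_AND_TEXT` and the second test line (`tricky`) are invented for illustration and are not from the question.

```python
import re

X_AND_TEXT = re.compile(
    r"""
    (X=\d+(?:\.\d*)?)   # group 1: X= followed by a number with an optional decimal part
    .*?                 # any characters, as few as possible
    \]                  # the first closing bracket after the coordinate
    (.*?)$              # group 2: everything from there to the end of the string
    """,
    re.VERBOSE,
)

line = (
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
)
print(X_AND_TEXT.search(line).groups())
# ('X=250.44', 'DECEMBER 31,')

# Hypothetical line whose text itself contains brackets.
tricky = (
    "[Base Font : F, Font Size : 1.0, Font Weight : 0.0] "
    "[(X=10.5,Y=2.0) height=1.0 width=1.0]see [1] for details"
)
print(X_AND_TEXT.search(tricky).groups())
# ('X=10.5', 'see [1] for details')
```

Because `.*?` is non-greedy, the match stops at the first `]` after the coordinate, so brackets inside the trailing text are kept. Note that `\d+` would not match a negative coordinate such as `X=-5`; whether that can occur in the data is not stated in the question.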

Upvotes: 2

tzaman

Reputation: 47870

Using regex with named groups to capture the relevant bits:

>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'
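The named groups make it easy to turn each line into a small record. The sketch below reuses the pattern from the session above; `PATTERN` and `parse_line` are hypothetical names, and converting the coordinate to `float` is an extra step the question did not ask for. Indexing a match with `m['name']` needs Python 3.6 or later.

```python
import re

# Same pattern as in the session above, compiled once for reuse.
PATTERN = re.compile(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$')


def parse_line(line):
    """Return {'x_coord': float, 'text': str}, or None if the line does not match."""
    m = PATTERN.search(line)
    if m is None:
        return None
    return {"x_coord": float(m["x_coord"]), "text": m["text"]}


line = (
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] "
    "[(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
)
print(parse_line(line))
# {'x_coord': 250.44, 'text': 'DECEMBER 31,'}
```

`m.groupdict()` would return both named groups as strings in one call if the numeric conversion is not needed.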

Upvotes: 2
