Reputation: 333

Extract information from the string and convert to list

I have a string like the one below:

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)

I want to extract the value of "X" and the text associated with it, and convert them to a list. Please see the expected output below:

Expected Output:

['X=250.44','DECEMBER 31,']
['X=307.5','respectively. The net decrease in the revenue']
['X=49.5','(US$ in millions)']

How can we approach this in Python?

My approach:

mylist = []
for line in data.split("\n"):
    if line.strip():
        x_coord = re.findall('^(X=.*)\,$', line)
        text = re.findall('^(]\w +)', line)
        mylist.append([x_coord, text])

My approach does not identify any value for x_coord and text.

Upvotes: 3

Answers (3)

jupiterbjy

Reputation: 3550

Try this:

(X=[^,]*)(?:.*])(.*)

import re

source = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,
[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue
[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)""".split('\n')

pattern = r"(X=[^,]*)(?:.*])(.*)"

for line in source:
    print(re.search(pattern, line).groups())

Output:

('X=250.44', 'DECEMBER 31,')
('X=307.5', 'respectively. The net decrease in the revenue')
('X=49.5', '(US$ in millions)')

Your expected output keeps X= in front of every value, so I left it inside the capture group. If you only want the number, move X= outside the group (or into a non-capturing group).
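If you need lists rather than tuples, as in the expected output, and the input may contain blank lines as in the question, a minimal sketch could look like the following. The helper name extract_pairs is hypothetical, and the sample data is shortened to two of the question's lines with a blank line between them.

```python
import re

# Same pattern as in this answer, compiled once for reuse.
PATTERN = re.compile(r"(X=[^,]*)(?:.*])(.*)")

def extract_pairs(data):
    """Return an [x_coord, text] list for every line that matches (hypothetical helper)."""
    pairs = []
    for line in data.splitlines():
        match = PATTERN.search(line)
        if match is None:  # blank or unrelated line: skip instead of raising
            continue
        pairs.append(list(match.groups()))
    return pairs

data = """[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,

[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)"""

for pair in extract_pairs(data):
    print(pair)
```

Checking for None before calling groups() avoids the AttributeError that re.search(...).groups() would raise on a line without a match.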

Upvotes: 2

Jan Stránský

Reputation: 1691

A solution using the re module:

import re

input = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue",
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

def extract(s):
    # Raw string so that \d, \. and \] reach the regex engine unchanged
    match = re.search(r"(X=\d+(?:\.\d*)?).*?\](.*?)$", s)
    return match.groups()

output = [extract(item) for item in input]
print(output)

Output:

[
    ('X=250.44', 'DECEMBER 31,'),
    ('X=307.5', 'respectively. The net decrease in the revenue'),
    ('X=49.5', '(US$ in millions)'),
]

Explanation:

\d ... digit
\d+ ... one or more digits
(?:...) ... non-capturing ("normal") parentheses
\.\d* ... dot followed by zero or more digits
(?:\.\d*)? ... optional (zero or one) "decimal part"
(X=\d+(?:\.\d*)?) ... first group, X=number
.*? ... zero or more of any character (non-greedy)
\] ... ] symbol
$ ... end of string
\](.*?)$ ... second group, anything between ] and end of string
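
The same breakdown can also live inside the pattern itself by compiling it with re.VERBOSE, which ignores whitespace in the pattern and allows # comments. This is only a sketch of an alternative layout; the names PATTERN and sample are illustrative, and the regex is meant to be the same as the one in extract above.

```python
import re

# Same regex as in the answer, spread over several lines with comments.
PATTERN = re.compile(
    r"""
    (X=\d+(?:\.\d*)?)   # group 1: X= followed by a number with an optional decimal part
    .*?                 # skip ahead as little as possible ...
    \]                  # ... up to the next closing bracket
    (.*?)$              # group 2: everything from there to the end of the string
    """,
    re.VERBOSE,
)

sample = "[Base Font : IOFOEO+Imago-Book, Font Size : 3.876, Font Weight : 0.0] [(X=307.5,Y=240.48499) height=3.876 width=2.9970093]respectively. The net decrease in the revenue"

print(PATTERN.search(sample).groups())
```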

Upvotes: 2

tzaman

Reputation: 47870

Using regex with named groups to capture the relevant bits:

>>> import re
>>> line = "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,"
>>> m = re.search(r'(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$', line)
>>> m.groups()
('250.44', 'DECEMBER 31,')
>>> m['x_coord']
'250.44'
>>> m['text']
'DECEMBER 31,'
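
To run the named-group pattern over several lines, one option is to collect match.groupdict() for each line that matches. The sketch below assumes the lines are already split into a list; NAMED, lines and records are illustrative names.

```python
import re

# Same named-group pattern as in the session above, compiled once.
NAMED = re.compile(r"(?:\(X=)(?P<x_coord>.*?)(?:,.*])(?P<text>.*)$")

lines = [
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=250.44,Y=223.48499) height=3.5324998 width=4.2910004]DECEMBER 31,",
    "",  # blank lines simply produce no match
    "[Base Font : IOHLGA+Trebuchet, Font Size : 3.5324998, Font Weight : 0.0] [(X=49.5,Y=233.98499) height=3.5324998 width=2.5690002](US$ in millions)",
]

# One dict per matching line, keyed by group name.
records = [m.groupdict() for m in map(NAMED.search, lines) if m]

for record in records:
    print(record["x_coord"], "->", record["text"])
```

Because the values are looked up by name, the output does not depend on the order of the groups in the pattern.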

Upvotes: 2
