Reputation: 185
I'm trying to get the (multi-line) content between two delimiters using regex.
This is a snippet of the large file that I'm parsing:
---------------------------------
CARTESIAN COORDINATES (ANGSTROEM)
---------------------------------
Co 0.000000 0.000000 0.000000
O 4.000000 0.000000 0.000000
H 4.584210 0.809570 0.000000
H 4.583362 -0.810106 -0.001552
----------------------------
CARTESIAN COORDINATES (A.U.)
----------------------------
NO LB ZA FRAG MASS X Y Z
0 Co 27.0000 0 58.930 0.000000 0.000000 0.000000
1 O 8.0000 0 15.999 7.558905 0.000000 0.000000
2 H 1.0000 0 1.008 8.662901 1.529866 0.000000
3 H 1.0000 0 1.008 8.661299 -1.530878 -0.002933
--------------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 4.000000000000 0.00000000 0.00000000
H 2 1 0 0.998351400325 125.81533088 0.00000000
H 2 1 3 0.998290967411 125.75782425 180.10977070
---------------------------
INTERNAL COORDINATES (A.U.)
---------------------------
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 7.558904535685 0.00000000 0.00000000
H 2 1 0 1.886610732031 125.81533088 0.00000000
H 2 1 3 1.886496530374 125.75782425 180.10977070
I'm interested only in the "INTERNAL COORDINATES (ANGSTROEM)" section (only the atoms and coordinates). So, this is what I want to keep:
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 4.000000000000 0.00000000 0.00000000
H 2 1 0 0.998351400325 125.81533088 0.00000000
H 2 1 3 0.998290967411 125.75782425 180.10977070
This is my regular expression:
r"INTERNAL COORDINATES \(ANGSTROEM\)\n(--------------------------------)\n([\s\S]*?)\n(---------------------------)"
And this is my code, so far:
import re
import pandas as pd
with open(input_path, "r") as inp:
    inp_content = inp.read()
int_coord = r"INTERNAL COORDINATES \(ANGSTROEM\)\n(--------------------------------)\n([\s\S]*?)\n(---------------------------)"
coord_matches = re.finditer(int_coord, inp_content, re.MULTILINE)
for i in coord_matches:
    my_var = i.group(0)
    print(my_var)
The problem is that I'm getting the section INCLUDING the delimiters, like this:
---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 1.200000000000 0.00000000 0.00000000
H 2 1 0 0.998351400325 125.81533088 0.00000000
H 2 1 3 0.998290967411 125.75782425 180.10977070
---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 1.100000000000 0.00000000 0.00000000
H 2 1 0 0.998351400325 125.81533088 0.00000000
H 2 1 3 0.998290967411 125.75782425 180.10977070
---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
Co 0 0 0 0.000000000000 0.00000000 0.00000000
O 1 0 0 1.000000000000 0.00000000 0.00000000
H 2 1 0 0.998351400325 125.81533088 0.00000000
H 2 1 3 0.998290967411 125.75782425 180.10977070
---------------------------
How can I get only the atomic coordinates?
Thanks in advance for any help.
Upvotes: 1
Views: 44
Reputation: 613
Your current regex is slightly off. More importantly, your Python code reads group(0), which always returns the full match, delimiters included. Instead, put the part you want in a capturing group and read that group back, here group(1). This modified regex captures only the coordinate lines:
INTERNAL COORDINATES \(ANGSTROEM\)\n(?:--------------------------------)\n((?:(?!-+)[\s\S])*)\n(?:---------------------------)
And this python code demo
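The following is a minimal sketch of that approach, not the demo the answer originally linked to. It assumes the dash runs are 32 and 27 characters long (written here as counted quantifiers instead of literal dashes), and the `sample` string is a hypothetical stand-in that mirrors the excerpt in the question. The only change that matters relative to the question's code is reading `group(1)` instead of `group(0)`.

```python
import re

# Pattern from the answer, with the dash runs written as -{32} and -{27}
# (assumed lengths). Group 1 is the only capturing group and holds just
# the coordinate lines.
PATTERN = re.compile(
    r"INTERNAL COORDINATES \(ANGSTROEM\)\n"
    r"(?:-{32})\n"
    r"((?:(?!-+)[\s\S])*)\n"
    r"(?:-{27})"
)

# Hypothetical sample mirroring the excerpt in the question.
sample = (
    "-" * 32 + "\n"
    "INTERNAL COORDINATES (ANGSTROEM)\n"
    + "-" * 32 + "\n"
    "Co 0 0 0 0.000000000000 0.00000000 0.00000000\n"
    "O 1 0 0 4.000000000000 0.00000000 0.00000000\n"
    "H 2 1 0 0.998351400325 125.81533088 0.00000000\n"
    "H 2 1 3 0.998290967411 125.75782425 180.10977070\n"
    + "-" * 27 + "\n"
    "INTERNAL COORDINATES (A.U.)\n"
    + "-" * 27 + "\n"
)

# group(0) would be the whole match including the header and dashes;
# group(1) is only the text captured between the two delimiter lines.
blocks = [m.group(1) for m in PATTERN.finditer(sample)]

for block in blocks:
    print(block)
```

One caveat with this pattern: `(?!-+)` stops the capture at any hyphen, not only at a delimiter line, so a block that contains a negative value (for example a negative dihedral angle) would not match at all. If that can happen in the real file, the question's lazy `[\s\S]*?` followed by `\n` and the closing dashes may be the safer choice, read from whichever group wraps it.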
Upvotes: 2