HCSthe2nd

Reputation: 185

RegEx - Getting multiline content between two delimiters

I'm trying to get the (multi-line) content between two delimiters using regex.

This is a snippet of the large file that I'm parsing:

---------------------------------
CARTESIAN COORDINATES (ANGSTROEM)
---------------------------------
  Co     0.000000    0.000000    0.000000
  O      4.000000    0.000000    0.000000
  H      4.584210    0.809570    0.000000
  H      4.583362   -0.810106   -0.001552

----------------------------
CARTESIAN COORDINATES (A.U.)
----------------------------
  NO LB      ZA    FRAG     MASS         X           Y           Z
   0 Co   27.0000    0    58.930    0.000000    0.000000    0.000000
   1 O     8.0000    0    15.999    7.558905    0.000000    0.000000
   2 H     1.0000    0     1.008    8.662901    1.529866    0.000000
   3 H     1.0000    0     1.008    8.661299   -1.530878   -0.002933

--------------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     4.000000000000     0.00000000     0.00000000
 H      2   1   0     0.998351400325   125.81533088     0.00000000
 H      2   1   3     0.998290967411   125.75782425   180.10977070

---------------------------
INTERNAL COORDINATES (A.U.)
---------------------------
 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     7.558904535685     0.00000000     0.00000000
 H      2   1   0     1.886610732031   125.81533088     0.00000000
 H      2   1   3     1.886496530374   125.75782425   180.10977070

I'm interested only in the "INTERNAL COORDINATES (ANGSTROEM)" section (only the atoms and coordinates). So, this is what I want to keep:

 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     4.000000000000     0.00000000     0.00000000
 H      2   1   0     0.998351400325   125.81533088     0.00000000
 H      2   1   3     0.998290967411   125.75782425   180.10977070

This is my regular expression:

r"INTERNAL COORDINATES \(ANGSTROEM\)\n(--------------------------------)\n([\s\S]*?)\n(---------------------------)"

And this is my code, so far:

import re
import pandas as pd

with open(input_path, "r") as inp:
    inp_content = inp.read()

    int_coord = r"INTERNAL COORDINATES \(ANGSTROEM\)\n(--------------------------------)\n([\s\S]*?)\n(---------------------------)"
    coord_matches = re.finditer(int_coord, inp_content, re.MULTILINE)

    for i in coord_matches:
        my_var = i.group(0)
        print(my_var)

The problem is that I'm getting the section INCLUDING the delimiters, like this:

---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     1.200000000000     0.00000000     0.00000000
 H      2   1   0     0.998351400325   125.81533088     0.00000000
 H      2   1   3     0.998290967411   125.75782425   180.10977070

---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     1.100000000000     0.00000000     0.00000000
 H      2   1   0     0.998351400325   125.81533088     0.00000000
 H      2   1   3     0.998290967411   125.75782425   180.10977070

---------------------------
INTERNAL COORDINATES (ANGSTROEM)
--------------------------------
 Co     0   0   0     0.000000000000     0.00000000     0.00000000
 O      1   0   0     1.000000000000     0.00000000     0.00000000
 H      2   1   0     0.998351400325   125.81533088     0.00000000
 H      2   1   3     0.998290967411   125.75782425   180.10977070

---------------------------

How can I get only the atomic coordinates?

Thanks in advance for any help.

Upvotes: 1

Views: 44

Answers (1)

Silvanas

Reputation: 613

Your current regex is slightly off. More importantly, you are reading group(0) in your Python code, which always returns the full match, delimiters included. Instead, put the part you want in a capturing group and read that group back with group(1). Here is a modified regex that returns just the coordinate lines:

INTERNAL COORDINATES \(ANGSTROEM\)\n(?:--------------------------------)\n((?:(?!-+)[\s\S])*)\n(?:---------------------------)

Check this Demo

And this python code demo
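For reference, here is a minimal sketch of how the regex above could be used from Python. The sample text is abbreviated from the question, the function name extract_internal_coords is hypothetical, and the dash runs are written as -{32} and -{27} on the assumption that the rules in the file are exactly that long (which is what the literal dashes in the regex above spell out).

```python
import re

# Abbreviated stand-in for the file content shown in the question.
# Assumption: the rule under the ANGSTROEM header is 32 dashes and the
# rule above the following A.U. header is 27 dashes.
sample = (
    "-" * 32 + "\n"
    "INTERNAL COORDINATES (ANGSTROEM)\n"
    + "-" * 32 + "\n"
    " Co     0   0   0     0.000000000000     0.00000000     0.00000000\n"
    " O      1   0   0     4.000000000000     0.00000000     0.00000000\n"
    " H      2   1   0     0.998351400325   125.81533088     0.00000000\n"
    " H      2   1   3     0.998290967411   125.75782425   180.10977070\n"
    "\n"
    + "-" * 27 + "\n"
    "INTERNAL COORDINATES (A.U.)\n"
    + "-" * 27 + "\n"
)

# Same structure as the regex in this answer: the delimiters sit in
# non-capturing groups, so group(1) holds only the coordinate lines.
PATTERN = re.compile(
    r"INTERNAL COORDINATES \(ANGSTROEM\)\n(?:-{32})\n"
    r"((?:(?!-+)[\s\S])*)\n(?:-{27})"
)


def extract_internal_coords(text):
    """Return one string per matched block, without the delimiter lines."""
    # group(0) would be the whole match; group(1) is the captured body.
    return [m.group(1).strip("\n") for m in PATTERN.finditer(text)]


blocks = extract_internal_coords(sample)

# Split each block into rows of whitespace-separated fields.
rows = [line.split() for line in blocks[0].splitlines()]

for block in blocks:
    print(block)
```

One caveat with this pattern: the (?!-+) lookahead stops at any hyphen, so a block containing a negative value (for example a negative dihedral angle) would not match. If your files can contain negative numbers, a lazy ([\s\S]*?) in that position, read back with group(1), may be the safer choice.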

Upvotes: 2
