Travis Mitchell
Travis Mitchell

Reputation: 127

RegEx Catastrophic Backtracking

I'm trying to code a program to handle some data from some computations for me and my group of chemists. I need to be able to search through an output file that contains an excess of 100,000 lines with repeating units. I'm having trouble developing an expression to pull the data that I need. Here is a sample portion of the output file from which I need to extract data.

--------------------------------------------------------
6 0.934039 1.910373 -0.007356
6 0.522681 1.025902 1.132490
6 1.295895 0.175184 1.887754
6 2.745917 -0.059133 1.663889
6 3.251755 -1.317777 1.620723
6 2.354786 -2.503163 1.659226
6 2.500165 -3.540741 2.548878
16 1.200827 -4.674482 2.398451
6 0.450003 -3.761272 1.124048
6 1.191704 -2.653037 0.838883
1 0.889391 -1.914706 0.104995
6 -0.831020 -4.156701 0.527821
6 -1.110322 -3.881221 -0.814127
6 -2.349055 -4.235426 -1.332810
7 -3.309523 -4.841208 -0.630339
6 -3.035523 -5.105663 0.648864
6 -1.833017 -4.789011 1.269474
1 -1.692631 -5.008941 2.323469
1 -3.826405 -5.592591 1.215268
1 -2.579432 -4.028326 -2.375778
1 -0.365461 -3.412291 -1.448955
6 3.573349 -3.754217 3.574993
1 3.816975 -2.812111 4.074804
1 3.260231 -4.473156 4.337010
1 4.494421 -4.124194 3.111155
6 4.710061 -1.617156 1.557050
6 5.623546 -1.003416 2.421004
6 6.974181 -1.326012 2.370518
6 7.436595 -2.269579 1.456291
6 6.535520 -2.894414 0.598497
6 5.182590 -2.576469 0.655016
1 4.479434 -3.078256 -0.004163
1 6.885058 -3.634965 -0.114866
1 8.492504 -2.519869 1.416345
1 7.667827 -0.841306 3.051107
1 5.267975 -0.268736 3.136940
6 3.580944 1.164538 1.503126
6 4.497172 1.300372 0.454818
6 5.240591 2.465784 0.310519
6 5.081537 3.515835 1.211519
6 4.166656 3.395074 2.253939
6 3.418260 2.231401 2.393481
1 2.692677 2.145384 3.197718
1 4.033033 4.209217 2.960243
1 5.661476 4.426693 1.096865
1 5.945953 2.554610 -0.510417
1 4.622653 0.485776 -0.251950
6 0.528985 -0.506746 2.884231
6 -0.800876 -0.209971 2.864388
16 -1.134679 0.961768 1.625169
6 -1.858335 -0.812159 3.682689
6 -1.740268 -2.136749 4.118466
6 -2.752842 -2.684280 4.894687
7 -3.862544 -2.030644 5.248979
6 -3.974818 -0.772056 4.819134
6 -3.016393 -0.122197 4.049880
1 -3.167697 0.912109 3.755521
1 -4.882415 -0.247888 5.110965
1 -2.672356 -3.711555 5.245040
1 -0.881625 -2.737495 3.833528
1 0.964846 -1.208304 3.585910
1 0.076921 2.185394 -0.628553
1 1.403779 2.830811 0.357050
1 1.665791 1.402320 -0.641683
Energy (Hartree) = -2216.64927779
Step: 34
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, NO, YES
--------------------------------------------------------

--------------------------------------------------------
6 0.934062 1.911021 -0.006793
6 0.522693 1.026243 1.132808
6 1.295849 0.175204 1.887786
6 2.745849 -0.059151 1.663825
6 3.251670 -1.317799 1.620645
6 2.354637 -2.503134 1.659254
6 2.499913 -3.540512 2.549163
16 1.200720 -4.674404 2.398679
6 0.450032 -3.761466 1.124001
6 1.191706 -2.653228 0.838749
1 0.889499 -1.915053 0.104660
6 -0.830799 -4.157197 0.527556
6 -1.109977 -3.881704 -0.814420
6 -2.348529 -4.236224 -1.333319
7 -3.308924 -4.842349 -0.631043
6 -3.035027 -5.106853 0.648170
6 -1.832715 -4.789884 1.268992
1 -1.692398 -5.009943 2.322969
1 -3.825839 -5.594087 1.214409
1 -2.578819 -4.029095 -2.376300
1 -0.365153 -3.412510 -1.449097
6 3.572924 -3.753599 3.575540
1 3.816308 -2.811319 4.075158
1 3.259759 -4.472389 4.337671
1 4.494142 -4.123544 3.111975
6 4.709955 -1.617231 1.556911
6 5.623512 -1.003434 2.420751
6 6.974133 -1.326076 2.370222
6 7.436469 -2.269740 1.456057
6 6.535326 -2.894639 0.598379
6 5.182409 -2.576651 0.654945
1 4.479201 -3.078478 -0.004148
1 6.884808 -3.635261 -0.114937
1 8.492367 -2.520070 1.416076
1 7.667829 -0.841314 3.050721
1 5.268008 -0.268672 3.136635
6 3.580909 1.164495 1.503041
6 4.497111 1.300286 0.454710
6 5.240549 2.465680 0.310363
6 5.081547 3.515754 1.211345
6 4.166716 3.395021 2.253815
6 3.418302 2.231369 2.393405
1 2.692762 2.145371 3.197682
1 4.033146 4.209176 2.960116
1 5.661485 4.426605 1.096636
1 5.945883 2.554473 -0.510601
1 4.622556 0.485666 -0.252039
6 0.528927 -0.506827 2.884183
6 -0.800849 -0.209681 2.864659
16 -1.134600 0.962406 1.625752
6 -1.858353 -0.811785 3.682977
6 -1.740809 -2.136677 4.117983
6 -2.753416 -2.684132 4.894214
7 -3.862669 -2.030122 5.249222
6 -3.974426 -0.771225 4.820153
6 -3.015920 -0.121422 4.050946
1 -3.166775 0.913141 3.757259
1 -4.881643 -0.246746 5.112609
1 -2.673367 -3.711662 5.243914
1 -0.882533 -2.737672 3.832462
1 0.964690 -1.208688 3.585621
1 0.076974 2.186246 -0.627940
1 1.403854 2.831315 0.357930
1 1.665819 1.403131 -0.641235
Energy (Hartree) = -2216.64927781
Step: 35
Scan 1 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): YES, YES, YES, YES
Optimized Parameters for Coordinate Value: 48.7864
--------------------------------------------------------

--------------------------------------------------------
6 0.928653 1.914728 -0.015952
6 0.523104 1.029664 1.125513
6 1.299323 0.175435 1.873723
6 2.746979 -0.062752 1.638872
6 3.250582 -1.322711 1.610272
6 2.350949 -2.505931 1.653003
6 2.487964 -3.535693 2.553022
16 1.187646 -4.668406 2.403419
6 0.447689 -3.765331 1.115501
6 1.193508 -2.661056 0.825700
1 0.897892 -1.928805 0.083033
6 -0.829701 -4.163896 0.513583
6 -1.098964 -3.899654 -0.832683
6 -2.334534 -4.256414 -1.357129
7 -3.300974 -4.854584 -0.656318
6 -3.036531 -5.108382 0.627053
6 -1.837984 -4.788211 1.253491
1 -1.705454 -4.999308 2.310312
1 -3.832212 -5.589178 1.191978
1 -2.557135 -4.057999 -2.403474
1 -0.348825 -3.437415 -1.466208
6 3.553334 -3.741710 3.588769
1 3.794918 -2.795527 4.081850
1 3.233495 -4.453260 4.354918
1 4.477112 -4.117327 3.134952
6 4.708676 -1.625521 1.559405
6 5.617384 -1.005953 2.424241
6 6.967681 -1.331651 2.386026
6 7.434526 -2.284193 1.483422
6 6.538165 -2.914834 0.624932
6 5.185521 -2.593731 0.669206
1 4.485944 -3.099951 0.009606
1 6.891156 -3.662361 -0.079406
1 8.490177 -2.536916 1.453052
1 7.657559 -0.842289 3.067116
1 5.258339 -0.264251 3.131155
6 3.588296 1.158822 1.495804
6 4.455584 1.334979 0.412017
6 5.190024 2.506923 0.274817
6 5.065591 3.527255 1.214482
6 4.191377 3.371568 2.286682
6 3.451948 2.201349 2.419198
1 2.755016 2.090210 3.245500
1 4.081099 4.164280 3.020955
1 5.637641 4.443718 1.104935
1 5.859723 2.624932 -0.572054
1 4.551226 0.545487 -0.327323
6 0.537765 -0.505652 2.874875
6 -0.791246 -0.204606 2.865525
16 -1.130701 0.970006 1.630565
6 -1.844454 -0.804660 3.690867
6 -1.727716 -2.130536 4.123087
6 -2.736173 -2.676023 4.906078
7 -3.840730 -2.019138 5.270315
6 -3.951832 -0.759281 4.843904
6 -2.997115 -0.111294 4.068471
1 -3.146985 0.924153 3.777407
1 -4.855222 -0.232457 5.143904
1 -2.656671 -3.704306 5.253685
1 -0.873433 -2.733722 3.830294
1 0.976622 -1.209660 3.572222
1 0.067886 2.192910 -0.630666
1 1.403422 2.833365 0.346506
1 1.654564 1.405702 -0.656173
Energy (Hartree) = -2216.64908578
Step: 1
Scan 2 out of 73
Converged(Max Force, RMS Force, Max Disp, RMS Disp): NO, YES, NO, NO
--------------------------------------------------------

Each section contains a set of coordinates for elements of a molecule, the total energy for the molecule, the step for the computational scan, and convergence criteria. If all four convergence criterion are met, the optimized coordinate scan value is added. The data I need to extract are the coordinates, the total energy, the scan number, and coordinate value for a converged block only!

I've tried tirelessly to develop a suitable expression to use for extracting the required sets of data. Currently, my RegEx code looks like the following:

\-*?\n(\d{1,2}(?:\s+[+-]?\d.*?)+)?\n\w+.*?([+-]?\d+\.\d+)\n\w.*?\n\w+\s(\d+).*?\n\w.*\n

That code is able to capture the entire section for coordinates, the energy, and the step number. Every time I try to go to the next line using \w, I immediately receive a catastrophic backtracking error. it's imperative that I have that next line in the expression as it's what differentiates the desired block of data over the other. I'm not terribly great with python or RegEx, and I'm requesting help. Once I have the correct expression, I'll be using a nested for loop to extract all of the data I need!

Demo

If there are any other questions I can answer to better describe my situation, please let me know! An explanation of what you do to help will be much appreciated as I want to learn as much as I can! Thank you for your help in advance!

Upvotes: 0

Views: 203

Answers (1)

The fourth bird
The fourth bird

Reputation: 163527

One option is to make the pattern more specific and match the exact words instead of using \w+ and .*?

Based on the example data, if you want to capture the values for the coordinates, the total energy, the scan number you could use 3 capturing groups:

-+\r?\n(\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*(?:\r?\n\d{1,2}(?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)*)*)\r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+([+-]?\d+\.\d+)\r?\nStep:[^\S\r\n]+(\d+)

Explanation

  • -+\r?\n
  • ( Caputure group 1
    • \d{1,2} Match 1-2 digits
    • (?: Non capture group
      • [^\S\r\n]+[+-]?\d+(?:\.\d+)? Repeat 1+ spaces and a digit with an optional decimal part
    • )* Close group and repeat 0+ times
    • (?: Non capture group
      • \r?\n\d{1,2} Match a newline and 1-2 digits
      • (?:[^\S\r\n]+[+-]?\d+(?:\.\d+)?)* Repeat 0+ times matching spaced followed by a digit with an optional decimal part
    • )* Close group and repeat 0+ times
  • ) Close group 1
  • \r?\nEnergy[^\S\r\n]+\([^[()]+\)[^\S\r\n]+=[^\S\r\n]+
  • ( Capture group 2
    • [+-]?\d+\.\d+ Match optional - or + and 1+ digit with a decimal part
  • ) Close group 2
  • \r?\nStep:[^\S\r\n]+
  • (\d+) Capture group 3, match 1+ digits

Regex demo

Note that \s could also match a newline. To match whitespace chars without a newline, you could use a negated character class [^\S\r\n]

Upvotes: 1

Related Questions