Nikhil

Reputation: 621

RegEx works everywhere except in Pentaho RegEx Evaluation Step

I have a couple of regular expressions that work on the online regex testers but not in the Pentaho Regex Evaluation step. Could you please help?

Here's the string:

:6585d0f0ba88767ac3b590f719596d864d73e9c1:

harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp
harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp
:8302994b565553c83a048b8905ae597349d99627:

emp/src/emp/PhasePairSingleParticleReynoldsNumber.h
emp/src/emp/TomiyamaDragCoefficientMethod.cpp
:9da194f17ec08bb20ad1be8df68b78ca137ab18a:

combustion/src/combustion/ReactingSpeciesTransportBasedModel.cpp
combustion/src/complexchemistry/TurbulentFlameClosure.cpp
:6a59f0be1e347a65e525e58742bb304639ea9bc4:

meshing/src/meshing/SurfaceMeshManipulation.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.cpp
physics/src/discretization/FvIndirectRegionInterfaceManager.h
physics/src/discretization/FvRepresentation.cpp
physics/src/discretization/FvRepresentation.h
:64b7f6d36b11b6cd94c20cad53463b7deef8c85a:

resourceclient/src/resourceclient/ResourcePool.cpp
resourceclient/src/resourceclient/ResourcePool.h
resourceclient/src/resourceclient/RestClient.cpp
resourceclient/src/resourceclient/RestClient.h
resourceclient/src/resourceclient/test/ResourcePoolTest.cpp

I would like to capture two groups. The first group should extract each commit SHA1 and the second group should extract the file names.

Below are the expressions I tried:

(?:^:([A-Za-z0-9]+):|(?!^)\G)\n+([A-Za-z/.-]+)

https://regex101.com/r/3IBkPz/1

^:(\w+):\s+((?:\s*(?!:)[^\s]+)+)

https://regex101.com/r/oIoDvM/1

Thoughts?

Upvotes: 0

Views: 4746

Answers (1)

jxc

Reputation: 13998

AFAIK (as of PDI-8.0), the Regex Evaluation step does NOT support the regex 'g' modifier. Your regex pattern must cover all of the text in order to produce a match.

For example: the following pattern will not match anything in Regex Evaluation step:

:([0-9a-f]+):\s+([^:]+) 

but if I prepend .* to this pattern and pick "Enable dotall mode":

.*:([0-9a-f]+):\s+([^:]+)

it will match the last commit (sha1 + filenames). You can try moving .* to the end of the original pattern, which will get you the first commit instead. So if you want to retrieve the full list of commits (sha1 + filenames), as the g modifier gives you on the regex websites, this step is probably not a solution for you.
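To see the difference outside of PDI, here is a minimal Python sketch. It assumes the step behaves like a whole-string match, which `re.fullmatch` stands in for here, while `re.finditer` plays the role of the g modifier. `SAMPLE` is a shortened copy of the string from the question, and the helper names `whole_string_match` and `all_matches` are made up for this example.

```python
import re

# Shortened copy of the input from the question (first two commits only).
SAMPLE = (
    ":6585d0f0ba88767ac3b590f719596d864d73e9c1:\n"
    "\n"
    "harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp\n"
    "harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp\n"
    ":8302994b565553c83a048b8905ae597349d99627:\n"
    "\n"
    "emp/src/emp/PhasePairSingleParticleReynoldsNumber.h\n"
    "emp/src/emp/TomiyamaDragCoefficientMethod.cpp\n"
)

COMMIT = r":([0-9a-f]+):\s+([^:]+)"


def whole_string_match(pattern, text):
    """Assumed stand-in for the step: the pattern must cover the entire text.

    re.DOTALL mirrors the "Enable dotall mode" checkbox.
    """
    m = re.fullmatch(pattern, text, flags=re.DOTALL)
    if m is None:
        return None
    return m.group(1), m.group(2).split()


def all_matches(pattern, text):
    """What the g modifier does: collect every non-overlapping match."""
    return [(m.group(1), m.group(2).split()) for m in re.finditer(pattern, text)]


print(whole_string_match(COMMIT, SAMPLE))         # no match: pattern stops before the 2nd commit
print(whole_string_match(".*" + COMMIT, SAMPLE))  # last commit
print(whole_string_match(COMMIT + ".*", SAMPLE))  # first commit
print(len(all_matches(COMMIT, SAMPLE)))           # one entry per commit
```

With a whole-string match, the greedy .* decides which single commit the two groups end up holding, so at most one commit comes back per input row.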

As the fields are basically separated by colons ':' and newlines, you can probably try the following approach:

  1. Use the Split field to rows step with Delimiter=':' and include rownum in the output. This rownum can be used to filter rows: even numbers are sha1 and odd numbers are filenames

  2. Use the Analytic Query step to create a new field with LEAD = 1, so that you get sha1 and filenames in the same row

  3. Use the Calculator and Filter rows steps to calculate the remainder of rownum/2 and keep only the rows with an odd rownum

  4. Use Split field to rows again to split filenames into individual filename rows using "\n" (tick "Delimiter is a Regular Expression"). You might want to filter out the EMPTY filename rows, since the delimiter only supports one char
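The four steps above can be sketched in plain Python to check the logic before building the transformation. This only illustrates the data flow and is not PDI code: the function name `commits_to_rows` is made up, and it assumes the leading empty piece before the first ':' is kept and numbered as rownum 1, as Python's `str.split` does. Under that assumption the sha1 values land on even rownums, so the sketch keeps the even rows after applying LEAD. In PDI the parity may come out differently, so check the actual rownum values in a preview.

```python
def commits_to_rows(text):
    """Return (sha1, filename) pairs, mimicking the four steps above."""
    # Step 1: split on ':' and number the pieces starting at 1 (rownum).
    pieces = list(enumerate(text.split(":"), start=1))

    # Step 2: LEAD = 1, i.e. attach the next row's value to the current row.
    with_lead = []
    for i, (rownum, value) in enumerate(pieces):
        lead = pieces[i + 1][1] if i + 1 < len(pieces) else None
        with_lead.append((rownum, value, lead))

    # Step 3: keep only the rows that hold a sha1 (even rownum in this sketch).
    sha_rows = [
        (value, lead)
        for rownum, value, lead in with_lead
        if rownum % 2 == 0 and lead is not None
    ]

    # Step 4: split the filenames on newlines and drop the empty entries.
    rows = []
    for sha, filenames in sha_rows:
        for name in filenames.split("\n"):
            if name.strip():
                rows.append((sha, name.strip()))
    return rows


# Shortened copy of the input from the question.
sample = (
    ":6585d0f0ba88767ac3b590f719596d864d73e9c1:\n\n"
    "harmonicbalance/src/harmonicbalance/HarmonicBalanceFlowModel.cpp\n"
    "harmonicbalance/src/harmonicbalance/HbFlutterModel.cpp\n"
    ":8302994b565553c83a048b8905ae597349d99627:\n\n"
    "emp/src/emp/PhasePairSingleParticleReynoldsNumber.h\n"
    "emp/src/emp/TomiyamaDragCoefficientMethod.cpp\n"
)

for sha, name in commits_to_rows(sample):
    print(sha, name)
```

Each output row pairs one filename with the sha1 of its commit, which is the shape the two capture groups were meant to produce.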

Upvotes: 2
