AknarTrebna

Reputation: 23

NiFi: Grabbing Multiple Regex Matches (Into an Attribute Using ExtractText?)

Sample of what I'm processing:

<doc_filename>file1.docx</doc_filename>
...other data...
<doc_filename>file2.ppx</doc_filename>
...other data...
...more doc_filenames...

I need to extract the text between each <doc_filename> and </doc_filename> pair. My current attempt uses an ExtractText processor with this regex:

[<][d][o][c][_][f][i][l][e][n][a][m][e][>](.*<)[/][d][o][c][_][f][i][l][e][n][a][m][e][>].*

This works fine if there is only one <doc_filename> element, but with more than one it captures far past the first closing tag. I have done a lot of googling and can't find a way to do this. Am I missing something, or do I need a Groovy script to do all of the processing here?

Note: I'm using these filenames later for further processing.

Thanks!

Upvotes: 1

Views: 2549

Answers (1)

Andy

Reputation: 14194

On the ExtractText processor, set Include Capture Group 0 to false and Enable repeating capture group to true. Then add a dynamic property (click the '+' at the top right) with the property name doc_filename (or any name you like) and the value (?<=<doc_filename>)(.*?)(?=</doc_filename>).

The regex works as follows:

(?<=<doc_filename>) // Look-behind: the match must be preceded by the opening tag
(.*?)               // Capture any characters, lazily (as few as possible)
(?=</doc_filename>) // Look-ahead: the match must be followed by the closing tag
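
To see the difference between a greedy capture and the lazy look-around pattern outside of NiFi, here is a minimal sketch. It uses Python's re module as a stand-in for the Java regex engine behind ExtractText, so treat it as an illustration rather than a NiFi test. The sample string is a one-line version of the input from the question, which is an assumption about the layout, and the greedy pattern is a simplified form of the one in the question.

```python
import re

# One-line version of the sample input (assumed layout, for illustration only).
sample = (
    "<doc_filename>file1.docx</doc_filename>...other data..."
    "<doc_filename>file2.ppx</doc_filename>...other data..."
)

# Greedy capture (simplified from the question): .* runs to the LAST closing tag.
greedy = re.compile(r"<doc_filename>(.*)</doc_filename>")

# Lazy capture with look-arounds (from the answer): stops at the FIRST closing tag.
lazy = re.compile(r"(?<=<doc_filename>)(.*?)(?=</doc_filename>)")

greedy_matches = greedy.findall(sample)
lazy_matches = lazy.findall(sample)

print(greedy_matches)  # a single match that swallows both tags
print(lazy_matches)    # one match per <doc_filename> element
```

The greedy pattern yields a single match spanning from the first opening tag to the last closing tag, while the lazy pattern yields one match per element.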

The resulting output, based on the example input you provided, will be as follows (note that doc_filename and doc_filename.1 both hold the first match):

2019-01-25 13:15:55,379 INFO [Timer-Driven Process Thread-5] o.a.n.processors.standard.LogAttribute LogAttribute[id=01681000-d047-1f22-14da-f93157703ba1] logging for flow file StandardFlowFileRecord[uuid=6908e84b-182d-4ffc-95e4-2efe5af00911,claim=StandardContentClaim [resourceClaim=StandardResourceClaim[id=1548450895799-1, container=default, section=1], offset=1233, length=137],offset=0,name=6908e84b-182d-4ffc-95e4-2efe5af00911,size=137]
--------------------------------------------------
Standard FlowFile Attributes
Key: 'entryDate'
    Value: 'Fri Jan 25 13:15:55 PST 2019'
Key: 'lineageStartDate'
    Value: 'Fri Jan 25 13:15:55 PST 2019'
Key: 'fileSize'
    Value: '137'
FlowFile Attribute Map Content
Key: 'doc_filename'
    Value: 'file1.docx'
Key: 'doc_filename.1'
    Value: 'file1.docx'
Key: 'doc_filename.2'
    Value: 'file2.ppx'
Key: 'filename'
    Value: '6908e84b-182d-4ffc-95e4-2efe5af00911'
Key: 'path'
    Value: './'
Key: 'uuid'
    Value: '6908e84b-182d-4ffc-95e4-2efe5af00911'
--------------------------------------------------
<doc_filename>file1.docx</doc_filename>
...other data...
<doc_filename>file2.ppx</doc_filename>
...other data...
...more doc_filenames...
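
Since the filenames are needed for later processing, it can help to see how the numbered attributes line up with the matches. The sketch below only mimics the attribute names visible in the log above (doc_filename, doc_filename.1, doc_filename.2). It is not NiFi code, and to_attributes and numbered_values are hypothetical helper names chosen for illustration.

```python
import re

PATTERN = re.compile(r"(?<=<doc_filename>)(.*?)(?=</doc_filename>)")


def to_attributes(content, name="doc_filename"):
    """Mimic the attribute names seen in the log: name, name.1, name.2, ..."""
    matches = [m.group(1) for m in PATTERN.finditer(content)]
    attributes = {}
    if matches:
        # The bare property name repeats the first match, as in the log.
        attributes[name] = matches[0]
    for index, value in enumerate(matches, start=1):
        attributes[f"{name}.{index}"] = value
    return attributes


def numbered_values(attributes, name="doc_filename"):
    """Collect name.1, name.2, ... in order, skipping the bare-name duplicate."""
    values = []
    index = 1
    while f"{name}.{index}" in attributes:
        values.append(attributes[f"{name}.{index}"])
        index += 1
    return values


content = (
    "<doc_filename>file1.docx</doc_filename>\n"
    "...other data...\n"
    "<doc_filename>file2.ppx</doc_filename>\n"
    "...other data...\n"
)

attrs = to_attributes(content)
filenames = numbered_values(attrs)
print(attrs)
print(filenames)
```

Reading only the numbered keys avoids counting the first filename twice.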

Upvotes: 3
