Reputation: 155
I have this string:
irn
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626
but sometimes, it will be
irn
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626
or
irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626
or
irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626
I am reading the contents of the PDF with Apache Tika and, on the output, I am using
re.findall(r'\w+',payload)
to pick up all the words and no other characters.
I am using this regex to match the above string:
irn(\s+?)(\w+\s+?)(([a-zA-Z0-9]{64})|([a-zA-Z0-9\s+]{65}))
This works fine for
irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626
and
irn
no
1b6d13bbbe6e0e4bd8e5d7619bf7997672bc42d1d2442b531a487f9061df2626
but for this case:
irn
1b6d13bbbe6e0e4bd8e5d7619bf7997672
bc42d1d2442b531a487f9061df2626
the 2nd group catches the 2nd line, and group 6 catches the 3rd line plus the lines that follow it, up to 64 characters.
Since I have no control over the data format in the PDF, can you please help me fix this?
The string always starts with "irn", then there may or may not be some words, and then comes the irn number, which is always 64 characters long.
Upvotes: 2
Views: 90
Reputation: 784998
You may use this regex, which makes the word on the 2nd line optional:
^irn[\r\n]+(?:(\w+)[\r\n]+)?([a-zA-Z0-9\r\n]{64,65})$
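In Python, the pattern might be applied roughly as follows. This is only a sketch: the re.MULTILINE flag is an assumption (so that ^ and $ also match at line boundaries inside a larger payload), extract_irn is a made-up helper name, and the sample strings are the four cases from the question.

```python
import re

# The pattern from this answer. re.MULTILINE is an assumption, added so that
# ^ and $ can match at line boundaries when "irn" is not at the very start
# of the extracted text.
IRN_RE = re.compile(
    r'^irn[\r\n]+(?:(\w+)[\r\n]+)?([a-zA-Z0-9\r\n]{64,65})$',
    re.MULTILINE,
)


def extract_irn(payload):
    """Return the 64-character number with any line break removed, or None.

    Hypothetical helper, not part of any library.
    """
    m = IRN_RE.search(payload)
    if m is None:
        return None
    # Group 2 may contain one line break when the number is split in two.
    return re.sub(r'[\r\n]', '', m.group(2))


FIRST = '1b6d13bbbe6e0e4bd8e5d7619bf7997672'
SECOND = 'bc42d1d2442b531a487f9061df2626'

# The four layouts shown in the question.
samples = [
    'irn\n' + FIRST + '\n' + SECOND,
    'irn\n' + FIRST + SECOND,
    'irn\nno\n' + FIRST + '\n' + SECOND,
    'irn\nno\n' + FIRST + SECOND,
]
results = [extract_irn(s) for s in samples]
```

Each entry in results should be the same 64-character number. Note that {64,65} leaves room for a single line-break character only, so a number split by \r\n (two characters) would not match as written.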
Explanation:
^irn[\r\n]+
: Match irn followed by 1+ line-break characters
(?:(\w+)[\r\n]+)?
: Optionally match 1+ word characters followed by 1+ line-break characters, and capture the word in group #1
([a-zA-Z0-9\r\n]{64,65})
: Match an alphanumeric character or a line-break character (\r or \n) 64 or 65 times, and capture this in group #2
$
: End
Upvotes: 2