Reputation: 689
I write a regular expression that is
<w:p.*>\[.*content.*\].*</w:p>
It is working fine. But sometime matches non require tag.
I've found a string from word processing like as
<w:p w:rsidR=‘00E52FD7’ w:rsidRDefault=‘00341592’ w:rsidP=‘000307E7’><w:pPr><w:pStyle w:val=‘Heading1’/><w:contextualSpacing w:val=‘0’/><w:jc w:val=‘center’/></w:pPr><w:r><w:rPr><w:noProof/></w:rPr><w:drawing><wp:inline distT=‘0’ distB=‘0’ distL=‘0’ distR=‘0’ wp14:anchorId=‘4F64B28D’ wp14:editId=‘6522B16C’><wp:extent cx=‘1358306’ cy=‘1343025’/><wp:effectExtent l=‘0’ t=‘0’ r=‘0’ b=‘0’/><wp:docPr id=‘2’ name=‘Picture 2’ descr=‘N:\HUMAN RESOURCES\Logos\Rancho-Logo-Type-Black.png’/><wp:cNvGraphicFramePr><a:graphicFrameLocks xmlns:a=‘http://schemas.openxmlformats.org/drawingml/2006/main’ noChangeAspect=‘1’/></wp:cNvGraphicFramePr><a:graphic xmlns:a=‘http://schemas.openxmlformats.org/drawingml/2006/main’><a:graphicData uri=‘http://schemas.openxmlformats.org/drawingml/2006/picture’><pic:pic xmlns:pic=‘http://schemas.openxmlformats.org/drawingml/2006/picture’><pic:nvPicPr><pic:cNvPr id=‘0’ name=‘Picture 1’ descr=‘N:\HUMAN RESOURCES\Logos\Rancho-Logo-Type-Black.png’/><pic:cNvPicPr><a:picLocks noChangeAspect=‘1’ noChangeArrowheads=‘1’/></pic:cNvPicPr></pic:nvPicPr><pic:blipFill><a:blip r:embed=‘rId7’ cstate=‘print’><a:extLst><a:ext uri=‘{28A0092B-C50C-407E-A947-70E740481C1C}’><a14:useLocalDpi xmlns:a14=‘http://schemas.microsoft.com/office/drawing/2010/main’ val=‘0’/></a:ext></a:extLst></a:blip><a:srcRect/><a:stretch><a:fillRect/></a:stretch></pic:blipFill><pic:spPr bwMode=‘auto’><a:xfrm><a:off x=‘0’ y=‘0’/><a:ext cx=‘1374505’ cy=‘1359042’/></a:xfrm><a:prstGeom prst=‘rect’><a:avLst/></a:prstGeom><a:noFill/><a:ln><a:noFill/></a:ln></pic:spPr></pic:pic></a:graphicData></a:graphic></wp:inline></w:drawing></w:r></w:p><w:p w:rsidR=‘00341592’ w:rsidRPr=‘00341592’ w:rsidRDefault=‘002F27D8’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Subtitle’/><w:contextualSpacing w:val=‘0’/><w:rPr><w:sz w:val=‘36’/><w:szCs w:val=‘36’/></w:rPr></w:pPr><w:r><w:t xml:space=‘preserve’>Job Description: </w:t></w:r><w:r w:rsidR=‘00360E41’><w:t>Irrigation/</w:t></w:r><w:r w:rsidR=‘004A20D0’><w:t>Maintenance Worker</w:t></w:r></w:p><w:p w:rsidR=‘000307E7’ w:rsidRDefault=‘000307E7’ w:rsidP=‘000307E7’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:bookmarkStart w:id=‘0’ w:name=‘h.17ary2u5jp34’ w:colFirst=‘0’ w:colLast=‘0’/><w:bookmarkEnd w:id=‘0’/></w:p><w:p w:rsidR=‘00007B19’ w:rsidRDefault=‘00007B19’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘00533338’ w:rsidRDefault=‘000307E7’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:r><w:t xml:space=‘preserve’>Rancho has reviewed the duties described within this job description to ensure that essential functions and basic duties are included. It is not designed to cover or contain a comprehensive listing of activities, duties or responsibilities required of an incumbent. An incumbent may be asked to perform other duties as required or assigned by their supervisor. </w:t></w:r></w:p><w:p w:rsidR=‘00533338’ w:rsidRDefault=‘00533338’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘00710D42’ w:rsidRDefault=‘00710D42’ w:rsidP=‘00341592’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr></w:p><w:p w:rsidR=‘004618DB’ w:rsidRDefault=‘004618DB’ w:rsidP=‘004618DB’><w:pPr><w:pStyle w:val=‘Normal1’/></w:pPr><w:r><w:t>[</w:t></w:r><w:proofErr w:type=‘gramStart’/><w:r><w:t>content</w:t></w:r><w:proofErr w:type=‘gramEnd’/><w:r><w:t>]</w:t></w:r></w:p>
My requirement is selecting <w:p>
tag which contains
[content]
But this expression matches extra <w:p>
tag which not contains my require text.
Any one can help me?
Upvotes: 1
Views: 106
Reputation: 626896
It is advisable to use an XML parser if you have an XML file to deal with. If you have this short fragment only, and you need it to do a one-off task, you may use either of the two regex approaches.
Extract all matches you want and check which one contains [content]
, and only return that substring:
Regex.Matches(s, @"(?s)<w:p\b[^>]*>(.*?)</w:p>")
.Cast<Match>()
.Where(x => x.Groups[1].Value.Contains("[content]"))
.Select(z => z.Value);
Note that here, (?s)<w:p\b[^>]*>(.*?)</w:p>
matches <w:p
, then asserts there is no word char immediately to the right with a \b
word boundary, then matches the rest of the element by consuming 0+ chars other than >
and then >
, then it captures any 0+ chars, as few as possible, into Group 1 (x.Groups[1].Value
) and finally matches </w:p>
. The .Where(x => x.Groups[1].Value.Contains("[content]"))
condition only keeps those that contain [content]
in the inner XML part of the w:p
element.
Use a more sophisticated regex with a tempered greedy token:
(?s)<w:p\b[^>]*>(?:(?!<w:p\b).)*?\[content].*?</w:p>
Details
(?s)
- a RegexOptions.Singleline
inline option<w:p
- a <w:p
substring\b
- word boundary[^>]*
- 0+ chars other than >
>
- a >
(?:(?!<w:p\b).)*?
- any char, 0+ times but as few as possible, that is not a starting point for <w:p
followed with a word boundary sequence\[content]
- a [content]
substring.*?
- any 0+ chars, as few as possible</w:p>
- a literal </w:p>
substringUpvotes: 1