Reputation: 691
I want to delete repeated occurrences of a string in a text file, leaving only the first instance.
Starting point:
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
Desired result:
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
I'll have to get rid of most of the instances of </topichead> after that, but once I get the first part working, those will be easy to match and delete.
So based on something I saw on this page, I wrote this:
<replaceregexp byline="false" flags="g">
<regexp pattern="(&lt;topichead.*&gt;)(r?\n\1)+"/>
<substitution expression="/1"/>
<fileset dir=".">
<include name="*.txt"/>
</fileset>
</replaceregexp>
However, it's not working. As a test, if I delete (r?\n\1)+ from the regexp pattern, just match all instances of (<topichead.*>), and simply replace them with XXX or whatever, that works. So I know things are linked up correctly. I've also tried just (\1)+ for that second group, but nothing has worked so far for the goal above. Any ideas welcome.
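A few details in the attempt above may be contributing, and they can be checked outside Ant. The sketch below uses Python's re module rather than Ant's Java regex engine (an assumption made for illustration; the constructs shown behave the same way in both), and the helper name first_group is hypothetical. It shows that a greedy .* makes group 1 swallow the whole line, that [^>]* limits the capture to the opening tag, and that r?\n without the backslash cannot step over a Windows line ending. Separately, Ant's substitution expression refers to group 1 as \1, so /1 would be inserted as literal text.

```python
import re

# One line from the first example in the question.
LINE = '<topichead navtitle="AAAA"><topicref href="____"/></topichead>'


def first_group(pattern: str, text: str) -> str:
    """Return what group 1 of `pattern` captures at the start of `text`."""
    match = re.match(pattern, text)
    return match.group(1) if match else ""


# Greedy .* runs to the last '>' on the line, so group 1 is the entire line.
greedy = first_group(r"(<topichead.*>)", LINE)

# [^>]* stops at the first '>', so group 1 is only the opening tag.
narrow = first_group(r"(<topichead[^>]*>)", LINE)

# "r?\n" means "an optional letter r, then a newline"; it cannot step over
# the carriage return in a CRLF line ending, whereas "\r?\n" can.
crlf = ">\r\n"
without_backslash = re.search(r">r?\n", crlf) is not None
with_backslash = re.search(r">\r?\n", crlf) is not None

print(greedy == LINE, narrow, without_backslash, with_backslash)
```

Because the backreference only matches text identical to what group 1 captured, capturing the whole line can only ever find exact duplicate lines, which stops working as soon as the href values differ.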
UPDATE
Updating this with a better example, since the one I gave was a little imprecise. What I actually need to do is more like this:
Starting point:
<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="ZYX"/></topichead>
<topichead navtitle="AAAA"><topicref href="XXYYZZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
<topichead navtitle="BBBB"><topicref href="yyYYZZXX"/></topichead>
<topichead navtitle="BBBB"><topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="ZZY"/></topichead>
<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>
<topichead navtitle="CCCC"><topicref href="ZZZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYYZZXX"/></topichead>
Desired result:
<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>
The href values (shown here as "XYZ", "XXYYZZ", and so on) are links that are all different (or may be different) and need to be preserved.
The hard part is getting rid of the duplicates after the first instance of, for example, <topichead navtitle="AAAA">.
If I could get to this result, as a first step:
<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
<topicref href="ZYX"/></topichead>
<topicref href="XXYYZZ"/></topichead>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
<topicref href="yyYYZZXX"/></topichead>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
<topicref href="ZZY"/></topichead>
<topicref href="XXZZY></topichead>
<topicref href="ZZZ"/></topichead>
<topicref href="YYYZZXX"/></topichead>
Then I can get rid of the unwanted trailing </topichead> entries easily, using this:
<replaceregexp byline="false" flags="gs">
<regexp pattern="&lt;/topichead&gt;\r\n&lt;topicref"/>
<substitution expression="${line.separator}&lt;topicref"/>
<fileset dir=".">
<include name="*.txt"/>
</fileset>
</replaceregexp>
...and get the desired result shown above.
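For checking that second step outside Ant, here is a minimal sketch in Python's re module (an assumption made for illustration; the build itself uses replaceregexp). The function name drop_inner_closers and the shortened sample text are hypothetical. Unlike the Ant pattern, it accepts either \n or \r\n and keeps whichever line ending it finds instead of inserting ${line.separator}.

```python
import re


def drop_inner_closers(text: str) -> str:
    """Remove a closing </topichead> only when the next line starts with <topicref."""
    # Group 1 holds the line ending so it can be put back unchanged; the
    # lookahead checks the next line without consuming it.
    return re.sub(r"</topichead>(\r?\n)(?=<topicref)", r"\1", text)


# Shortened version of the intermediate result described in the question.
step_one = (
    '<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>\n'
    '<topicref href="ZYX"/></topichead>\n'
    '<topicref href="YYYY"/></topichead>\n'
    '<topichead navtitle="BBBB"><topicref href="XX"/></topichead>\n'
)

print(drop_inner_closers(step_one))
```

Only the last entry of each group keeps its </topichead>, because that is the only closer followed by a new <topichead> line or by the end of the file.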
I'm doing it that way now, using search and replace for the first step, then following up with that replaceregexp. I have lots of long lists of these to do though, so it would be great to automate it all.
I've looked at many suggestions that were all essentially variations of using (\r?\n\1) as the core in different ways, but I've had no luck yet getting anything that does what I need.
Upvotes: 1
Views: 790
Reputation: 271
After your update, I got your point. It seems one line of your original input:
<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>
is likely meant to be:
<topichead navtitle="CCCC"><topicref href="XXZZY"/></topichead>
Then the solution would be as below:
<target name="test2">
<replaceregexp byline="false" flags="gs">
<regexp pattern="(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;)&lt;/topichead&gt;(?=.*\1)"/>
<substitution expression="\2"/>
<fileset dir=".">
<include name="*.txt"/>
</fileset>
</replaceregexp>
</target>
<target name="test" depends="test2">
<replaceregexp byline="false" flags="gs">
<regexp pattern="(&lt;topicref.*?)(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;&lt;/topichead&gt;)"/>
<substitution expression="\2${line.separator}\1\3"/>
<fileset dir=".">
<include name="*.txt"/>
</fileset>
</replaceregexp>
</target>
After you run ant test, you'll get your desired result, as below:
<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY"/>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>
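To make the two passes easier to follow, here is a hedged re-creation of them in Python's re module instead of Ant (an assumption made for illustration; "\n" stands in for ${line.separator}, and the names PASS_ONE, PASS_TWO, regroup and SAMPLE are hypothetical). The first pass strips the wrapper from every line whose opening tag occurs again later, so only the last line of each group keeps it; the second pass moves that surviving opening tag in front of the bare <topicref> lines that precede it.

```python
import re

# Pass 1 (the "test2" target): if the same opening tag occurs again later in
# the text, keep only the <topicref .../> element from this line.
PASS_ONE = re.compile(
    r'(<topichead\s+navtitle="[^"]*">)(<topicref\s+href="[^"]*"/>)</topichead>(?=.*\1)',
    re.DOTALL,
)

# Pass 2 (the "test" target): move the surviving opening tag in front of the
# run of bare <topicref .../> lines that precede it.
PASS_TWO = re.compile(
    r'(<topicref.*?)(<topichead\s+navtitle="[^"]*">)(<topicref\s+href="[^"]*"/></topichead>)',
    re.DOTALL,
)


def regroup(text: str) -> str:
    """Apply both passes in the order the Ant targets run them."""
    text = PASS_ONE.sub(r"\2", text)
    return PASS_TWO.sub(
        lambda m: m.group(2) + "\n" + m.group(1) + m.group(3), text
    )


# The updated input from the question, with the XXZZY line corrected.
ENTRIES = [
    ("AAAA", "XYZ"), ("AAAA", "ZYX"), ("AAAA", "XXYYZZ"), ("AAAA", "YYYY"),
    ("BBBB", "ZZZYXZ"), ("BBBB", "yyYYZZXX"), ("BBBB", "XX"),
    ("CCCC", "YYZ"), ("CCCC", "ZZY"), ("CCCC", "XXZZY"),
    ("CCCC", "ZZZ"), ("CCCC", "YYYZZXX"),
]
SAMPLE = "".join(
    f'<topichead navtitle="{title}"><topicref href="{href}"/></topichead>\n'
    for title, href in ENTRIES
)

print(regroup(SAMPLE))
```

One case the sample data does not exercise is a navtitle that occurs only once in the middle of the file. The second pattern starts at the first <topicref it finds, so it may reach across groups there; it is worth testing on such input before relying on it.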
Upvotes: 1
Reputation: 271
Here is one sample:
<replaceregexp byline="false" flags="g">
<regexp pattern="(&lt;topichead.*&gt;)(?=\r?\n\1)"/>
<substitution expression="&lt;topicref href=&quot;____&quot;/&gt;&lt;/topichead&gt;"/>
<fileset dir=".">
<include name="*.txt"/>
</fileset>
</replaceregexp>
The output is like this:
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
Note that this keeps the opening tag only on the last instance in each group, not the first one. FYI.
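The lookahead explains the result: a line is rewritten only when an identical line follows it, so every line in a run except the last loses its opening tag. As a hedged alternative that keeps the first instance instead, the sketch below uses a line-by-line scan in Python rather than a single regex in Ant (an assumption made for illustration; both function names are hypothetical). It remembers the previous line's opening tag and strips it from any line that repeats it. The lookahead version is included for comparison.

```python
import re

OPENING = re.compile(r'<topichead\s+navtitle="[^"]*">')


def lookahead_version(text: str) -> str:
    """Python rendering of the pattern above; the last line of each run is untouched."""
    return re.sub(
        r"(<topichead.*>)(?=\r?\n\1)",
        '<topicref href="____"/></topichead>',
        text,
    )


def strip_repeated_openers(text: str) -> str:
    """Keep the opening tag on the first line of each run and strip it from the rest."""
    result = []
    previous = None
    for line in text.splitlines(keepends=True):
        match = OPENING.match(line)
        opener = match.group(0) if match else None
        if opener is not None and opener == previous:
            line = line[len(opener):]
        # Compare against the original opener, even if it was just stripped.
        previous = opener
        result.append(line)
    return "".join(result)


A = '<topichead navtitle="AAAA"><topicref href="____"/></topichead>\n'
B = '<topichead navtitle="BBBB"><topicref href="____"/></topichead>\n'
BARE = '<topicref href="____"/></topichead>\n'
SAMPLE = A * 3 + B * 2

print(lookahead_version(SAMPLE))
print(strip_repeated_openers(SAMPLE))
```

Because the scan compares only the opening tag and not the whole line, it also handles lines whose href values differ, which the backreference to a full line cannot do.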
Upvotes: 0