user3762977

Reputation: 691

Ant task to delete occurrences of string after the first one

I want to delete any multiple occurrences of strings in a text file, leaving only the first instance.

Starting point:

<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>

Desired result:

<topichead navtitle="AAAA"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>
                           <topicref href="____"/></topichead>

I'll have to get rid of most of the instances of </topichead> after that, but once I have the first part, those will be easy to match and delete.

So based on something I saw on this page, I wrote this:

 <replaceregexp byline="false" flags="g">
     <regexp pattern="(&lt;topichead.*&gt;)(r?\n\1)+"/>
     <substitution expression="/1"/>
     <fileset dir=".">
     <include name="*.txt"/>
     </fileset>
   </replaceregexp> 

However it's not working. As a test, if I delete (r?\n\1)+ from the regexp pattern and just match all instances of (&lt;topichead.*&gt;) and simply replace it with XXX or whatever, that works. So I know things are linked up correctly. I've also tried just (\1)+ for that second group, but nothing is working so far for the goal above. Any ideas welcome.

UPDATE

Updating this with a better example, since the one I gave was a little imprecise. What I need to do is more like this:

Starting point:

<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="ZYX"/></topichead>
<topichead navtitle="AAAA"><topicref href="XXYYZZ"/></topichead>
<topichead navtitle="AAAA"><topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
<topichead navtitle="BBBB"><topicref href="yyYYZZXX"/></topichead>
<topichead navtitle="BBBB"><topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="ZZY"/></topichead>
<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>
<topichead navtitle="CCCC"><topicref href="ZZZ"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYYZZXX"/></topichead>

Desired result:

<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>

The "XXYYZZ" are links that are all different (or may be different) and need to be preserved.

The hard part is getting rid of the duplicates after the first instance of, for example, <topichead navtitle="AAAA">.

If I could get to this result, as a first step:

<topichead navtitle="AAAA"><topicref href="XYZ"/></topichead>
                           <topicref href="ZYX"/></topichead>
                           <topicref href="XXYYZZ"/></topichead>
                           <topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB"><topicref href="ZZZYXZ"/></topichead>
                           <topicref href="yyYYZZXX"/></topichead>
                           <topicref href="XX"/></topichead>
<topichead navtitle="CCCC"><topicref href="YYZ"/></topichead>
                           <topicref href="ZZY"/></topichead>
                           <topicref href="XXZZY></topichead>
                           <topicref href="ZZZ"/></topichead>
                           <topicref href="YYYZZXX"/></topichead>

Then I can get rid of the unwanted trailing </topichead> entries easily, using this:

 <replaceregexp byline="false" flags="gs">
 <regexp pattern="&lt;/topichead&gt;\r\n&lt;topicref"/>
 <substitution expression="${line.separator}&lt;topicref"/>
 <fileset dir=".">
 <include name="*.txt"/>
 </fileset>
 </replaceregexp> 

...and get the desired result shown above.

I'm doing it that way now, using search and replace for the first step, then following up with that replaceregexp. I have lots of long lists of these to do though, so it would be great to automate it all.

I've looked at many suggestions, all essentially variations on using (\r?\n\1) as the core of the pattern, but so far none of them does what I need.

Upvotes: 1

Views: 790

Answers (2)

jawee

Reputation: 271

After your update, I got your point. It seems one line of your original input:

<topichead navtitle="CCCC"><topicref href="XXZZY></topichead>

was probably meant to be:

<topichead navtitle="CCCC"><topicref href="XXZZY"/></topichead>

Then the solution would be as below:

    <!-- Pass 1: strip the topichead wrapper from every line whose opening tag occurs again later in the file -->
    <target name="test2">
        <replaceregexp byline="false" flags="gs">
            <regexp pattern="(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;)&lt;/topichead&gt;(?=.*\1)"/>
            <substitution expression="\2"/>
            <fileset dir=".">
                <include name="*.txt"/>
            </fileset>
        </replaceregexp>
    </target>

    <!-- Pass 2: move each remaining opening tag in front of the bare topicref lines that precede it -->
    <target name="test" depends="test2">
        <replaceregexp byline="false" flags="gs">
            <regexp pattern="(&lt;topicref.*?)(&lt;topichead\s+navtitle=&quot;[^&quot;]*&quot;&gt;)(&lt;topicref\s+href=&quot;[^&quot;]*&quot;/&gt;&lt;/topichead&gt;)"/>
            <substitution expression="\2${line.separator}\1\3"/>
            <fileset dir=".">
                <include name="*.txt"/>
            </fileset>
        </replaceregexp>
    </target>

After running ant test, you get the desired result:

<topichead navtitle="AAAA">
<topicref href="XYZ"/>
<topicref href="ZYX"/>
<topicref href="XXYYZZ"/>
<topicref href="YYYY"/></topichead>
<topichead navtitle="BBBB">
<topicref href="ZZZYXZ"/>
<topicref href="yyYYZZXX"/>
<topicref href="XX"/></topichead>
<topichead navtitle="CCCC">
<topicref href="YYZ"/>
<topicref href="ZZY"/>
<topicref href="XXZZY"/>
<topicref href="ZZZ"/>
<topicref href="YYYZZXX"/></topichead>
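
To try the two passes without running Ant, the same patterns can be exercised directly with java.util.regex, which is the engine Ant normally uses for replaceregexp. The following is only a minimal sketch under that assumption: the class name TwoPassDemo and the method regroup are made up for illustration, the Ant flag "s" is mapped to Pattern.DOTALL, and a plain "\n" stands in for ${line.separator}.

```java
import java.util.regex.Pattern;

// Hypothetical helper that mirrors the two Ant targets above.
class TwoPassDemo {
    // Pass 1 (target "test2"): drop the <topichead> wrapper from every line
    // whose opening tag appears again later in the input.
    static final Pattern PASS1 = Pattern.compile(
        "(<topichead\\s+navtitle=\"[^\"]*\">)(<topicref\\s+href=\"[^\"]*\"/>)</topichead>(?=.*\\1)",
        Pattern.DOTALL);

    // Pass 2 (target "test"): move the surviving opening tag in front of the
    // bare <topicref> lines that precede it.
    static final Pattern PASS2 = Pattern.compile(
        "(<topicref.*?)(<topichead\\s+navtitle=\"[^\"]*\">)(<topicref\\s+href=\"[^\"]*\"/></topichead>)",
        Pattern.DOTALL);

    static String regroup(String text) {
        String pass1 = PASS1.matcher(text).replaceAll("$2");
        // "\n" is an assumption standing in for Ant's ${line.separator}.
        return PASS2.matcher(pass1).replaceAll("$2\n$1$3");
    }

    public static void main(String[] args) {
        String input = String.join("\n",
            "<topichead navtitle=\"AAAA\"><topicref href=\"XYZ\"/></topichead>",
            "<topichead navtitle=\"AAAA\"><topicref href=\"ZYX\"/></topichead>",
            "<topichead navtitle=\"BBBB\"><topicref href=\"ZZZYXZ\"/></topichead>",
            "<topichead navtitle=\"BBBB\"><topicref href=\"XX\"/></topichead>");
        System.out.println(regroup(input));
    }
}
```

One caveat worth checking on your own data: a navtitle that occurs on only one line is left untouched by pass 1, and pass 2 can then appear to pull its topicref into the following group, so these patterns seem to rely on every header occurring at least twice.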

Upvotes: 1

jawee

Reputation: 271

One sample, using a lookahead for the repeated line:

   <replaceregexp byline="false" flags="g">
     <regexp pattern="(&lt;topichead.*&gt;)(?=\r?\n\1)"/>
     <substitution expression="&lt;topicref href=&quot;____&quot;/&gt;&lt;/topichead&gt;"/>
     <fileset dir=".">
        <include name="*.txt"/>
     </fileset>
   </replaceregexp>

The output is like this:

<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="AAAA"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="BBBB"><topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topicref href="____"/></topichead>
<topichead navtitle="CCCC"><topicref href="____"/></topichead>

Note that this leaves only the last instance of each header, not the first one.
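
If the first instance has to be the one that survives, a lookahead cannot express "this header already appeared on the previous line", so a line-by-line pass may be simpler than a single regex. The sketch below is not an Ant task and is not from the question; it is a plain Java illustration of that different technique. The names KeepFirstHeader and keepFirst are hypothetical, it assumes Java 11 or later (for String.repeat), and it assumes every header sits at the very start of its line.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: keeps the first header of each consecutive run.
class KeepFirstHeader {
    // Matches the opening <topichead ...> tag at the start of a line.
    private static final Pattern HEADER = Pattern.compile("^<topichead[^>]*>");

    // Blanks out every header that repeats the header of the line before it,
    // so only the first line of each consecutive run keeps its tag.
    static List<String> keepFirst(List<String> lines) {
        List<String> out = new ArrayList<>();
        String previous = null;
        for (String line : lines) {
            Matcher m = HEADER.matcher(line);
            if (m.find()) {
                String header = m.group();
                if (header.equals(previous)) {
                    // Pad with spaces to keep the column layout shown in the question.
                    line = " ".repeat(header.length()) + line.substring(m.end());
                }
                previous = header;
            } else {
                // A line without a header ends the current run.
                previous = null;
            }
            out.add(line);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> input = List.of(
            "<topichead navtitle=\"AAAA\"><topicref href=\"____\"/></topichead>",
            "<topichead navtitle=\"AAAA\"><topicref href=\"____\"/></topichead>",
            "<topichead navtitle=\"BBBB\"><topicref href=\"____\"/></topichead>");
        keepFirst(input).forEach(System.out::println);
    }
}
```

Because only the header is compared, this also works when the topicref links differ from line to line, as in the updated example in the question.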

Upvotes: 0
