I need a command line to count the number of characters between repeated occurrences of a pattern in an XML file

I have an XML file and I need to count the number of characters between occurrences of a pattern (a tag) that is repeated throughout the file.

The pattern is:

<controlfield tag="001">

Example XML file content:

<datafield tag="650" ind1="0" ind2="4">
   <subfield code="a">xxx</subfield>
   <subfield code="x">sdf</subfield>
 </datafield>
 <datafield tag="650" ind1="0" ind2="4">
   <subfield code="a">fff</subfield>
 </datafield>
 <datafield tag="650" ind1="0" ind2="4">
   <subfield code="a">asdfaf</subfield>
   <subfield code="x">fdfdf</subfield>
   <subfield code="x">dfdfdf</subfield>
 </datafield>
<controlfield tag="001">000000355</controlfield>
<datafield tag="909" ind1=" " ind2=" ">
  <subfield code="a">AGR01</subfield>
  <subfield code="b">ph</subfield>
  <subfield code="c">AGRP</subfield>
</datafield>
<datafield tag="910" ind1=" " ind2=" ">
  <subfield code="a">AGR</subfield>
</datafield>
<controlfield tag="001">000000358</controlfield>
<datafield tag="590" ind1=" " ind2=" ">
  <subfield code="a">19. dfsdfs em 2015</subfield>
  <subfield code="w">CECLI</subfield>
</datafield>
<datafield tag="650" ind1="0" ind2="4">
  <subfield code="a">Topografia</subfield>
</datafield>
<controlfield tag="001">000000365</controlfield>

I read https://unix.stackexchange.com/questions/295332/i-need-the-counts-of-lines-between-two-matching-patterns and tried:

sed -n '/tag="001"/,/tag="001"/p' file.xml | wc -l

But only one count was printed.

I need a count for each pattern occurrence. In the above example I need 3 counts:

  1. number of characters before

    <controlfield tag="001">000000355</controlfield>
    
  2. number of characters between

    <controlfield tag="001">000000355</controlfield>
    

    and

    <controlfield tag="001">000000358</controlfield>
    
  3. number of characters between

    <controlfield tag="001">000000358</controlfield>
    

    and

    <controlfield tag="001">000000365</controlfield>
    

Can you help me?

Upvotes: 0

Views: 44

Answers (1)

karakfa

Reputation: 67507

With GNU awk:

$ awk -v RS="<controlfield tag=\"001\">[0-9]+</controlfield>" '{print length()}' file

394
253
239
1

The last 1 is for the final line feed. Each count also includes the line feeds inside the record, so you may want to remove the line feeds before the length is calculated.
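If the line feeds should not be counted, something along these lines might do it. This is only a sketch: `count_between` is a made-up helper name, it still relies on an awk that accepts a regular expression as `RS` (GNU awk does), and it drops empty chunks, so the leftover trailing line feed is no longer printed but two adjacent controlfield lines would not produce a 0 either. Indentation spaces are still counted.

```shell
# Hypothetical helper: print one character count per chunk of text
# separated by a <controlfield tag="001">...</controlfield> element.
# Assumes an awk that treats RS as a regular expression (e.g. GNU awk).
count_between() {
  awk -v RS='<controlfield tag="001">[0-9]+</controlfield>' '
    {
      gsub(/\n/, "")      # drop line feeds before measuring
      n = length()        # length of what is left of the record
      if (n) print n      # skip empty chunks, e.g. the final line feed
    }
  ' "$1"
}
```

Call it as `count_between file.xml`. To ignore the indentation as well, the `gsub` pattern could be widened to `/[[:space:]]/`, at the cost of also dropping spaces inside field values.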

Upvotes: 2
