user19633338

Perl regex combining capture groups & nth string

I have files like the following:

<div title="alpha" Mauris eu justo sed nisi aliquet blandit. <span name="ll">beta</span> Fusce in pharetra nisi. <span name="ll">gamma</span> Aliquam vehicula imperdiet turpis et rhoncus. <span name="ll">delta</span> Donec faucibus augue quis neque dictum, at rutrum dolor placerat.</div>

I want to replace the value of title= with the content of the nth <span name="ll"> element, remove that span, and keep the rest of the line in its original order.

For example, choosing the 2nd <span name="ll"> should produce:

<div title="gamma" Mauris eu justo sed nisi aliquet blandit. <span name="ll">beta</span> Fusce in pharetra nisi. Aliquam vehicula imperdiet turpis et rhoncus. <span name="ll">delta</span> Donec faucibus augue quis neque dictum, at rutrum dolor placerat.</div>

And so on for other values of n.


My attempt:

find . -type f -exec perl -pi -w -e 's/(title=)"?[^"\s]*"?(.*)((?:.*?\h+class="ll">){1}.*?)\h+class="ll">"?([^"\s]+)"?(<.*)/$1"$3"$2$4/' \{\} \;

Where is my mistake?

Upvotes: 2

Views: 78

Answers (2)

anubhava

Reputation: 784998

This Perl solution should work for you:

# matching 2nd <span name="ll">
perl -pe 's~(title=)"?[^"\s]*"?((?:.*?\h+<span name="ll">){1}.*?)\h+<span name="ll">([^<]+)</span>~$1"$3"$2~' file

<div title="gamma" Mauris eu justo sed nisi aliquet blandit. <span name="ll">beta</span> Fusce in pharetra nisi. Aliquam vehicula imperdiet turpis et rhoncus. <span name="ll">delta</span> Donec faucibus augue quis neque dictum, at rutrum dolor placerat.</div>

# matching 3rd <span name="ll">
perl -pe 's~(title=)"?[^"\s]*"?((?:.*?\h+<span name="ll">){2}.*?)\h+<span name="ll">([^<]+)</span>~$1"$3"$2~' file

<div title="delta" Mauris eu justo sed nisi aliquet blandit. <span name="ll">beta</span> Fusce in pharetra nisi. <span name="ll">gamma</span> Aliquam vehicula imperdiet turpis et rhoncus. Donec faucibus augue quis neque dictum, at rutrum dolor placerat.</div>

RegEx Explanation:

  • (title=): Match title= and capture in group #1
  • "?[^"\s]*"?: Match the old title value, an optionally quoted string without quotes or whitespace (not captured, so it is dropped)
  • (: Start capture group #2
    • (?:: Start non-capture group
      • .*?: Match any text (lazy match)
      • \h+: Match 1+ horizontal whitespace characters
      • <span name="ll">: Match text <span name="ll">
    • ){1}: End non-capture group and repeat it exactly 1 time, skipping one span (use {n-1} to reach the nth span)
    • .*?: Match any text (lazy match)
  • ): End capture group #2
  • \h+: Match 1+ horizontal whitespace characters
  • <span name="ll">: Match text <span name="ll">
  • ([^<]+): Match 1+ of any char that is not a < and capture in group #3
  • </span>: Match </span>
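  • $1"$3"$2: Replacement part: title=, then the span content (group #3) in quotes, then the text that preceded the span (group #2)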

Upvotes: 2

choroba

Reputation: 241808

Instead of doing everything in one substitution, proceed in steps:

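perl -wpe '$n = 2;                                 # which span to promote (1-based)
           @m = /<span name="ll">([^<]+)/g;        # contents of every span on the line
           s/title="[^"]+"/title="$m[$n-1]"/;      # put the nth content into the title
           s:<span name="ll">\Q$m[$n-1]\E</span> ::;' file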

i.e.

  1. extract all the strings that can be moved;
  2. replace the title by the wanted string;
  3. remove the span containing the wanted string.

Upvotes: 2
