Green_goblin

Reputation: 63

Regex, backreferences and alternations

I'm trying to modify some texts using regex. This is the original text:

  <text xml:lang="en">"Insert Swab to Start Analysis"</text>
  <text xml:lang="es"></text>
  <text xml:lang="fr"></text>
  <text xml:lang="de"></text>
  <text xml:lang="pt"></text>
  <text xml:lang="du"></text>

And this is the desired text:

  <en>"Insert Swab to Start Analysis"</en>
  <es>"Insert Swab to Start Analysis"</es>
  <fr>"Insert Swab to Start Analysis"</fr>
  <de>"Insert Swab to Start Analysis"</de>
  <pt>"Insert Swab to Start Analysis"</pt>
  <du>"Insert Swab to Start Analysis"</du>

As you can see, there are two changes: the tags are renamed, and the source text is copied into the target languages.

I managed to do this using two separate regexes.

First regex (copy source text into target languages):

Search: (<text xml:lang=)"en">(.+?)(</text>)\r\n  \1"es">\3\r\n  \1"fr">\3\r\n  \1"de">\3\r\n  \1"pt">\3\r\n  \1"du">\3
Replace: \1"en">\2\3\r\n  \1"es">\2\3\r\n  \1"fr">\2\3\r\n  \1"de">\2\3\r\n  \1"pt">\2\3\r\n  \1"du">\2\3

Second regex (change tags):

Search: <text xml:lang="(en|es|fr|de|pt|du)">(.*?)(</[^>]*>)
Replace: <\1>\2</\1>

I'm quite happy with the result but I'm wondering if all this can be done using a single regex and not two. The second regex I used is quite elegant but it does not copy the source text into the different target languages. I suspect it needs a little trick to work properly. Suggestions?

PS: I'm just using Notepad++ to do all this.

PPS: It's a big XML file with many entries, not only the one I'm showing you here.

Upvotes: 1

Views: 75

Answers (1)

Wiktor Stribiżew

Reputation: 626774

If the entries are always formatted exactly the same way (same language order, same indentation), you can amend the first regex to do the whole job:

Find What: (<text xml:lang=")en">(.+?)(</text>)\R  \1es">\3\R  \1fr">\3\R  \1de">\3\R  \1pt">\3\R  \1du">\3
Replace With: <en>\2</en>\r\n  <es>\2</es>\r\n  <fr>\2</fr>\r\n  <de>\2</de>\r\n  <pt>\2</pt>\r\n  <du>\2</du>

Details

  • (<text xml:lang=") - Group 1 (referred to with \1): literal text <text xml:lang="
  • en"> - literal text en">
  • (.+?) - Group 2: any 1 or more chars other than line break chars, as few as possible
  • (</text>) - Group 3: literal text </text>
  • \R - any line break sequence
  • (two spaces) - the two literal spaces that indent each line
  • \1 - the text captured in Group 1
  • es"> - literal text es">
  • \3 - the text captured in Group 3
  • \R  \1fr">\3\R  \1de">\3\R  \1pt">\3\R  \1du">\3 - the same sequence (line break, two spaces, Group 1, language code, Group 3) repeated for the remaining languages.
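If you want to try the same idea outside Notepad++, here is a minimal Python sketch of the single-pass replacement. It is not part of the approach above, just an illustration under a few assumptions: entries are indented with two spaces, the languages always appear in the order shown in the question, and the output uses \r\n line breaks as in the Replace With field. Python's re module has no \R, so the line break is spelled out as (?:\r\n|\r|\n). The names TARGET_LANGS, LINE_BREAK, PATTERN and convert are made up for this example.

```python
import re

# Assumption: target languages always follow "en" in exactly this order.
TARGET_LANGS = ["es", "fr", "de", "pt", "du"]

# Python's re has no \R, so spell out the accepted line break sequences.
LINE_BREAK = r"(?:\r\n|\r|\n)"

# Group 1: opening tag prefix, Group 2: source text, Group 3: closing tag.
# Each following line must be: line break, two spaces, an empty <text> element.
PATTERN = re.compile(
    r'(<text xml:lang=")en">(.+?)(</text>)'
    + "".join(LINE_BREAK + r'  \1' + lang + r'">\3' for lang in TARGET_LANGS)
)


def _retag(match):
    """Build the six renamed elements, each carrying the English source text."""
    text = match.group(2)
    return "\r\n  ".join(
        f"<{lang}>{text}</{lang}>" for lang in ["en", *TARGET_LANGS]
    )


def convert(xml):
    """Replace every matching six-line entry in one pass; other text is untouched."""
    return PATTERN.sub(_retag, xml)


if __name__ == "__main__":
    sample = "\r\n".join(
        [
            '  <text xml:lang="en">"Insert Swab to Start Analysis"</text>',
            '  <text xml:lang="es"></text>',
            '  <text xml:lang="fr"></text>',
            '  <text xml:lang="de"></text>',
            '  <text xml:lang="pt"></text>',
            '  <text xml:lang="du"></text>',
        ]
    )
    print(convert(sample))
```

As with the Notepad++ version, an entry whose target elements are already filled in, reordered, or indented differently will simply not match and is left as it is.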

Upvotes: 2
