Reputation: 63
I'm trying to modify some texts using regex. This is the original text:
<text xml:lang="en">"Insert Swab to Start Analysis"</text>
<text xml:lang="es"></text>
<text xml:lang="fr"></text>
<text xml:lang="de"></text>
<text xml:lang="pt"></text>
<text xml:lang="du"></text>
And this is the desired text:
<en>"Insert Swab to Start Analysis"</en>
<es>"Insert Swab to Start Analysis"</es>
<fr>"Insert Swab to Start Analysis"</fr>
<de>"Insert Swab to Start Analysis"</de>
<pt>"Insert Swab to Start Analysis"</pt>
<du>"Insert Swab to Start Analysis"</du>
As you can see, there are two changes: the tags are renamed and the source text is copied into the target languages.
I managed to do this using two different regexes.
First regex (copy source text into target languages):
Search: (<text xml:lang=)"en">(.+?)(</text>)\r\n \1"es">\3\r\n \1"fr">\3\r\n \1"de">\3\r\n \1"pt">\3\r\n \1"du">\3
Replace: \1"en">\2\3\r\n \1"es">\2\3\r\n \1"fr">\2\3\r\n \1"de">\2\3\r\n \1"pt">\2\3\r\n \1"du">\2\3
Second regex (change tags):
Search: <text xml:lang="(en|es|fr|de|pt|du)">(.*?)(</[^>]*>)
Replace: <\1>\2</\1>
I'm quite happy with the result but I'm wondering if all this can be done using a single regex and not two. The second regex I used is quite elegant but it does not copy the source text into the different target languages. I suspect it needs a little trick to work properly. Suggestions?
PS: I'm just using Notepad++ to do all this.
PPS: It's a big XML file with many entries, not only the one I'm showing you here.
Upvotes: 1
Views: 75
Reputation: 626774
If the string is always formatted the same way, you can amend the first regex so that it does the whole job in one pass:
Find What: (<text xml:lang=")en">(.+?)(</text>)\R \1es">\3\R \1fr">\3\R \1de">\3\R \1pt">\3\R \1du">\3
Replace With: <en>\2</en>\r\n <es>\2</es>\r\n <fr>\2</fr>\r\n <de>\2</de>\r\n <pt>\2</pt>\r\n <du>\2</du>
See the regex demo
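If you ever need to run the same replacement outside Notepad++, here is a minimal Python sketch of the same idea. It is an illustration, not part of the Notepad++ solution: the names `TARGETS`, `PATTERN` and `copy_source` are made up for this example. Python's `re` module has no `\R`, so `\r?\n` stands in for the line break, and `[ \t]*` is an assumption that lets the indentation before each target line vary. The first line break plus indentation is captured and reused so that the output keeps the file's own layout.

```python
import re

# Target languages, in the order they appear in the file (from the question).
TARGETS = ["es", "fr", "de", "pt", "du"]

# A line break followed by optional indentation. Python has no \R,
# so \r?\n is used instead (assumption: only LF or CRLF line endings).
SEP = r"\r?\n[ \t]*"

# The English entry, then every target entry, each of which must be empty.
PATTERN = re.compile(
    r'<text xml:lang="en">(?P<text>.+?)</text>'
    + r"(?P<sep>" + SEP + r")"
    + rf'<text xml:lang="{TARGETS[0]}"></text>'
    + "".join(SEP + rf'<text xml:lang="{lang}"></text>' for lang in TARGETS[1:])
)


def copy_source(xml: str) -> str:
    """Rename the tags and copy the English text into every target language."""

    def repl(match: re.Match) -> str:
        text = match.group("text")
        sep = match.group("sep")  # reuse the original line break and indentation
        return sep.join(f"<{lang}>{text}</{lang}>" for lang in ["en"] + TARGETS)

    return PATTERN.sub(repl, xml)


if __name__ == "__main__":
    sample = (
        '<text xml:lang="en">"Insert Swab to Start Analysis"</text>\r\n'
        '  <text xml:lang="es"></text>\r\n'
        '  <text xml:lang="fr"></text>\r\n'
        '  <text xml:lang="de"></text>\r\n'
        '  <text xml:lang="pt"></text>\r\n'
        '  <text xml:lang="du"></text>'
    )
    print(copy_source(sample))
```

As with the Notepad++ pattern, a block is only rewritten when all five target entries are present, empty, and in this exact order; anything else is left untouched.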
Details
(<text xml:lang=")
- Group 1 (referred to with \1): literal text <text xml:lang="
en">
- literal text en">
(.+?)
- Group 2: any 1 or more chars other than line break chars, as few as possible
(</text>)
- Group 3: literal text </text>
\R
- any line break sequence
"  "
- two spaces (quoted here only to make them visible)
\1
- the text captured in Group 1
es">
- literal text es">
\3
- the text captured in Group 3
\R \1fr">\3\R \1de">\3\R \1pt">\3\R \1du">\3
- this is already clear from the above description.
Upvotes: 2