Reputation:
It's my first time asking for help here, so please don't eat me.
I have a really big and messy .xml file on my hands. Its structure goes like this:
<SPEAKER N°001>ERROR</SPEAKER N°001>
<ORIGINAL N°001>
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
</ORIGINAL N°001>
<JAPANESE N°001>
ツートンカラーの群れはグルグルと回り続け、
三方向から催眠動画を見せられているかのような錯覚に
陥る戦刃だが、それでも、彼女の表情は凍ったままだ。
</JAPANESE N°001>
<TRANSLATED N°001>
</TRANSLATED N°001>
<COMMENT N°001>
</COMMENT N°001>
------------------------------------------------------------
<SPEAKER N°002>ERROR</SPEAKER N°002>
<ORIGINAL N°002>
TEXT THAT INTERESTS ME2
</ORIGINAL N°002>
<JAPANESE N°002>
寧ろ、この異様な状況を前に、【超高校級の軍人】は
一際心が平静になりつつある。
</JAPANESE N°002>
<TRANSLATED N°002>
</TRANSLATED N°002>
<COMMENT N°002>
</COMMENT N°002>
------------------------------------------------------------
This repeats about a hundred times. I need to isolate the text inside the <ORIGINAL N°number> tags and delete everything else, so the end result looks like this:
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME2
...
TEXT THAT INTERESTS ME254
I had an idea to use a macro and the search/replace function, but I can't for the life of me get it to work. The file is too long to do it manually. I'm using Notepad++, but let me know if it's easier to do with a different program.
Also, sorry if this question is a duplicate.
Upvotes: 3
Views: 65
Reputation: 6490
Piece of cake in powershell :)
Also: Your XML is invalid.
Open PowerShell and run the following, replacing D:\t.txt with your path:
(Get-Content D:\t.txt -Raw) | Select-String -Pattern "(?smi)<ORIGINAL.*?>(.*?)</ORIGINAL" -AllMatches | % {$_.Matches} | % {$_.Groups[1].ToString().Trim()}
My output was
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME2
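Since the question is open to other tools, here is a minimal Python sketch of the same lazy-match idea. It is an illustration and not part of the PowerShell command: the function name `extract_original` and the shortened two-entry sample are hypothetical, and only the tag layout is taken from the question.

```python
import re

# Lazy match from each <ORIGINAL ...> opening tag to the next </ORIGINAL.
# re.DOTALL lets "." cross line breaks; re.IGNORECASE mirrors the (?i) flag.
ORIGINAL_BLOCK = re.compile(r"<ORIGINAL.*?>(.*?)</ORIGINAL", re.DOTALL | re.IGNORECASE)


def extract_original(text):
    """Return the trimmed contents of every ORIGINAL block, in file order."""
    return [block.strip() for block in ORIGINAL_BLOCK.findall(text)]


# Hypothetical, shortened sample that follows the layout shown in the question.
sample = (
    "<SPEAKER N°001>ERROR</SPEAKER N°001>\n"
    "<ORIGINAL N°001>\n"
    "TEXT THAT INTERESTS ME1\n"
    "TEXT THAT INTERESTS ME1\n"
    "</ORIGINAL N°001>\n"
    "<TRANSLATED N°001>\n"
    "</TRANSLATED N°001>\n"
    "------------------------------------------------------------\n"
    "<SPEAKER N°002>ERROR</SPEAKER N°002>\n"
    "<ORIGINAL N°002>\n"
    "TEXT THAT INTERESTS ME2\n"
    "</ORIGINAL N°002>\n"
)

print("\n".join(extract_original(sample)))
```

For a real file, the text would be read first, for example with `open(path, encoding="utf-8").read()`. The UTF-8 encoding is an assumption and may need adjusting for the actual file.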
Upvotes: 0
Reputation: 91498
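Here is a way to go, using the Notepad++ Replace dialog in regular expression search mode:
Find what: (?:(?!<ORIGINAL.+?>).)*<ORIGINAL.+?>\R*((?:(?!</ORIGINAL.+?>).)+)(?:</ORIGINAL.+?>(?:(?!<ORIGINAL.+?>).)*)
Replace with: $1
Check ". matches newline", then click Replace All.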
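To make the pieces of that pattern easier to follow, here is a rough Python port with one comment per part. It is a sketch under a few assumptions: Python's `re` module has no `\R`, so the run of line breaks after the opening tag is written as `(?:\r?\n)*`; the trailing non-capturing group is flattened; and the name `keep_only_original` and the sample text are hypothetical.

```python
import re

STRIP_AROUND_ORIGINAL = re.compile(
    r"(?:(?!<ORIGINAL.+?>).)*"      # anything that does not start an opening tag
    r"<ORIGINAL.+?>(?:\r?\n)*"      # the opening tag and the line breaks after it
    r"((?:(?!</ORIGINAL.+?>).)+)"   # group 1: the text up to the closing tag
    r"</ORIGINAL.+?>"               # the closing tag
    r"(?:(?!<ORIGINAL.+?>).)*",     # everything up to the next opening tag or the end
    re.DOTALL,                      # plays the role of ". matches newline"
)


def keep_only_original(text):
    """Replace each match with its captured group, like replacing with $1."""
    return STRIP_AROUND_ORIGINAL.sub(r"\1", text)


# Hypothetical, shortened sample that follows the layout shown in the question.
sample = (
    "<SPEAKER N°001>ERROR</SPEAKER N°001>\n"
    "<ORIGINAL N°001>\n"
    "TEXT THAT INTERESTS ME1\n"
    "TEXT THAT INTERESTS ME1\n"
    "</ORIGINAL N°001>\n"
    "<TRANSLATED N°001>\n"
    "</TRANSLATED N°001>\n"
    "------------------------------------------------------------\n"
    "<SPEAKER N°002>ERROR</SPEAKER N°002>\n"
    "<ORIGINAL N°002>\n"
    "TEXT THAT INTERESTS ME2\n"
    "</ORIGINAL N°002>\n"
    "<COMMENT N°002>\n"
    "</COMMENT N°002>\n"
)

print(keep_only_original(sample), end="")
```

Each match swallows the junk before an ORIGINAL block, the block itself, and the junk after it, and puts back only the captured text, so a single replace-all pass leaves just the wanted lines.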
Upvotes: 0