user12575606

Isolating text in a large xml file

It's my first time asking for help here, so please don't eat me.

I have a really big and messy .xml file on my hands. Its structure goes like this:

<SPEAKER N°001>ERROR</SPEAKER N°001>
<ORIGINAL N°001>
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
</ORIGINAL N°001>
<JAPANESE N°001>
ツートンカラーの群れはグルグルと回り続け、
三方向から催眠動画を見せられているかのような錯覚に
陥る戦刃だが、それでも、彼女の表情は凍ったままだ。
</JAPANESE N°001>
<TRANSLATED N°001>

</TRANSLATED N°001>
<COMMENT N°001>

</COMMENT N°001>
------------------------------------------------------------
<SPEAKER N°002>ERROR</SPEAKER N°002>
<ORIGINAL N°002>
TEXT THAT INTERESTS ME2
</ORIGINAL N°002>
<JAPANESE N°002>
寧ろ、この異様な状況を前に、【超高校級の軍人】は
一際心が平静になりつつある。
</JAPANESE N°002>
<TRANSLATED N°002>

</TRANSLATED N°002>
<COMMENT N°002>

</COMMENT N°002>
------------------------------------------------------------

This repeats about a hundred times. I need to isolate the text inside the <ORIGINAL N°number> tags and delete everything else, so the end result looks like this:

TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME2
...
TEXT THAT INTERESTS ME254

I had the idea of using a macro and the search/replace function, but I can't for the life of me get it to work. The file is too long to do it manually. I'm using Notepad++, but let me know if it's easier to do with a different program.

Also, sorry if this question is a duplicate.

Upvotes: 3

Views: 65

Answers (2)

Samuel

Reputation: 6490

Piece of cake in PowerShell :)

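Also: your XML is invalid. A tag like <SPEAKER N°001> is not well-formed: the space ends the element name, and N°001 is not a valid attribute, so a real XML parser will reject the file.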
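
If you want to check the "invalid XML" claim yourself, here is a minimal sketch. It assumes Python's standard xml.etree.ElementTree parser as the checker (my choice for illustration, not something the question uses), and is_well_formed is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML, False otherwise."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A line copied from the question: the tag carries " N°001" after its name.
print(is_well_formed("<SPEAKER N°001>ERROR</SPEAKER N°001>"))

# A hypothetical well-formed variant, with the number moved into an attribute.
print(is_well_formed("<SPEAKER n='001'>ERROR</SPEAKER>"))
```

This is why a plain-text regex is the practical route here rather than an XML parser.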

  • Press Win + R to open the Run dialog
  • Type powershell
  • Press Enter; a blue window appears. If not, you are probably on Linux :P
  • Type the line below, but replace D:\t.txt with your path:
Get-Content D:\t.txt -Raw | Select-String -Pattern "(?smi)<ORIGINAL.*?>(.*?)</ORIGINAL" -AllMatches | ForEach-Object { $_.Matches } | ForEach-Object { $_.Groups[1].Value.Trim() }

My output was

TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME2
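
Since the question asks whether another program would be easier, the same regex idea can be sketched in Python. This is only an illustration under assumptions: extract_originals is a hypothetical helper name, and the sample string is a shortened copy of the question's format, not the real file.

```python
import re

# Same idea as the PowerShell pattern: lazily capture everything between
# an opening <ORIGINAL ...> tag and the next </ORIGINAL.
# re.DOTALL lets "." cross line breaks; re.IGNORECASE mirrors the "i" flag.
ORIGINAL_BLOCK = re.compile(r"<ORIGINAL.*?>(.*?)</ORIGINAL", re.DOTALL | re.IGNORECASE)


def extract_originals(text):
    """Return the trimmed contents of every <ORIGINAL ...> block, in order."""
    return [match.group(1).strip() for match in ORIGINAL_BLOCK.finditer(text)]


# Shortened sample in the question's format (illustrative only).
sample = """<SPEAKER N°001>ERROR</SPEAKER N°001>
<ORIGINAL N°001>
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
</ORIGINAL N°001>
<JAPANESE N°001>
...
</JAPANESE N°001>
------------------------------------------------------------
<ORIGINAL N°002>
TEXT THAT INTERESTS ME2
</ORIGINAL N°002>
"""

print("\n".join(extract_originals(sample)))
```

To run it on the real file, you would read the file into a string first, for example with open(path, encoding="utf-8").read(); the UTF-8 encoding is an assumption about how the file was saved.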

Upvotes: 0

Toto

Reputation: 91498

Here is a way to do it in Notepad++:

  • Ctrl+H
  • Find what: (?:(?!<ORIGINAL.+?>).)*<ORIGINAL.+?>\R*((?:(?!</ORIGINAL.+?>).)+)(?:</ORIGINAL.+?>(?:(?!<ORIGINAL.+?>).)*)
  • Replace with: $1
  • CHECK Match case
  • CHECK Wrap around
  • CHECK Regular expression
  • CHECK . matches newline
  • Replace all
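
The steps above can also be tried outside Notepad++. Below is a rough Python sketch of the same find-and-replace, under a few assumptions: Python's re module has no \R, so I swapped in (?:\r?\n)* for it; re.DOTALL stands in for ". matches newline"; and keep_only_originals is a hypothetical helper name.

```python
import re

# The Notepad++ pattern, split into its three parts.
# Each match swallows: the junk before a block, the block itself,
# and the junk after it, then is replaced by the captured text only.
PATTERN = re.compile(
    r"(?:(?!<ORIGINAL.+?>).)*"          # anything up to the next opening tag
    r"<ORIGINAL.+?>(?:\r?\n)*"          # opening tag plus following line breaks
    r"((?:(?!</ORIGINAL.+?>).)+)"       # group 1: text up to the closing tag
    r"(?:</ORIGINAL.+?>"                # the closing tag
    r"(?:(?!<ORIGINAL.+?>).)*)",        # anything up to the next opening tag
    re.DOTALL,                          # "." also matches line breaks
)


def keep_only_originals(text):
    """Replace every match with group 1, like 'Replace with: $1'."""
    return PATTERN.sub(r"\1", text)


# Shortened sample in the question's format (illustrative only).
sample = """<SPEAKER N°001>ERROR</SPEAKER N°001>
<ORIGINAL N°001>
TEXT THAT INTERESTS ME1
TEXT THAT INTERESTS ME1
</ORIGINAL N°001>
<JAPANESE N°001>
...
</JAPANESE N°001>
------------------------------------------------------------
<ORIGINAL N°002>
TEXT THAT INTERESTS ME2
</ORIGINAL N°002>
"""

print(keep_only_originals(sample), end="")
```

Group 1 stops right before the closing tag, so each block keeps its final line break and the results end up one per line.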


Upvotes: 0
