Reputation: 13054
Using Vim, I am trying to remove all text outside of <text> blocks. The match needs to span newlines and other (unrelated) tags.
I have tried using a regex substitution to replace that text with newlines, but it failed for a couple of reasons: my patterns did not span multiple lines, and I need the matches to be non-greedy. (Is that done with \{-} somehow?)
The regex that should match the content I would like to delete would look like: <\/text>.*<text.*> but if I make this match non-greedy, I may run into other issues. (I also realize that this leaves one partial tag section at the beginning for me to clean up.)
Is there another approach I should be taking, or can someone show me how to remove all content that is not between such tags using Vim?
EDIT: Including sample text
<contributor>
<username>MalafayaBot</username>
<id>628</id>
</contributor>
<minor />
<comment>Robô: A modificar Categoria:Vocábulo de étimo latino (Português) para Categoria:Entrada de étimo latino (Português)</comment>
<text xml:space="preserve">={{-pt-}}=
==Substantivo==
{{flex.pt|ms=excerto|mp=excertos}}
{{paroxítona|ex|cer|to}} {{m}}
# [[extrato]] de um [[texto]], [[fragmento]]
#: ''A seguir, um '''excerto''' do texto original.''
===Tradução===
{{tradini}}
* {{trad|es|extracto}}
* {{trad|fr|extrait}}
{{tradmeio}}
* {{trad|en|excerpt}}
{{tradfim}}
=={{etimologia|pt}}==
:Do latim ''[[excerptu]]'' (colhido de).
=={{pronúncia|pt}}==
===Brasil===
* [[SAMPA]]: /e."sEx.tu/
* [[AFI]]: /esˈertu/
[[zh:excerto]]</text>
<sha1>8i1zywj37s74ah4wnai11ohorfjn8j5</sha1>
<model>wikitext</model>
Upvotes: 2
Views: 490
Reputation: 191
If you don't NEED to use vim, you can try this sed command; just replace "test" with the name of your file. I would test this on a COPY of your file first, since the -i option tells sed to modify the actual file you pass in.
sed -i 's/<\/text>[^<]*/<\/text>/g' test
EDIT: After seeing the sample, I'm going to take a different approach. Instead of getting rid of all the text that is not within the tags, I'm going to select all the <text> blocks and write them to a new file. Hopefully your version of grep supports the -P option. Try this:
grep -Pzo "(?s)<text.*?<\/text>" sample.txt > out.txt
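If grep -P is not available, the same idea (a non-greedy match that is allowed to cross newlines) can be sketched in Python. This is only an illustration, not a replacement for the grep command: the function name extract_text_blocks and the short sample string are made up for the example, and the pattern assumes <text> blocks are never nested or self-closing.

```python
import re

# re.DOTALL lets "." match newlines (like (?s) in the grep pattern),
# and ".*?" is the non-greedy form, so each match stops at the first </text>.
TEXT_BLOCK = re.compile(r"<text\b[^>]*>.*?</text>", re.DOTALL)


def extract_text_blocks(data):
    """Return every <text ...>...</text> block as a string, tags included."""
    return TEXT_BLOCK.findall(data)


# Hypothetical input with two blocks separated by unrelated tags.
sample = (
    "<comment>bot edit</comment>\n"
    '<text xml:space="preserve">first\nblock</text>\n'
    "<sha1>abc</sha1>\n"
    "<text>second</text>\n"
    "<model>wikitext</model>\n"
)

blocks = extract_text_blocks(sample)
print("\n".join(blocks))
```

With a greedy ".*" the first match would run from the first <text> to the last </text> and swallow everything in between, which is why the non-greedy form matters here.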
Upvotes: 1
Reputation: 172648
Your struggles with regular expressions indicate that you're using the wrong tool for the job.
For text extraction from XML, you can use XSLT, which will handle all special cases far better than a regular expression. Or use special-purpose tools like xidel, a kind of grep for XML. With it, the extraction is as easy as:
xidel --extract "//text" input.xml
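If installing xidel is not an option, a rough equivalent can be sketched with the XML parser in Python's standard library. This assumes the whole file is well-formed XML (the excerpt in the question is only a fragment, so it would not parse on its own); the function name text_elements and the small wrapper document below are hypothetical.

```python
import xml.etree.ElementTree as ET


def text_elements(xml_source):
    """Return the content of every element whose local name is 'text'."""
    root = ET.fromstring(xml_source)
    contents = []
    for elem in root.iter():
        # Drop a "{namespace}" prefix, if present, before comparing names.
        local_name = elem.tag.rsplit("}", 1)[-1]
        if local_name == "text":
            contents.append(elem.text or "")
    return contents


# Hypothetical well-formed document wrapped around part of the sample.
doc = """<page>
  <revision>
    <comment>bot edit</comment>
    <text xml:space="preserve">={{-pt-}}=
==Substantivo==</text>
    <sha1>abc</sha1>
  </revision>
</page>"""

contents = text_elements(doc)
print(contents[0])
```

Unlike the regex approaches, this returns only the content of each element, without the surrounding <text> tags, and it leaves entity decoding to the parser.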
Upvotes: 2
Reputation: 195169
I assume that there is only one <text> block in your file. In vim, this command works for your sample text:
%s#\_.*\(<text.\{-}>\_.*</text>\)\_.*#\1#
Upvotes: 0