Reputation: 361
How can I get just a part of XML node text?
I have this piece of XML:
<CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
<CorpusLink >../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
<CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
I need to extract only this piece of text in each one:
../Metadata
../desano-silva-0151/Metadata
I have this code :
$j = 0
$TrgContent.METATRANSCRIPT.Corpus.CorpusLink | ForEach-Object {
[String]$_.'#text'= % {$alltext[$j] + "xml" $j++}}
But it gives me all the text:
../Metadata/A_short_autobiography_of_Herculino_Alves.xml
../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml
Thanks in advance for any help.
Upvotes: 1
Views: 331
Reputation: 1849
To achieve what you have asked. I think we have two main steps here:
I'm not really familiar with your existing scripts so I will explain all two steps here. The first step is optional to you.
My example XML document:
<Corpus>
<CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
<CorpusLink>../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
<CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
</Corpus>
PS script to get the content:
[xml] $XmlDocument = Get-Content D:\Path_To_Your_File
$XmlDocument.Corpus.CorpusLink # Content of the nodes you need
There are many methods but I think I will go with regex. Simply loop through all the contents and run the regex.
$XmlDocument2.Corpus.CorpusLink | Foreach-Object {
if ($_ -match "\.\.\/.*?\/") {
$Matches.Values
}
}
About the regex, it matches any character except for line terminators between ..\
and /
:
\.\. # Escape for 2 dots `..`
\/ # Escapefor slash `/`
.*? # Takes any character except for line terminators in between other listed characters (above and below)
\/ # Escape for slash `/`
I imply the structure of these strings is stable like that, hence the regex.
Upvotes: 1