Paulo

Reputation: 361

PowerShell script to get part of an XML node's content

How can I get just part of an XML node's text?

I have this piece of XML:

  <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
  <CorpusLink >../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
  <CorpusLink >../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>

I need to extract only this piece of text in each one:

../Metadata

../desano-silva-0151/Metadata

I have this code:

$j = 0
$TrgContent.METATRANSCRIPT.Corpus.CorpusLink | ForEach-Object {
    [String]$_.'#text' = % { $alltext[$j] + "xml"; $j++ }
}

But it gives me all the text:

../Metadata/A_short_autobiography_of_Herculino_Alves.xml

../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml

Thanks in advance for any help.

Upvotes: 1

Views: 331

Answers (1)

Lam Le

Reputation: 1849

To achieve what you asked, I think there are two main steps:

  1. Extract the content of XML nodes.
  2. Trim the content and keep only what you need.

I'm not familiar with your existing script, so I will explain both steps here. The first step is optional for you.

Extract content of XML nodes

My example XML document:

<Corpus>
    <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
    <CorpusLink>../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
</Corpus>

PowerShell script to get the content:

[xml] $XmlDocument = Get-Content D:\Path_To_Your_File
$XmlDocument.Corpus.CorpusLink # Content of the nodes you need
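
As a self-contained sketch of this step, the XML can also be cast from an inline string instead of a file. The here-string below is a shortened copy of the sample document, and `$xmlText` and `$links` are names chosen only for illustration. It assumes the `CorpusLink` elements carry no attributes, in which case PowerShell returns each node's text as a plain string.

```powershell
# Sample document inlined as a here-string so the snippet runs without a file.
$xmlText = @'
<Corpus>
    <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
</Corpus>
'@

# Casting to [xml] parses the string into an XmlDocument.
[xml] $XmlDocument = $xmlText

# Elements that contain only text are exposed as strings by the XML adapter.
$links = @($XmlDocument.Corpus.CorpusLink)
$links
```

When the elements do carry attributes, as they may in the question's file, each item is an XML element instead and its text is read through the `'#text'` property, as in the question's code.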

Trim the content

There are many ways to do this, but I will go with a regex. Simply loop through the contents and match each one against it.

$XmlDocument.Corpus.CorpusLink | ForEach-Object {
    if ($_ -match "\.\.\/.*?\/") {
        # $Matches[0] holds the text matched by the whole pattern
        $Matches[0]
    }
}

About the regex: it matches `../`, then as few characters as possible (any character except line terminators), up to and including the next `/`:

\.\.  # Two escaped dots, matching a literal `..`
\/    # Escaped slash `/`
.*?   # Lazily matches any characters except line terminators, up to the next listed character
\/    # Escaped slash `/`

I assume the structure of these strings is stable, hence the regex.
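
Note that with the sample data this pattern stops at the first slash after `../`, so the nested links yield `../desano-silva-0151/` rather than `../desano-silva-0151/Metadata`. If the whole directory part is wanted, as in the question, one possible variation is to capture everything before the last slash. This is only a sketch: the function name `Get-LinkDirectory` is made up for illustration, and it assumes every link contains at least one `/`.

```powershell
# Hypothetical helper: returns the part of a link before its last '/'.
function Get-LinkDirectory {
    param([string] $Link)
    # The greedy (.*) runs up to the last '/', and [^/]* consumes the file name,
    # so capture group 1 is the directory part without a trailing slash.
    if ($Link -match '^(.*)/[^/]*$') {
        $Matches[1]
    }
}

$links = @(
    '../Metadata/A_short_autobiography_of_Herculino_Alves.xml',
    '../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml'
)

$dirs = $links | ForEach-Object { Get-LinkDirectory $_ }
$dirs
```

Anchoring the pattern with `^` and `$` makes it cover the whole string, so the result does not depend on how many folders precede the file name.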

Upvotes: 1
