PowerShell script to get a part of an XML node's content

Question

How can I get just part of an XML node's text?

I have this piece of XML:

  ../Metadata/A_short_autobiography_of_Herculino_Alves.xml
  ../Metadata/Wordlist_and_phrases_-_modifiers.xml
  ../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml
  ../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml
  ../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml
  ../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml

I need to extract only this part of the text from each one:

../Metadata

../desano-silva-0151/Metadata

I have this code:

$j = 0
$TrgContent.METATRANSCRIPT.Corpus.CorpusLink | ForEach-Object {
    [String]$_.'#text' = % { $alltext[$j] + "xml"; $j++ }
}

But it gives me all the text:

../Metadata/A_short_autobiography_of_Herculino_Alves.xml

../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml

Thanks in advance for any help.

Lam Le · Accepted Answer

To achieve what you asked, I think there are two main steps:

Extract the content of XML nodes.
Trim the content and take what you need only.

I'm not really familiar with your existing script, so I will explain both steps here. You can skip the first step if you already have the node contents.

Extract content of XML nodes

My example XML document:


<Corpus>
    <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
    <CorpusLink>../Metadata/Wordlist_and_phrases_-_modifiers.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_fruits_and_cultural_items.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_parts_of_a_tree.xml</CorpusLink>
    <CorpusLink>../desano-silva-0151/Metadata/Wordlist_and_phrases_.xml</CorpusLink>
</Corpus>

PS script to get the content:

[xml] $XmlDocument = Get-Content D:\Path_To_Your_File
$XmlDocument.Corpus.CorpusLink # Content of the nodes you need
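
If the CorpusLink elements carry attributes (which the '#text' accessor in the question suggests), the dotted property access returns XML elements rather than plain strings. A minimal sketch of one way to get the text in both cases is to read InnerText from each node. The inline sample document and the Name attribute below are assumptions made up for illustration, not taken from the real file:

```powershell
# Hypothetical sample; the Name attribute is invented to show the attribute case.
[xml]$doc = @'
<Corpus>
    <CorpusLink>../Metadata/A_short_autobiography_of_Herculino_Alves.xml</CorpusLink>
    <CorpusLink Name="turtle">../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml</CorpusLink>
</Corpus>
'@

# SelectNodes always yields XmlElement objects, so InnerText works
# whether or not an element has attributes.
$links = foreach ($node in $doc.SelectNodes('/Corpus/CorpusLink')) {
    $node.InnerText
}

$links
```

Each entry in $links is then a plain string that can be passed to the trimming step.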

Trim the content

There are many ways to do this, but I will go with a regex. Simply loop through all the contents and apply the regex to each one.

$XmlDocument.Corpus.CorpusLink | ForEach-Object {
    # Greedy .* backtracks to the last slash, so the file name is dropped
    if ($_ -match '^(\.\./.*)/') {
        $Matches[1]
    }
}

About the regex: it captures everything from the leading ../ up to, but not including, the last /:

^       # Start of the string
(       # Start of the capture group (returned as $Matches[1])
\.\./   # Escaped dots followed by a slash: `../`
.*      # Any characters except line terminators, as many as possible (greedy)
)       # End of the capture group
/       # The last slash in the string, left out of the capture

I assume the structure of these strings is always like this, hence the regex.
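
As an alternative to the regex, the same trimming can be sketched with plain string methods, cutting each link at its last slash. Get-LinkDirectory is a hypothetical helper name introduced here for illustration, and the two sample links are copied from the question:

```powershell
# Hypothetical helper: returns everything before the last '/' in a link.
function Get-LinkDirectory {
    param([Parameter(Mandatory)][string]$Link)

    $lastSlash = $Link.LastIndexOf('/')
    if ($lastSlash -lt 0) { return $Link }   # no slash: nothing to trim
    return $Link.Substring(0, $lastSlash)
}

$samples = @(
    '../Metadata/A_short_autobiography_of_Herculino_Alves.xml',
    '../desano-silva-0151/Metadata/The_Turtle_and_the_Deer.xml'
)

$dirs = $samples | ForEach-Object { Get-LinkDirectory -Link $_ }
$dirs
# ../Metadata
# ../desano-silva-0151/Metadata
```

This version does not depend on the links starting with ../, so it may be a little more forgiving if the structure of the strings varies.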
