Reputation: 153

Regex to extract variable substring

I really have tried to solve this myself but have been bashing my head against a brick wall with this one.

I have a file with many rows like this:

<outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description="" lineageId="426" precision="0" scale="0" length="255" dataType="wstr" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Conversion" errorRowDisposition="FailComponent" truncationRowDisposition="FailComponent" externalMetadataColumnId="425" mappedColumnId="0"/>

I want a regexp to return just the string between the name=" and the next "

In this case, it's 'Net Salary per month € (3rd Applicant)' but it could be anything. That's what I meant by extracting a variable substring.

Thanks in advance.

Upvotes: 2

Answers (4)

mklement0

Reputation: 437082

There are helpful regexes in the existing answers; using one with the -replace operator allows you to extract the information of interest in a single operation:

$line = '<outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description="" lineageId="426" precision="0" scale="0" length="255" dataType="wstr" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Conversion" errorRowDisposition="FailComponent" truncationRowDisposition="FailComponent" externalMetadataColumnId="425" mappedColumnId="0"/>'

# Extract the "name" attribute value.
# Note how the regex is designed to match the *full line*, which is then
# replaced with what the first (and only) capture group, (...), matched, $1
$line -replace '^.+ name="([^"]*).+', '$1'

This outputs a string with verbatim content Net Salary per month € (3rd Applicant).
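One caveat: when the regex does not match, -replace returns the input string unchanged, so a line without a name attribute would pass through as-is. Below is a minimal sketch of an alternative, assuming you would rather get nothing for such lines; it uses -match and the automatic $Matches variable. The sample lines are shortened and hypothetical.

```powershell
# Shortened sample lines (hypothetical): the second one has no name attribute.
$lines = @(
  '<outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description=""/>'
  '<outputColumn id="427" description=""/>'
)

# -match populates the automatic $Matches variable on success;
# index 1 holds what the first capture group matched.
$names = foreach ($line in $lines) {
  if ($line -match ' name="([^"]*)"') { $Matches[1] }
}

$names
```

Only the first line contributes a value here; the second is skipped rather than echoed back.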

Taking a step back: Your sample line is a valid XML element, and it's always preferable to use a dedicated XML parser.

Parsing each line as XML will be slow, but perhaps you can parse the entire file, which offers a simple solution using PowerShell's property-based adaptation of the XML DOM, via the [xml] type (System.Xml.XmlDocument):

$fileContent = @'
<xml>
<outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description="" lineageId="426" precision="0" scale="0" length="255" dataType="wstr" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Conversion" errorRowDisposition="FailComponent" truncationRowDisposition="FailComponent" externalMetadataColumnId="425" mappedColumnId="0"/>
<outputColumn id="427" name="Net Salary per month € (4th Applicant)" description="" lineageId="426" precision="0" scale="0" length="255" dataType="wstr" codePage="0" sortKeyPosition="0" comparisonFlags="0" specialFlags="0" errorOrTruncationOperation="Conversion" errorRowDisposition="FailComponent" truncationRowDisposition="FailComponent" externalMetadataColumnId="425" mappedColumnId="0"/>
</xml>
'@

([xml] $fileContent).xml.outputColumn.name

The above yields the "name" attribute values across all <outputColumn> elements:

Net Salary per month € (3rd Applicant)
Net Salary per month € (4th Applicant)
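If the <outputColumn> elements sit at varying depths in the real file, an XPath query may be easier than dot notation. Here is a rough sketch using Select-Xml, assuming a document without a default XML namespace (with one, the query would need the -Namespace parameter). The wrapper element names and the shortened sample are hypothetical.

```powershell
# Hypothetical, shortened sample document; for a real file you could pass
# its path to Select-Xml -Path instead of using -Content.
$xmlText = @'
<root>
  <outputs>
    <outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description=""/>
    <outputColumn id="427" name="Net Salary per month € (4th Applicant)" description=""/>
  </outputs>
</root>
'@

# //outputColumn/@name selects the name attribute of every <outputColumn>
# element, regardless of nesting depth.
$names = Select-Xml -Content $xmlText -XPath '//outputColumn/@name' |
  ForEach-Object { $_.Node.Value }

$names
```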

Upvotes: 0

Sascha Kolberg

Reputation: 7152

As there are a lot of '"' characters after name, you will probably need a lazy quantifier.

Try:

^.*name=\"(.+?)\".*$

This matches the whole line and should give you what you want in the group (.+?).
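To illustrate why laziness matters here, this small sketch compares a greedy and a lazy group on a shortened, hypothetical sample line. It uses PowerShell's -match operator, which is an assumption about your environment; the same behavior applies in most regex engines.

```powershell
# Shortened sample line (hypothetical).
$line = '<outputColumn id="426" name="A B" description="" length="255"/>'

# Greedy: .+ runs to the end of the line, then backtracks to the LAST quote.
$greedy = if ($line -match 'name="(.+)"') { $Matches[1] }

# Lazy: .+? stops at the FIRST quote that lets the match succeed.
$lazy = if ($line -match 'name="(.+?)"') { $Matches[1] }

$greedy   # A B" description="" length="255
$lazy     # A B
```

Note that (.+?) requires at least one character, so an empty name="" would not yield an empty capture.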

Upvotes: 0

Vineet Kumar Doshi

Reputation: 4860

This may help: Regex = name="(.*?)"

DEMO

https://regex101.com/r/uF4oY4/51

Let me know if it helps.

Upvotes: 2

vks

Reputation: 67968

(?<=name=")[^"]*

This should do it for you. See the demo.

https://regex101.com/r/uF4oY4/50

If your regex engine doesn't support lookarounds, use

name="([^"]*)

and take capture group 1.
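As a sketch of the lookbehind variant in practice, assuming PowerShell and the .NET regex engine (which supports lookbehind), the following collects every value from a multi-line string in one pass. The sample text is shortened and hypothetical. Be aware that name=" would also match inside attributes whose names merely end in "name"; the sample has none.

```powershell
# Hypothetical, shortened multi-line sample.
$text = @'
<outputColumn id="426" name="Net Salary per month € (3rd Applicant)" description=""/>
<outputColumn id="427" name="Net Salary per month € (4th Applicant)" description=""/>
'@

# The lookbehind (?<=name=") asserts the prefix without consuming it,
# so each match's Value is the attribute value itself.
$names = [regex]::Matches($text, '(?<=name=")[^"]*') | ForEach-Object { $_.Value }

$names
```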

Upvotes: 2
