Reputation: 913
I've tried the answers I found on Stack Overflow, but none of them are supported at https://regexr.com. I have an .OPML file with a large number of podcasts and descriptions, in the following format:
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
What regex can I use to get just the title and the link?
Software Engineering Daily
Upvotes: 0
Views: 72
Reputation: 22817
There are many ways to go about this. The best way is likely using an XML parser. I would definitely read this post that discusses use of regex, especially with XML.
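As a rough sketch of the parser route, assuming Python (the question does not name a language), the standard library's xml.etree.ElementTree can read the attributes directly. The opml/body wrapper around the outline element and the extract_feeds helper name are illustrative assumptions, not taken from the asker's file.

```python
import xml.etree.ElementTree as ET

# Only the <outline> line comes from the question; the <opml>/<body>
# wrapper is an assumed, minimal OPML skeleton.
OPML = """<opml version="1.0">
  <body>
    <outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
  </body>
</opml>"""


def extract_feeds(opml_text):
    """Return (title, feed URL) pairs for every <outline> with an xmlUrl."""
    root = ET.fromstring(opml_text)
    feeds = []
    for outline in root.iter("outline"):
        url = outline.get("xmlUrl")
        if url is not None:  # skip folder-style outlines without a feed
            feeds.append((outline.get("text"), url))
    return feeds


feeds = extract_feeds(OPML)
for title, url in feeds:
    print(title, url)
```

For a file on disk, ET.parse(path).getroot() could replace ET.fromstring. The parser also decodes entities such as &amp; in attribute values, which none of the regexes below do.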
As you can see there are many answers to your question. It also depends on which language you are using since regex engines differ. Some accept backreferences, whilst others do not. I'll post multiple methods below that work in different circumstances/for different regex flavours. You can probably piece together from the multiple regex methods below which parts work best for you.
Method 1
This method works in almost any regex flavour (at least the common ones).
It only checks for " as the opening and closing mark of the attribute value, and does not allow for whitespace before or after the = symbol. This is the simplest solution to get the values you want.
\b(text|xmlUrl)="[^"]*"
The following variations extend the expression above:
\b(text|xmlUrl)\s*=\s*"[^"]*" (allows whitespace around =)
\b(text|xmlUrl)=(?:"[^"]*"|'[^']*') (allows ' to be used as the attribute value delimiter)
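Here is a minimal sketch of how the Method 1 variations could be combined and used from Python's re module. The language choice, the attributes helper, and the extra capture group around the quoted value are my assumptions for illustration.

```python
import re

LINE = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# Method 1 with both variations: optional whitespace around "=" and either
# quote style. Group 2 is added here so the quoted value can be trimmed.
PATTERN = re.compile(r"""\b(text|xmlUrl)\s*=\s*("[^"]*"|'[^']*')""")


def attributes(line):
    """Map attribute name -> value with the surrounding quotes removed."""
    return {m.group(1): m.group(2)[1:-1] for m in PATTERN.finditer(line)}


attrs = attributes(LINE)
print(attrs["text"])
print(attrs["xmlUrl"])
```

Slicing off the first and last character works because the pattern guarantees the value starts and ends with a quote of the same kind.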
As another alternative (following the comments below my answer), if you want to grab every attribute except specific ones, you can use the following. Note that I use \w, which should cover most attribute names, but you can replace it with whatever valid characters you want: \S matches any non-whitespace character, and a set such as [\w-] matches any word or hyphen character. The exclusion of the specific attributes is done by the negative lookahead (?!text|xmlUrl), which fails the match if the name starts with either of those strings. Also note that the word boundary \b at the beginning ensures that a match starts at the beginning of an attribute name rather than partway through one. Without it, Method 1 would match the text inside subtext, and this pattern would match the ext left over from text.
\b((?!text|xmlUrl)\w+)="[^"]*"
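To illustrate the exclusion pattern, here is a small Python sketch (Python is my assumption). It swaps \w for the [\w-] set mentioned above and adds a capture group around the value so both parts can be read back.

```python
import re

LINE = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# Every double-quoted attribute whose name does not start with "text" or
# "xmlUrl". Group 1 is the name, group 2 (added here) is the value.
OTHERS = re.compile(r'\b((?!text|xmlUrl)[\w-]+)="([^"]*)"')

others = {m.group(1): m.group(2) for m in OTHERS.finditer(LINE)}
print(others)
```

One caveat: because the lookahead only inspects the start of the name, an attribute such as textColor would be skipped too. Writing the lookahead as (?!(?:text|xmlUrl)=) would exclude only the exact names.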
Method 2
This method only works with regex flavours that allow backreferences. Apparently JGsoft applications, Delphi, Perl, Python, Ruby, PHP, R, Boost, and Tcl support single-digit backreferences. Double-digit backreferences are supported by JGsoft applications, Delphi, Python, and Boost. This information is according to this article about numbered backreferences on Regular-Expressions.info.
This method uses a backreference to ensure that the same mark opens and closes the attribute's value, and it also allows whitespace around the = symbol. It does not allow for attribute values with no delimiter (something like xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ may also be valid).
\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2
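A quick sketch of the backreference version in Python (my choice of language). The sample line with mixed quotes and the example.com URL are made up to show that a value may contain the other kind of quote.

```python
import re

# Group 1: attribute name, group 2: opening quote, group 3: value.
# \2 requires the closing quote to be the same character as the opening one.
QUOTED = re.compile(r'''\b(text|xmlUrl)\s*=\s*(["'])(.*?)\2''')

# Hypothetical input, not from the question.
sample = """<outline text='Say "hi"' xmlUrl = "http://example.com/it's.xml" />"""

pairs = [(m.group(1), m.group(3)) for m in QUOTED.finditer(sample)]
for name, value in pairs:
    print(name, value)
```

The lazy .*? stops at the first quote matching \2, so the " inside the single-quoted title and the ' inside the double-quoted URL do not end the value early.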
Method 3
This method is the same as Method 2 but also allows attribute values with no delimiters (note that such values are treated as delimited by whitespace, so the match only runs until the next space).
\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))
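In Python (assumed here), this pattern leaves the value in group 3 when it was quoted and in group 4 when it was bare, so the caller has to pick whichever one took part in the match. The value_of helper and the sample line with an unquoted xmlUrl are illustrative.

```python
import re

ANY_VALUE = re.compile(r'''\b(text|xmlUrl)\s*=\s*(?:(["'])(.*?)\2|(\S*))''')


def value_of(match):
    """Return the attribute value from whichever alternative matched."""
    # Group 2 (the quote character) is only set for quoted values.
    if match.group(2) is not None:
        return match.group(3)
    return match.group(4)


# Hypothetical input: xmlUrl deliberately left unquoted.
sample = ('<outline text="Software Engineering Daily" '
          'xmlUrl=http://softwareengineeringdaily.com/feed/podcast/ />')

values = [(m.group(1), value_of(m)) for m in ANY_VALUE.finditer(sample)]
print(values)
```

Note that a bare value runs to the next whitespace, so a closing /> written directly after it with no space would be swallowed into the value.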
Method 4
While Method 3 works, some people might complain that the attribute value may end up in either of two groups. This can be fixed by either of the following methods.
Method 4.A
Branch reset groups are only possible in a few flavours, notably JGsoft V2, PCRE 7.2+, PHP, Delphi, R (with PCRE enabled), and Boost 1.42+, according to Regular-Expressions.info.
This also shows the method you would use if backreferences aren't possible and you wanted to match multiple delimiters ("([^"]*)"|'([^']*)').
\b(text|xmlUrl)\s*=\s*(?|"([^"]*)"|'([^']*)'|(\S*))
Method 4.B
Duplicate subpattern names are not often supported. See this Regular-Expressions.info article for more information.
This method uses the J regex flag, which allows duplicate subpattern names ((?<v>) is in there twice).
\b(text|xmlUrl)\s*=\s*(?:(["'])(?<v>.*?)\2|(?<v>\S*))
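Python's built-in re module supports neither branch reset groups nor duplicate group names, so neither 4.A nor 4.B runs there as written. A possible workaround, sketched below, is to give each alternative its own name and take the first one that is set. The group names dq, sq, and bare, the value_of helper, and the sample line are my own choices for illustration.

```python
import re

# Same three alternatives as Method 4.A, but each value group gets a
# distinct name because stdlib re rejects duplicate names and (?|...).
PATTERN = re.compile(
    r"""\b(?P<name>text|xmlUrl)\s*=\s*"""
    r"""(?:"(?P<dq>[^"]*)"|'(?P<sq>[^']*)'|(?P<bare>\S*))"""
)


def value_of(match):
    """Return the value from whichever of the three alternatives matched."""
    for key in ("dq", "sq", "bare"):
        if match.group(key) is not None:
            return match.group(key)
    return None


# Hypothetical input mixing single-quoted, bare, and double-quoted values.
sample = ("<outline text='Software Engineering Daily' type=rss "
          'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" />')

values = [(m.group("name"), value_of(m)) for m in PATTERN.finditer(sample)]
print(values)
```

The result is the same single "value" slot that the branch reset group or the duplicate name provides, just resolved in code instead of inside the pattern.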
<outline text="Software Engineering Daily" type="rss" xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" htmlUrl="http://softwareengineeringdaily.com" />
Each line below represents a different group. New matches are separated by two lines.
text
Software Engineering Daily
xmlUrl
http://softwareengineeringdaily.com/feed/podcast/
I'll explain the different parts of the regexes used above so that you understand what each part does. This is more of a reference for the methods above.
"[^"]*"
This is the fastest method possible (to the best of my knowledge) for grabbing anything between two " symbols. Note that it does not check for backslash-escaped quotes; it will match any non-" characters between two ". Whilst "(.*?)" can also be used, it's slightly slower.
(["'])(.*?)\2
This is basically shorthand for "(.*?)"|'(.*?)'. You can use any of the following methods to get the same result:
(?:"(.*?)"|'(.*?)')
(?:"([^"]*)"|'([^']*)') <-- slightly faster than line above
(?|)
This is a branch reset group. When you place groups inside it, like (?|(x)|(y)), it returns the same group index for both matches. This means that if x is captured, it gets a group index of 1, and if y is captured, it also gets a group index of 1.
Upvotes: 1
Reputation: 43169
For simple HTML strings you might get by with
Url=(['"])(.+?)\1
Here, take group $2; see a demo on regex101.com.
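For reference, a small Python sketch (my assumption on language) of what this expression returns on the sample line from the question. Since the pattern is not anchored to a full attribute name, it picks up both xmlUrl and htmlUrl. The stricter FEED_URL variant is my addition.

```python
import re

LINE = ('<outline text="Software Engineering Daily" type="rss" '
        'xmlUrl="http://softwareengineeringdaily.com/feed/podcast/" '
        'htmlUrl="http://softwareengineeringdaily.com" />')

# The expression from this answer: group 1 is the quote, group 2 the value.
ANY_URL = re.compile(r'''Url=(['"])(.+?)\1''')
urls = [m.group(2) for m in ANY_URL.finditer(LINE)]

# Assumed stricter variant that only accepts the xmlUrl attribute.
FEED_URL = re.compile(r'''\bxmlUrl=(['"])(.+?)\1''')
feed_only = [m.group(2) for m in FEED_URL.finditer(LINE)]

print(urls)
print(feed_only)
```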
Obligatory: consider using a parser instead (see here).
Upvotes: 1