skaeff

Reputation: 753

Help with Regex. Need to extract `<A HREF`

I have tags like <A HREF="f110111.ZIP">, where f110111 is an arbitrary character sequence. I need a C# regex match expression to extract all of them.

E.g., the input is

<A HREF="f110111.ZIP"><A HREF="qqq.ZIP"><A HREF="gygu.ZIP">

I want the list:

Upvotes: 1

Views: 764

Answers (5)

321X

Reputation: 3185

I think Regular Expressions are a great way to filter text out of a given text.

This regex gets the File, Filename and Extension from the given text.

href="(?<File>(?<Filename>.*?)(?<Ext>\.\w{1,3}))"

The regex above expects an extension of 1 to 3 word characters (a-z, A-Z, 0-9, or underscore).

C# Code sample:

string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
RegexOptions options = ((RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline) | RegexOptions.IgnoreCase);
Regex reg = new Regex(regex, options);
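The sample above builds the Regex but never runs it. As a rough sketch of how the named groups could be read back, the following applies the same pattern to the input from the question. The class and method names (HrefExtractor, ExtractFiles) are made up for illustration, and only RegexOptions.IgnoreCase is kept, on the assumption that the other options are not needed for this input.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is illustrative only.
public static class HrefExtractor
{
    // Same pattern as above. IgnoreCase lets "href" match the upper-case "HREF".
    private static readonly Regex HrefPattern = new Regex(
        "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"",
        RegexOptions.IgnoreCase);

    // Returns the value of the File group for every match, in input order.
    public static List<string> ExtractFiles(string html)
    {
        var files = new List<string>();
        foreach (Match m in HrefPattern.Matches(html))
        {
            files.Add(m.Groups["File"].Value);
        }
        return files;
    }

    public static void Main()
    {
        string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string file in ExtractFiles(input))
        {
            Console.WriteLine(file); // f110111.ZIP, then qqq.ZIP, then gygu.ZIP
        }
    }
}
```

m.Groups["Filename"].Value and m.Groups["Ext"].Value give the two halves separately if only the name or only the extension is wanted.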

Upvotes: 0

vbence

Reputation: 20333

If you can have multiple dots in the filename:

<A HREF="([^"]+?)\.zip

If you do not have dots in the filename (just one before the zip), you can use a faster one:

<A HREF="([^".]+)

Example (Java syntax):

Pattern pattern = Pattern.compile("<A HREF=\"([^\"]+?)\\.zip", Pattern.CASE_INSENSITIVE);

Matcher matcher = pattern.matcher(buffer);
while (matcher.find()) {
    // do something with: matcher.group(1)
}
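Since the question asks for C#, here is a sketch of the same loop using System.Text.RegularExpressions. It assumes the character class is meant to be [^"] (any character except a quote), escapes the dot before zip, and adds RegexOptions.IgnoreCase because the sample input uses .ZIP in upper case. ZipLinkFinder and FindNames are illustrative names, not part of any library.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

// Hypothetical helper; the name is illustrative only.
public static class ZipLinkFinder
{
    // Collects capture group 1 (the file name without ".zip") for each link.
    public static List<string> FindNames(string buffer)
    {
        var names = new List<string>();
        var pattern = new Regex("<A HREF=\"([^\"]+?)\\.zip", RegexOptions.IgnoreCase);
        foreach (Match match in pattern.Matches(buffer))
        {
            names.Add(match.Groups[1].Value);
        }
        return names;
    }

    public static void Main()
    {
        string buffer = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string name in FindNames(buffer))
        {
            Console.WriteLine(name); // f110111, then qqq, then gygu
        }
    }
}
```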

Upvotes: 2

jerone

Reputation: 16861

Try this one:

/<a href="([^">]+\.ZIP)/gi

Upvotes: 0

Jaapjan

Reputation: 3385

What you need is the Html Agility Pack! It lets you read HTML in an easy manner and provides an easy way to retrieve links.
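A minimal sketch of what that could look like, assuming the HtmlAgilityPack NuGet package is installed. The class name LinkLister and the .zip filter are assumptions added for illustration; they are not something the library or the question prescribes.

```csharp
using System;
using HtmlAgilityPack; // third-party package, installed separately via NuGet

// Hypothetical example class; the name is illustrative only.
public static class LinkLister
{
    public static void Main()
    {
        string html = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";

        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // SelectNodes returns null (not an empty collection) when nothing matches.
        var anchors = doc.DocumentNode.SelectNodes("//a[@href]");
        if (anchors == null)
        {
            return;
        }

        foreach (HtmlNode anchor in anchors)
        {
            string href = anchor.GetAttributeValue("href", string.Empty);
            // Assumed filter: keep only links that end in .zip, ignoring case.
            if (href.EndsWith(".zip", StringComparison.OrdinalIgnoreCase))
            {
                Console.WriteLine(href);
            }
        }
    }
}
```

The parser tolerates unclosed tags like the ones in the question, which is the main reason to prefer it over a regex when the markup is not under your control.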

Upvotes: 3

Jude Cooray

Reputation: 19862

NO NO! Do not use Regex to parse HTML!

Try an XML Parser. Or XPath perhaps.

Upvotes: 0
