Reputation: 753

Help with Regex. Need to extract `<A HREF`

I have tags like <A HREF="f110111.ZIP">, where f110111 is an arbitrary character sequence. I need a C# regex that extracts all such file names.

E.g., the input is

<A HREF="f110111.ZIP"><A HREF="qqq.ZIP"><A HREF="gygu.ZIP">

I want the list:

f110111.ZIP
qqq.ZIP
gygu.ZIP

Upvotes: 1

Answers (5)

321X

Reputation: 3185

I think regular expressions are a good way to extract pieces of text from a given input.

This regex captures the full file name, the name without the extension, and the extension as the named groups File, Filename and Ext.

href="(?<File>(?<Filename>.*?)(?<Ext>\.\w{1,3}))"

The regex above expects an extension of 1 to 3 word characters (a-z, A-Z, 0-9 or underscore).

C# Code sample:

// IgnoreCase lets "href" match "HREF"; the other two options have no effect on this pattern.
string regex = "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"";
RegexOptions options = RegexOptions.IgnorePatternWhitespace | RegexOptions.Multiline | RegexOptions.IgnoreCase;
Regex reg = new Regex(regex, options);
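
A minimal sketch of how the named groups might be read back, run against the sample input from the question. The class name `HrefExtractor` and the method `ExtractFiles` are hypothetical names chosen for illustration, and only `RegexOptions.IgnoreCase` is passed here, on the assumption that the pattern needs nothing else.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class HrefExtractor
{
    // Same pattern as above; IgnoreCase lets "href" match "HREF".
    private static readonly Regex HrefRegex = new Regex(
        "href=\"(?<File>(?<Filename>.*?)(?<Ext>\\.\\w{1,3}))\"",
        RegexOptions.IgnoreCase);

    // Hypothetical helper: returns the File group of every match, in order.
    public static List<string> ExtractFiles(string input)
    {
        var files = new List<string>();
        foreach (Match match in HrefRegex.Matches(input))
        {
            files.Add(match.Groups["File"].Value);
        }
        return files;
    }

    public static void Main()
    {
        string input = "<A HREF=\"f110111.ZIP\"><A HREF=\"qqq.ZIP\"><A HREF=\"gygu.ZIP\">";
        foreach (string file in ExtractFiles(input))
        {
            Console.WriteLine(file);
        }
    }
}
```

For the sample input this prints f110111.ZIP, qqq.ZIP and gygu.ZIP, one per line. Use `match.Groups["Filename"].Value` or `match.Groups["Ext"].Value` if only the name or only the extension is needed.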

Upvotes: 0

vbence

Reputation: 20333

If you can have multiple dots in the filename:

<A HREF="([^"]+?)\.zip

If you do not have dots in the filename (just one before the zip), you can use a faster one:

<A HREF="([^".]+)

C# example:

// IgnoreCase because the sample input uses ".ZIP" rather than ".zip"
Regex regex = new Regex("<A HREF=\"([^\"]+?)\\.zip", RegexOptions.IgnoreCase);
foreach (Match match in regex.Matches(buffer))
{
    // do something with: match.Groups[1].Value (the name without ".zip")
}
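
To show how the two patterns differ on a name that contains extra dots, here is a rough sketch. It assumes negated character classes (`[^"]` and `[^".]`), and in the first pattern it moves `\.zip` inside the group so that the full file name is captured, as the question asks. `ZipLinkDemo`, `Capture` and the sample name `backup.2011.ZIP` are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class ZipLinkDemo
{
    // Allows dots inside the name; group 1 is the whole file name, extension included.
    public static readonly Regex WithDots = new Regex(
        "<A HREF=\"([^\"]+?\\.zip)\"", RegexOptions.IgnoreCase);

    // Assumes no dots in the name: group 1 stops at the first dot or quote.
    public static readonly Regex NoDots = new Regex(
        "<A HREF=\"([^\".]+)", RegexOptions.IgnoreCase);

    // Hypothetical helper: collects group 1 of every match, in order.
    public static List<string> Capture(Regex regex, string input)
    {
        var result = new List<string>();
        foreach (Match match in regex.Matches(input))
        {
            result.Add(match.Groups[1].Value);
        }
        return result;
    }

    public static void Main()
    {
        string buffer = "<A HREF=\"f110111.ZIP\"><A HREF=\"backup.2011.ZIP\">";
        Console.WriteLine(string.Join(", ", Capture(WithDots, buffer)));
        Console.WriteLine(string.Join(", ", Capture(NoDots, buffer)));
    }
}
```

The first line printed is `f110111.ZIP, backup.2011.ZIP` and the second is `f110111, backup`, so the shorter pattern is only safe when the name itself has no dots.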

Upvotes: 2