Zack

Reputation: 2869

Parsing a string to extract a URL or folder path

I asked a similar question recently about using regex to retrieve a URL or folder path from a string. I was looking at this comment by Dour High Arch, where he says:

"I recommend you do not use regexes at all; use separate code paths for URLs, using the Uri class, and file paths, using the FileInfo class. These classes already handle parsing, matching, extracting components, and so on."

I never really tried this, but now that I am looking into it, I can't figure out whether his suggestion actually helps with what I'm trying to accomplish.

I want to be able to parse a string message that could be something like:

"I placed the files on the server at http://www.thewebsite.com/NewStuff, they can also be reached on your local network drives at J:\Downloads\NewStuff"

And extract the two strings http://www.thewebsite.com/NewStuff and J:\Downloads\NewStuff. I don't see any methods on the Uri or FileInfo classes that pull a URL or file path out of a larger string, which is what I think Dour High Arch was implying.

Is there something I'm missing about the Uri or FileInfo class that allows this behavior? If not, is there some other class in the framework that does this?

Upvotes: 1

Views: 2165

Answers (4)

Michael Dyck

Reputation: 2413

Try \w+:\S+ and see how well that fits your purposes.
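
For anyone who wants to try that pattern quickly, here is a minimal sketch. The class and method names (CandidateFinder, Find) are made up for illustration, and the trailing-punctuation trim is my own assumption, since \S+ also swallows the comma after the URL in the sample message. Keep in mind the pattern will match anything of the form word-colon-text, so it may return more than URLs and paths.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class CandidateFinder
{
    // Pattern from the answer: word characters, a colon, then non-whitespace.
    private static readonly Regex Candidate = new Regex(@"\w+:\S+");

    // Hypothetical helper: returns every substring that matches the pattern.
    public static List<string> Find(string message)
    {
        var results = new List<string>();
        foreach (Match m in Candidate.Matches(message))
        {
            // Assumption: trailing sentence punctuation is not part of the URL or path.
            results.Add(m.Value.TrimEnd(',', '.', ';'));
        }
        return results;
    }
}
```

Calling CandidateFinder.Find on the message from the question should give back the URL and the J:\ path as two separate strings.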

Upvotes: -1

Sedecimdies

Reputation: 152

You can use:

(?<type>[^ ]+?:)(?<path>//[^ ]*|\\.+\\[^ ]*)

That gives you two named groups for each match:

type: "http:"

path: //www.thewebsite.com/NewStuff

and

type: "J:"

path: \Downloads\NewStuff

from the string

"I placed the files on the server at http://www.thewebsite.com/NewStuff, they can also be reached on your local network drives at J:\Downloads\NewStuff"

You can check the "type" group to see whether it is http: or not and act on that.


EDIT

Or use the regex below if you are sure there is no whitespace in your file path:

(?<type>[^ ]+?:)(?<path>//[^ ]*|\\[^ ]*)
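
Here is a rough sketch of reading the two named groups in C#, using the second (no-whitespace) pattern. The names GroupExtractor and Extract are only for illustration. One thing I noticed: [^ ]* also picks up the comma that follows the URL in the sample message, so I trim trailing punctuation; that trim is my assumption and not part of the regex.

```csharp
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class GroupExtractor
{
    // Second pattern from the answer: assumes no whitespace inside the path.
    private static readonly Regex Pattern =
        new Regex(@"(?<type>[^ ]+?:)(?<path>//[^ ]*|\\[^ ]*)");

    // Hypothetical helper: returns (type, path) pairs for every match.
    public static List<KeyValuePair<string, string>> Extract(string message)
    {
        var results = new List<KeyValuePair<string, string>>();
        foreach (Match m in Pattern.Matches(message))
        {
            string type = m.Groups["type"].Value;
            // Assumption: trailing ',', '.' or ';' belongs to the sentence, not the path.
            string path = m.Groups["path"].Value.TrimEnd(',', '.', ';');
            results.Add(new KeyValuePair<string, string>(type, path));
        }
        return results;
    }
}
```

You can then branch on the Key of each pair ("http:" versus a drive letter such as "J:") to decide how to handle the Value.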

Upvotes: 1

Dour High Arch

Reputation: 21712

It was not clear from your earlier question that you wanted to extract URL and file path substrings from larger strings. In that case, neither Uri.IsWellFormedUriString nor Regex.Match will do what you want. Indeed, I do not think any simple method can do what you want, because you will have to define rules for ambiguous strings like httX://wasThatAUriScheme/andAre/these part/of/aURL or/are they/separate.strings?andIsThis%20a%20Param?

My suggestion is to define a recursive descent parser and create states for each substring you need to distinguish.

Upvotes: 1

CSharpie

Reputation: 9467

I'd say the easiest way is to split the string into parts first.

The first delimiter would be spaces, which gives you each word; the second would be quotes (double and single).

Then use Uri.IsWellFormedUriString on each token.

So something like:

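foreach (var part in someRandomText.Split(new[] { '\'', '"', ' ' }, StringSplitOptions.RemoveEmptyEntries))
{
    // UriKind.Absolute: with RelativeOrAbsolute, almost any plain word counts as a well-formed relative URI.
    if (Uri.IsWellFormedUriString(part, UriKind.Absolute))
        doSomethingWith(part);
}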

I just saw that Uri.IsWellFormedUriString may be a bit too strict for your needs. It returns false if www.Whatever.com is missing the http://
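
A variation on the same idea, as a sketch only: split into tokens, then let Uri.TryCreate do the parsing so you get a Uri object with Scheme, Host and AbsolutePath already separated. The names UrlTokens and FindHttpUrls are made up, and both the punctuation trim and the http/https-only filter are my assumptions. This covers the URL side only; file paths would still need their own check.

```csharp
using System;
using System.Collections.Generic;

public static class UrlTokens
{
    // Hypothetical helper: returns every token that parses as an absolute http/https URL.
    public static List<Uri> FindHttpUrls(string text)
    {
        var urls = new List<Uri>();
        var separators = new[] { ' ', '"', '\'' };
        foreach (var token in text.Split(separators, StringSplitOptions.RemoveEmptyEntries))
        {
            // Assumption: trailing ',', '.' or ';' is sentence punctuation, not part of the URL.
            var candidate = token.TrimEnd(',', '.', ';');
            if (Uri.TryCreate(candidate, UriKind.Absolute, out Uri uri)
                && (uri.Scheme == Uri.UriSchemeHttp || uri.Scheme == Uri.UriSchemeHttps))
            {
                urls.Add(uri);
            }
        }
        return urls;
    }
}
```

For the message in the question this should return a single Uri whose Host is www.thewebsite.com and whose AbsolutePath is /NewStuff.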

Upvotes: 1
