Reputation: 1393
I'm trying to parse URLs from a file. My regex works for about 80% of the lines, but I need to modify it to handle the exceptions. It's starting to get complicated, and I would like to know how I could write a clean regex for this input file that captures the host in one group and the URI part in a second.
Example: http://stackoverflow.com/index.php, where stackoverflow.com is the host and /index.php is the URI.
Input file:
//cdn.sstatic.net/stackoverflow/img/favicon.ico
//cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png
/opensearch.xml
/
#
http://www.stackoverflow.com
http://www.stackoverflow.com/
http://stackoverflow.com/
http://careers.stackoverflow.com
aaa#aaa.com
aaa.com#aaa
aaa#aaa
#aaa
#
fakedomain/index.php
fakedomain.com/index.php
fakedomain.com/
/fakedomain.com/
/index.html/
index.html
Regex:
(?:.*?//)?(.*?)(/.*|$)
Result:
1 : //cdn.sstatic.net/stackoverflow/img/favicon.ico has 2 groups:
cdn.sstatic.net
/stackoverflow/img/favicon.ico
2 : //cdn.sstatic.net/stackoverflow/img/apple-touch-icon.png has 2 groups:
cdn.sstatic.net
/stackoverflow/img/apple-touch-icon.png
3 : /opensearch.xml has 2 groups:
/opensearch.xml
4 : / has 2 groups:
/
5 : http://www.stackoverflow.com has 2 groups:
http:
//www.stackoverflow.com
6 : http://www.stackoverflow.com/ has 2 groups:
www.stackoverflow.com
/
7 : http://stackoverflow.com/ has 2 groups:
stackoverflow.com
/
8 : http://careers.stackoverflow.com has 2 groups:
http:
//careers.stackoverflow.com
7 : fakedomain/index.php has 2 groups:
fakedomain
/index.php
8 : fakedomain.com/index.php has 2 groups:
fakedomain.com
/index.php
9 : fakedomain.com/ has 2 groups:
fakedomain.com
/
10 : /fakedomain.com/ has 2 groups:
/fakedomain.com/
11 : /index.html/ has 2 groups:
/index.html/
12 : index.html has 2 groups:
index.html
13 : has 2 groups:
C# regex tester : http://derekslager.com/blog/posts/2007/09/a-better-dotnet-regular-expression-tester.ashx
So how could I exclude links ending in .ico or .png, add some other fixes, and still end up with a clean regex?
Upvotes: 2
Views: 103
Reputation: 1381
Regular expressions are a very flexible tool, but for any sort of standardized format, there is almost always a standard parser that does the job faster and better.
Use System.Uri (http://msdn.microsoft.com/en-us/library/system.uri.aspx), which will handle all of the corner cases for you.
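As a rough illustration of that suggestion, here is a minimal sketch that resolves each line against a base URL and reads the host and path from the resulting Uri. The base URL, the sample lines, and the helper names TrySplit and IsIcoOrPng are assumptions made for this example, not part of the question. Note that under standard URI resolution a scheme-less line such as fakedomain.com/index.php is treated as a relative path on the base host, so check the results against your own data.

```csharp
using System;

static class UrlSplitter
{
    // Hypothetical helper: resolves `line` against `baseUri` and splits it
    // into host and path. Returns false if the line cannot be resolved or
    // the result is not an http/https URL.
    public static bool TrySplit(Uri baseUri, string line, out string host, out string path)
    {
        host = string.Empty;
        path = string.Empty;

        if (!Uri.TryCreate(baseUri, line, out Uri uri))
            return false;
        if (uri.Scheme != Uri.UriSchemeHttp && uri.Scheme != Uri.UriSchemeHttps)
            return false;

        host = uri.Host;          // e.g. stackoverflow.com
        path = uri.AbsolutePath;  // e.g. /index.php (no query string or #fragment)
        return true;
    }

    // Hypothetical filter for the .ico / .png links mentioned in the question.
    public static bool IsIcoOrPng(string path)
    {
        return path.EndsWith(".ico", StringComparison.OrdinalIgnoreCase)
            || path.EndsWith(".png", StringComparison.OrdinalIgnoreCase);
    }
}

static class Program
{
    static void Main()
    {
        // Assumption: the links were collected from a page on this site.
        var baseUri = new Uri("http://stackoverflow.com/");

        // A few lines taken from the input file in the question.
        string[] lines =
        {
            "http://stackoverflow.com/index.php",
            "//cdn.sstatic.net/stackoverflow/img/favicon.ico",
            "/opensearch.xml",
            "http://www.stackoverflow.com",
        };

        foreach (string line in lines)
        {
            if (!UrlSplitter.TrySplit(baseUri, line, out string host, out string path))
                continue; // not resolvable as an http(s) URL
            if (UrlSplitter.IsIcoOrPng(path))
                continue; // skip image links

            Console.WriteLine($"{host} {path}");
        }
    }
}
```

For the first sample line this prints stackoverflow.com /index.php. Filtering on AbsolutePath rather than the raw line keeps a query string or fragment from hiding the file extension.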
Upvotes: 7