Reputation: 432
I have a list with a lot of page URLs. I want to retrieve the unique websites.
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3-tablet-wifi-1-2ghz-cpu-flash10-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch-screen-android-2-3-tabletwifi-samsung-cortex-a8-1-2ghz-cpu-camera-1080p-external-3g"
"http://www.spiritualityandwellness.com/products/internalized-motivation"
"http://www.spiritualityandwellness.com/products/evergreen-motivation"
This should result in:
www.gadgetgiants.com
www.malma.mx
www.spiritualityandwellness.com
Upvotes: 1
Views: 74
Reputation: 17360
egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" YOUR_FILE_NAME | sort -u
I got the regex from here.
(Edit) Example Usage and Output
$ cat ur.txt
"http://www.gadgetgiants.com/products/mica-8-inch-touchscreen-android-2-3"
"http://www.malma.mx/products/pan-digital"
"http://www.gadgetgiants.com/products/snowpad-7-capacitive-multi-touch"
"http://www.spiritualityandwellness.com/products/internalized-motivation"
"http://www.spiritualityandwellness.com/products/evergreen-motivation"
"http://www.swellness.com.au/products/evergreen-motivation"
$ egrep -o "www\.[a-zA-Z0-9.-]*\.[a-zA-Z]{2,4}" ur.txt | sort -u
www.gadgetgiants.com
www.malma.mx
www.spiritualityandwellness.com
www.swellness.com.au
Upvotes: 1
Reputation: 5884
An idea without regex:
Retrieve the host from each address:
Uri uri = new Uri(yourLink);
string host = uri.Host;
Now you can put all these hosts into a HashSet (or similar) to keep only the unique ones.
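The same idea can be sketched in Java, whose `java.net.URI` exposes a comparable `getHost()` accessor; the class name `UniqueHosts` and the sample links are illustrative, not from the question:

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.LinkedHashSet;
import java.util.Set;

public class UniqueHosts {
    // Parse each link and collect its host; the set silently drops duplicates.
    static Set<String> uniqueHosts(String[] links) throws URISyntaxException {
        Set<String> hosts = new LinkedHashSet<>(); // keeps first-seen order
        for (String link : links) {
            hosts.add(new URI(link).getHost());
        }
        return hosts;
    }

    public static void main(String[] args) throws URISyntaxException {
        String[] links = {
            "http://www.gadgetgiants.com/products/mica-8-inch-touchscreen",
            "http://www.malma.mx/products/pan-digital",
            "http://www.gadgetgiants.com/products/snowpad-7-capacitive"
        };
        uniqueHosts(links).forEach(System.out::println);
    }
}
```

Unlike the regex approach, this relies on the URI parser, so it also handles hosts that do not start with `www.`.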
Upvotes: 0