Reputation: 233
I have a huge text file, 20k+ lines, and I want to extract links from it.
What I need is a regular expression that generates a clean list of links.
The links I need start with http:// (without www) and end with .html
What would the expression look like?
Upvotes: 3
Views: 6002
Reputation: 5660
It would look like this for general websites whose pages end in .html:
(http|https)\://[a-zA-Z0-9\-\.]+\.[a-zA-Z]{2,}.+[a-zA-Z0-9\-\.]\.html
And to match exactly what you specified:
http\://[a-zA-Z0-9\-]+\.[a-z]{2,}\/[a-zA-Z0-9\-]+\.html
Just cut (Ctrl+X) the matches and paste (Ctrl+V) them into a new file and you have your list.
This works in JavaScript, Notepad++, and so on.
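As a rough sketch of how the stricter pattern could be applied outside an editor, here is a small Python example. The function name extract_links and the sample text are made up for illustration, and the pattern is the second one above with the dots escaped so that "." only matches a literal dot.

```python
import re

# Plain http:// links with a single host label (so no "www."), ending in .html.
LINK_RE = re.compile(r"http://[a-zA-Z0-9\-]+\.[a-z]{2,}/[a-zA-Z0-9\-]+\.html")

def extract_links(text):
    """Return every matching link in the order it appears in text."""
    return LINK_RE.findall(text)

# Hypothetical sample input standing in for the large text file.
sample = (
    "ewkgml http://test.com/a.html lamklwmwtmk\n"
    "skip http://www.test.com/b.html and https://test.com/c.html\n"
    "keep http://example.org/page-2.html too\n"
)

links = extract_links(sample)
print("\n".join(links))
```

For a real file, the text could be read with open(path).read() and passed to extract_links in the same way.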
\b matches a word boundary, so wrapping the pattern in \b finds only links that stand as whole words, as in: ewkgml http://test.com/a.html lamklwmwtmk
\B is the negation of it, so wrapping the pattern in \B finds links that are embedded in other text, as in: wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt
| is the "or" operator.
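To make the difference between \b and \B concrete, here is a minimal Python sketch using the two example strings from this answer. The names CORE, WHOLE_WORD and EMBEDDED are only illustrative, and the core pattern is assumed to be the stricter one with escaped dots.

```python
import re

# Core link pattern (dots escaped so they match literally).
CORE = r"http://[a-zA-Z0-9\-]+\.[a-z]{2,}/[a-zA-Z0-9\-]+\.html"

# \b on both sides: the link must start and end at a word boundary.
WHOLE_WORD = re.compile(r"\b" + CORE + r"\b")
# \B on both sides: the link must be glued to word characters on both ends.
EMBEDDED = re.compile(r"\B" + CORE + r"\B")

spaced = "ewkgml http://test.com/a.html lamklwmwtmk"
glued = "wegniwgnwkjnhttp://test.com/a.htmllmwtlkmt34lt"

print(WHOLE_WORD.findall(spaced))  # link found: it is surrounded by spaces
print(WHOLE_WORD.findall(glued))   # nothing: no boundary before "http"
print(EMBEDDED.findall(spaced))    # nothing: boundaries on both sides
print(EMBEDDED.findall(glued))     # link found inside the surrounding letters
```

Leaving out both \b and \B would match the link in either string, which is probably what is wanted when the goal is simply to collect every link.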
Upvotes: 1
Reputation: 1337
In Notepad++, open the Replace dialog (Ctrl+H) and enter
.*?(http://.*?\.html).*?
in the "Find what:" field and
$1\n
in the "Replace with:" field.
Check the "Regular expression" option and the ". matches newline" checkbox.
After you click "Replace All", you get a list of all links, one per line.
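The same find-and-replace idea can be sketched in Python, assuming re.DOTALL plays the role of the ". matches newline" checkbox and \1 plays the role of $1. The function name links_one_per_line and the sample text are hypothetical.

```python
import re

def links_one_per_line(text):
    """Replace everything up to and including each link with 'link + newline'."""
    # re.DOTALL lets "." match newlines, so the leading .*? can skip whole lines.
    return re.sub(r".*?(http://.*?\.html).*?", r"\1\n", text, flags=re.DOTALL)

# Hypothetical sample input with two links and some surrounding text.
sample = "foo http://a.com/x.html bar\nbaz http://b.org/y.html end"

result = links_one_per_line(sample)
print(result)
```

In this sketch, whatever follows the last link (here " end") is left in place, because no further match can start there, so the tail of the output may still need trimming by hand.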
Upvotes: 1