Reputation: 353
I am using Stanford CoreNLP to extract various types of information from a given document. I am trying to detect URL patterns, and I can see that links beginning with http:// or https:// are recognized properly, but links beginning with ftp://, svn://, etc. are broken at the ':', so 'ftp' or 'svn' becomes a token instead of the complete link being recognized as a single token. Because of this, I am not able to use a regex to match them. I know there is a way to tokenize on whitespace only using tokenize.whitespace. Is there a way to stop ':' from splitting the URL so that the complete link is recognized as one token?
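For reference, the whitespace-only tokenization mentioned above can be enabled through pipeline properties; this is a sketch of a minimal configuration (the annotator list here is just an example):

```properties
# Tokenize on whitespace only, so "ftp://host/path" survives as one token.
# Trade-off: punctuation attached to words (e.g. trailing commas) will no
# longer be split off either.
annotators = tokenize, ssplit
tokenize.whitespace = true
```

With this option the URL stays intact, at the cost of losing the PTB-style handling of punctuation elsewhere in the text.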
Upvotes: 0
Views: 331
Reputation: 9450
Unfortunately, there isn't an easy way to just add extra URL patterns, because, for speed reasons, the tokenizer is implemented as a compiled finite automaton generated with JFlex. You can only do it by starting with PTBLexer.flex, editing it, regenerating the Java file with JFlex, setting javac loose on it, etc. For future versions, we're game to add useful patterns that won't detract from accurate tokenization in other places. I've added "ftp", "svn", and "svn+ssh". Anything else you'd like? (You could also put in a pull request.)
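To illustrate the kind of edit involved, here is a hedged JFlex sketch; the actual macro names and rules in PTBLexer.flex are different, and this only shows the general idea of widening the URL scheme alternation so the whole link matches as one token:

```
/* Illustrative JFlex fragment -- NOT the real PTBLexer.flex rules.
   Widening the scheme alternation lets ftp:// and svn:// URLs match
   the URL rule before the ':' rule can split them. */
SCHEME = (https?|ftp|svn(\+ssh)?)
URL    = {SCHEME}:\/\/[^ \t\r\n]+

%%

{URL}   { /* emit the whole match as a single token (hypothetical action) */ }
```

After editing the real .flex file, you regenerate the lexer with JFlex and recompile, as described above.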
Upvotes: 2