Reputation: 1821
I am supposed to read a text file via Java and blank out all the e-mail ids and URLs in the text file. This is to reduce noise in the data.
Are there any library functions in Java that can do this?
Upvotes: 0
Views: 104
Reputation: 2812
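String.replaceAll() takes a regex and a replacement string (in your case ""). Note that String.replace() treats its first argument as literal text, not as a regex. Use one regex for email addresses and one for URLs to accomplish this task.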
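A minimal sketch of the regex approach, for illustration only. The class name `NoiseStripper`, the method `strip`, and both patterns are assumptions made up for this example; the patterns are deliberately simple and will miss some valid addresses and URLs.

```java
import java.util.regex.Pattern;

final class NoiseStripper {
    // Simplified patterns assumed for illustration; neither is RFC-complete.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Removes URLs first, then e-mail addresses, replacing each match with "".
    static String strip(String text) {
        String withoutUrls = URL.matcher(text).replaceAll("");
        return EMAIL.matcher(withoutUrls).replaceAll("");
    }
}
```

Precompiling the patterns avoids recompiling the regex on every call, which is what `String.replaceAll()` does internally. Surrounding whitespace is left alone, so a removed item leaves a double space behind.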
Upvotes: 0
Reputation: 100050
Typically in an NLP system the text will be tokenized, and dealing with URLs or email addresses is just one case of reducing low-frequency tokens to placeholders to reduce data sparsity. Assuming that the tokenization is competent to keep each item in one token, it's easier to replace tokens -- in just the same way that you might replace all words that occur less than some threshold with a placeholder.
Further, you might want to apply Baum-Welch to this whole business.
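The token-level replacement described in this answer could look roughly like the sketch below. Everything named here is hypothetical: the class `TokenPlaceholders`, the placeholder strings `<URL>`, `<EMAIL>` and `<UNK>`, the `minCount` threshold, and the simplified patterns. It assumes the tokenizer has already kept each URL or address in a single token.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

final class TokenPlaceholders {
    // Simplified patterns assumed for illustration.
    private static final Pattern URL =
            Pattern.compile("(?i)(?:https?://|www\\.)\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Replaces URL and e-mail tokens, and tokens seen fewer than minCount
    // times, with placeholder tokens. The input list is not modified.
    static List<String> normalize(List<String> tokens, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        List<String> result = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            if (URL.matcher(token).matches()) {
                result.add("<URL>");
            } else if (EMAIL.matcher(token).matches()) {
                result.add("<EMAIL>");
            } else if (counts.get(token) < minCount) {
                result.add("<UNK>");
            } else {
                result.add(token);
            }
        }
        return result;
    }
}
```

Because `matches()` requires the whole token to match, a token with trailing punctuation such as `x@y.org.` would not be treated as an address here; that case is left to the tokenizer.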
Upvotes: 0
Reputation: 9334
You can read the file with a BufferedReader (wrapping a FileReader, or a FileInputStream via an InputStreamReader). Go through it line by line, use a regex to find any matches for email or URL patterns on each line, and write the cleaned lines to a new output string or stream.
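The read, match, and write loop described above could be sketched as follows. The class `FileCleaner`, the method `clean`, the UTF-8 encoding, and the combined pattern are all assumptions for this example, and the pattern is intentionally simpler than the ones linked in this answer.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.regex.Pattern;

final class FileCleaner {
    // One alternation: a URL, or else an e-mail address (simplified, assumed patterns).
    private static final Pattern NOISE = Pattern.compile(
            "(?i)\\b(?:https?://|www\\.)\\S+"
            + "|[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Copies the input file to the output file line by line,
    // blanking out every match of NOISE on the way.
    static void clean(Path input, Path output) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(input, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(output, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(NOISE.matcher(line).replaceAll(""));
                writer.newLine();
            }
        }
    }
}
```

Processing one line at a time keeps memory use flat for large files, at the cost of missing any URL that is broken across two lines.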
Show us what you've tried and your current code.
As an addendum, I've used these, with varying degrees of success: http://www.regular-expressions.info/email.html http://daringfireball.net/2009/11/liberal_regex_for_matching_urls
Upvotes: 2