Greenhorn

Reputation: 1821

Text manipulation via Java

I need to read a text file in Java and blank out all the e-mail addresses and URLs in it. This is to reduce noise in the data.

Are there any library functions in Java that do this?

Upvotes: 0

Views: 104

Answers (3)

RokL

Reputation: 2812

String.replaceAll() takes a regex and a replacement string (in your case ""). Note that String.replace() treats its first argument as literal text, not as a regex. Use regexes for e-mail addresses and URLs to accomplish this task.
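A minimal sketch of that approach, using String.replaceAll(), which interprets its first argument as a regex. The class name, method name, and both patterns are assumptions made for illustration; the patterns are deliberately simple and will miss some valid addresses and URLs.

```java
public class ReplaceAllSketch {
    // Simplified illustrative patterns (assumptions, not complete validators).
    static final String URL_REGEX = "(?i)\\b(?:https?://|www\\.)\\S+";
    static final String EMAIL_REGEX = "[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}";

    // Removes URLs first (they may contain '@'), then e-mail addresses.
    static String blankOut(String text) {
        return text.replaceAll(URL_REGEX, "").replaceAll(EMAIL_REGEX, "");
    }

    public static void main(String[] args) {
        System.out.println(blankOut("Mail bob@example.com or see http://example.com/page now"));
    }
}
```

The surrounding whitespace is left in place, so a removed item leaves a double space behind; collapse it afterwards if that matters for your data.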

Upvotes: 0

bmargulies

Reputation: 100050

Typically in an NLP system the text will be tokenized, and dealing with URLs or email addresses is just one case of reducing low-frequency tokens to placeholders to reduce data sparsity. Assuming that the tokenization is competent to keep each item in one token, it's easier to replace tokens -- in just the same way that you might replace all words that occur less than some threshold with a placeholder.

Further, you might want to apply Baum-Welch to this whole business.
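The token-replacement idea can be sketched roughly as follows, assuming the text has already been split into tokens with each URL or e-mail address kept whole. The placeholder strings, the frequency threshold, and the patterns are all hypothetical choices for illustration, not part of any particular NLP toolkit.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TokenPlaceholders {
    // Replaces URL-like and e-mail-like tokens, and any token seen fewer
    // than minCount times, with placeholder tokens.
    static List<String> replaceTokens(List<String> tokens, int minCount) {
        Map<String, Integer> counts = new HashMap<>();
        for (String t : tokens) {
            counts.merge(t, 1, Integer::sum);
        }
        List<String> out = new ArrayList<>();
        for (String t : tokens) {
            if (t.matches("(?i)(?:https?://|www\\.)\\S+")) {
                out.add("<URL>");      // hypothetical placeholder
            } else if (t.matches("[^@\\s]+@[^@\\s]+\\.[A-Za-z]{2,}")) {
                out.add("<EMAIL>");    // hypothetical placeholder
            } else if (counts.get(t) < minCount) {
                out.add("<RARE>");     // hypothetical placeholder
            } else {
                out.add(t);
            }
        }
        return out;
    }
}
```

Because String.matches() requires the whole token to match, punctuation attached to a token (for example a trailing comma) will prevent a match unless the tokenizer strips it first.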

Upvotes: 0

Kylar

Reputation: 9334

You can read the file using a FileInputStream and/or a BufferedReader. Parse each line, use a regex to find any matches for e-mail or URL patterns, and build a new output string or stream with those matches removed.
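A rough sketch of that line-by-line loop, assuming UTF-8 input and using the java.nio.file helpers to obtain the BufferedReader. The class name, method names, and patterns are assumptions for illustration; swap in whichever patterns work best for your data.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.regex.Pattern;

public class FileScrubber {
    // Simplified illustrative patterns, compiled once and reused per line.
    private static final Pattern URL =
            Pattern.compile("(?i)\\b(?:https?://|www\\.)\\S+");
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    // Blanks out URLs, then e-mail addresses, in a single line.
    static String scrubLine(String line) {
        String noUrls = URL.matcher(line).replaceAll("");
        return EMAIL.matcher(noUrls).replaceAll("");
    }

    // Reads the input file line by line and writes the scrubbed lines out.
    static void scrubFile(Path in, Path out) throws IOException {
        try (BufferedReader reader = Files.newBufferedReader(in, StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(out, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                writer.write(scrubLine(line));
                writer.newLine();
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 2) {
            System.err.println("usage: java FileScrubber <input> <output>");
            return;
        }
        scrubFile(Paths.get(args[0]), Paths.get(args[1]));
    }
}
```

Processing one line at a time keeps memory use flat for large files, at the cost of missing any URL that is wrapped across a line break.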

Show us what you've tried and your current code.

As an addendum, I've used these, with varying degrees of success:

http://www.regular-expressions.info/email.html
http://daringfireball.net/2009/11/liberal_regex_for_matching_urls

Upvotes: 2
