Reputation: 83
I am trying to remove only the punctuation from my text data while keeping the accented letters. I do not want to replace the accented letters with their unaccented English equivalents. I cannot figure out how to adapt my existing code to allow for non-ASCII characters.
while (input.hasNext()) {
    String phrase = input.nextLine();
    String[] words = phrase.split(" ");
    for (String word : words) {
        String strippedInput = word.replaceAll("[^0-9a-zA-Z\\s]", "");
    }
}
If original input is: O sal, ou o sódio, também é contraindicado em pacientes hipotensos?
Expected output should be: O sal ou o sódio também é contraindicado em pacientes hipotensos
Any ideas? Thanks!
Upvotes: 2
Views: 1241
Reputation: 62015
Consider using Unicode categories: "A-Z" is very English-centric and, as you discovered, doesn't cope with accented letters.
For example, the following would replace everything, including punctuation, except "any letter, any language" (\p{L}) or "whitespace" (\s). If it is desired to keep digits, add them back in as additional exclusions.
replaceAll("[^\\p{L}\\s]", "")
Here is an ideone demo.
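In case the demo link is unavailable, here is a minimal self-contained sketch of the same idea. The class and method names are made up for illustration. It also adds two things that go beyond the one-liner above and are my assumptions about what you may want: \p{M} (combining marks), so that accents stored as separate code points in decomposed text survive, and a second variant that keeps decimal digits via \p{Nd}.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class UnicodeStripDemo {
    // Matches anything that is NOT a letter (\p{L}), a combining mark (\p{M}), or whitespace.
    private static final Pattern NOT_LETTER =
            Pattern.compile("[^\\p{L}\\p{M}\\s]");

    // Same as above, but decimal digits (\p{Nd}) are kept as well.
    private static final Pattern NOT_LETTER_OR_DIGIT =
            Pattern.compile("[^\\p{L}\\p{M}\\p{Nd}\\s]");

    static String keepLetters(String text) {
        return NOT_LETTER.matcher(text).replaceAll("");
    }

    static String keepLettersAndDigits(String text) {
        return NOT_LETTER_OR_DIGIT.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "O sal, ou o sódio, também é contraindicado em pacientes hipotensos?";
        // Prints: O sal ou o sódio também é contraindicado em pacientes hipotensos
        System.out.println(keepLetters(text));
        System.out.println(keepLettersAndDigits("Tomar 2 comprimidos, 3x ao dia!"));
    }
}
```

Precompiling the Pattern is optional; it just avoids recompiling the regex on every line when processing a large file.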
Upvotes: 5
Reputation: 1175
Try this.
public class PunctuationRemove {
    // Characters to drop; extend this list as needed.
    static char[] punc = "',.;!?(){}[]<>%".toCharArray();

    public static void main(String[] args) {
        String s = "Hello!, how are you?";
        System.out.println(removePunctuation(s));
    }

    public static String removePunctuation(String s) {
        // Use a local builder: a shared static one would keep text from earlier calls.
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < s.length(); i++) {
            char strChar = s.charAt(i);
            boolean keep = true;
            for (char badChar : punc) {
                if (badChar == strChar) {
                    keep = false;
                    break;
                }
            }
            if (keep) {
                sb.append(strChar);
            }
        }
        return sb.toString();
    }
}
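If you would rather not maintain the list of punctuation characters by hand, a variation on this loop is to ask the Character class what each code point is and keep only letters, digits, and whitespace. This is a different technique from the deny-list above (an allow-list instead), and the class and method names are invented for the sketch.

```java
// Hypothetical demo class; the names are illustrative only.
class CharFilterDemo {
    static String keepLettersDigitsAndSpaces(String s) {
        StringBuilder out = new StringBuilder(s.length());
        // Walk by code point so characters outside the BMP are not split in half.
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            if (Character.isLetterOrDigit(cp) || Character.isWhitespace(cp)) {
                out.appendCodePoint(cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Prints: Hello how are you
        System.out.println(keepLettersDigitsAndSpaces("Hello!, how are you?"));
        System.out.println(keepLettersDigitsAndSpaces("O sal, ou o sódio, também é contraindicado?"));
    }
}
```

Character.isLetterOrDigit is Unicode-aware, so precomposed accented letters such as ó and é pass the test without any special handling.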
Upvotes: 2
Reputation: 5946
Replace a-zA-Z in the regex string with \p{L}, which matches any kind of letter from any language:
while (input.hasNext()) {
    String phrase = input.nextLine();
    String[] words = phrase.split(" ");
    for (String word : words) {
        String strippedInput = word.replaceAll("[^0-9\\p{L}\\s]", "");
    }
}
Upvotes: 4
Reputation: 347332
Maybe I'm missing the point, but something like...
String text = "O sal, ou o sódio, também é contraindicado em pacientes hipotensos?";
System.out.println(text);
System.out.println(text.replaceAll("[\\?,.:!\\(\\){}\\[\\]<>%]", ""));
Outputs
O sal, ou o sódio, também é contraindicado em pacientes hipotensos?
O sal ou o sódio também é contraindicado em pacientes hipotensos
Or, based on your example...
while (input.hasNext()) {
    String phrase = input.nextLine();
    String[] words = phrase.split(" ");
    for (String word : words) {
        String strippedInput = word.replaceAll("[\\?,.:!\\(\\){}\\[\\]<>%]", "");
    }
}
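A hand-written list like this is easy to leave incomplete (it has no semicolon, quotes, or ¿, for instance). One possible generalization, sketched below with made-up class and method names, is the Unicode punctuation category \p{P}. Note that < and > are classified as math symbols rather than punctuation, so they are added to the character class explicitly, and that \p{P} also removes hyphens and apostrophes, which may or may not be what you want.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the names are illustrative only.
class PunctuationCategoryDemo {
    // \p{P} matches punctuation from any script; < and > are symbols, so list them by hand.
    private static final Pattern PUNCTUATION = Pattern.compile("[\\p{P}<>]");

    static String stripPunctuation(String text) {
        return PUNCTUATION.matcher(text).replaceAll("");
    }

    public static void main(String[] args) {
        String text = "O sal, ou o sódio, também é contraindicado em pacientes hipotensos?";
        // Prints: O sal ou o sódio também é contraindicado em pacientes hipotensos
        System.out.println(stripPunctuation(text));
        System.out.println(stripPunctuation("¿Qué tal? <50%>; «bien»"));
    }
}
```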
Upvotes: 1
Reputation: 16730
It may be inefficient, and I'm sure the idea can be improved upon, but you could create a method that loops through the string, building a buffer of each character that is not punctuation.
private String replacePunctuation(String s) {
    StringBuilder output = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
        char c = s.charAt(i);
        // Add other punctuation values you're concerned about. Perhaps the Regex class would be useful here, but I am not as familiar with it as I would like.
        if (c != '.' && c != ',' && c != '!') {
            output.append(c);
        }
    }
    return output.toString();
}
Again, probably not the cleanest or most efficient, but it's the best I can come up with at the moment.
Upvotes: 0