tirednconfused

Reputation: 1

How to use Java Scanner and regex to remove punctuation marks from input text, but not "i.e."

How do I read this sentence and parse it using Scanner to get the output below?

Input: "it is red i.e. RED. not read."

Output: it is red i.e. RED not read

I tried the code below, but it doesn't remove the periods at the ends of the words:

Scanner lineReader = new Scanner(scanner.nextLine());
lineReader.useDelimiter("\\s+(\\W*\\s)?");

Edit: let me change this requirement: how do I remove all punctuation marks from the input text, except when it's a period (.) between two letters, as in i.e.?

Upvotes: 0

Views: 4953

Answers (1)

Dunes

Reputation: 40753

"(?<!i\\.e)\\.? |\\.$" should do the trick.

In English, this regex says a delimiter is any of the following:

  • a space
  • a dot and a space (unless the dot is preceded by "i.e")
  • a dot and the end of the string.
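To see that delimiter in action, here is a minimal sketch that plugs it into a Scanner over the sample sentence from the question. The class name IeDelimiterDemo and the helper tokenize are made up for illustration; only the regex itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class IeDelimiterDemo {
    // Delimiter from the answer: a space, optionally preceded by a dot
    // (unless that dot directly follows "i.e"), or a dot at the end of the input.
    static final String DELIMITER = "(?<!i\\.e)\\.? |\\.$";

    // Hypothetical helper: splits one line into tokens using the delimiter above.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner lineReader = new Scanner(line)) {
            lineReader.useDelimiter(DELIMITER);
            while (lineReader.hasNext()) {
                tokens.add(lineReader.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [it, is, red, i.e., RED, not, read]
        System.out.println(tokenize("it is red i.e. RED. not read."));
    }
}
```

Note that this version only protects the literal abbreviation "i.e."; any other abbreviation followed by a space, such as "e.g. ", would lose its trailing dot.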

With regards to your edit, try "((?<=\\s\\w{1,10})[^\\w\\s])?\\s|[^\\w\\s]$"

[^\\w\\s] means any character that is not a word character (a letter, a digit or an underscore) and not whitespace (i.e. punctuation).

((?<=\\s\\w{1,10})[^\\w\\s])?\\s means a whitespace character that may be preceded by a punctuation mark, provided there is no other punctuation between that mark and the previous whitespace. That is, it will not match the .[space] in e.g.[space] because there is a full stop between the e and the g. The lookbehind ((?<=\\s\\w{1,10})) is required to have a maximum length, and so may not use the zero-or-more or one-or-more operators (* and +). I put an arbitrary limit of 10 because I don't know of any words or abbreviations that contain punctuation and are more than a few characters long.

Edit: I tested the new regex on the input "it is red i.e. RED. not read. e.g. 1,2, done!" and it produced:

  • it
  • is
  • red
  • i.e.
  • RED
  • not
  • read
  • e.g.
  • 1,2,
  • done
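For anyone who wants to reproduce that test, a minimal sketch along these lines should do it. The class name PunctuationDelimiterDemo and the helper tokenize are hypothetical names chosen for the example; the regex is the one given above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

class PunctuationDelimiterDemo {
    // Delimiter from the answer: whitespace, optionally preceded by one punctuation
    // character that closes a plain word of 1 to 10 word characters, or a single
    // punctuation character at the very end of the input.
    static final String DELIMITER = "((?<=\\s\\w{1,10})[^\\w\\s])?\\s|[^\\w\\s]$";

    // Hypothetical helper: splits one line into tokens using the delimiter above.
    static List<String> tokenize(String line) {
        List<String> tokens = new ArrayList<>();
        try (Scanner lineReader = new Scanner(line)) {
            lineReader.useDelimiter(DELIMITER);
            while (lineReader.hasNext()) {
                tokens.add(lineReader.next());
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        // Prints one token per line for the test sentence used in the answer.
        for (String token : tokenize("it is red i.e. RED. not read. e.g. 1,2, done!")) {
            System.out.println(token);
        }
    }
}
```

One thing to be aware of: the lookbehind requires a whitespace character before the word, so punctuation directly after the very first word of the input (for example the comma in "Hello, world") is not stripped by this pattern.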

Upvotes: 1
