Reputation: 9888
I just used iTextSharp to extract all the text from a PDF, and now I need to split that text into words. I used to use the Acrobat library, which automatically divided the text into words (using getPageNthWord()).
I don't know which criteria it used, but now I need to split the text into words myself. The text will be in different languages, so I need to split on every possible separator character.
I saw the method Char.IsSeparator(), but using it means looping over every char, which would be inefficient.
What I've got so far is a manually specified list of separators to pass to .Split():
separators = " .,;:-(){}[]/\'""?¿!¡" & Convert.ToChar(9) & NewLine()
Is there some place to retrieve the common separator chars?
Upvotes: 2
Views: 1943
Reputation: 7475
You can use the String.Split method with a null (Nothing in VB) separator parameter:
If the separator parameter is null or contains no characters, white-space characters are assumed to be the delimiters. White-space characters are defined by the Unicode standard and return true if they are passed to the Char.IsWhiteSpace method.
Or you can follow the MSDN sample and collect all the characters for which Char.IsSeparator() returns true.
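Both options could look roughly like the sketch below. It is only an illustration: the module name and sample text are made up, and adding Char.IsPunctuation to the filter is an assumption based on the punctuation in the question's own separator list (Char.IsSeparator alone covers only space, line and paragraph separators). The delimiter array is built once by scanning the Char range, so the per-character check is not repeated for every document.

```vb
Imports System
Imports System.Linq

Module WordSplitDemo ' hypothetical name, for illustration only
    Sub Main()
        ' Made-up sample text mixing languages and punctuation.
        Dim text As String = "Hola, ¿qué tal? Fine (thanks)."

        ' Option 1: a null separator array makes Split use white-space characters.
        ' The cast picks the Char() overload explicitly.
        Dim byWhiteSpace As String() =
            text.Split(DirectCast(Nothing, Char()), StringSplitOptions.RemoveEmptyEntries)

        ' Option 2: build the delimiter set once from the Unicode categories.
        ' Including IsPunctuation is an assumption, not part of the MSDN sample.
        Dim separators As Char() =
            Enumerable.Range(0, &H10000).
                Select(Function(i) Convert.ToChar(i)).
                Where(Function(c) Char.IsSeparator(c) OrElse
                                  Char.IsWhiteSpace(c) OrElse
                                  Char.IsPunctuation(c)).
                ToArray()

        Dim words As String() =
            text.Split(separators, StringSplitOptions.RemoveEmptyEntries)

        Console.WriteLine(String.Join("|", byWhiteSpace))
        Console.WriteLine(String.Join("|", words))
    End Sub
End Module
```

Note that treating all punctuation as a delimiter also splits on apostrophes and hyphens inside words, just as the manual list in the question does; drop Char.IsPunctuation from the filter if that is not wanted.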
Upvotes: 2