RatDon

Reputation: 3543

python split string by multiple delimiters and/or combination of multiple delimiters

Input:

x = "121, 1238,\nxyz,\n 123abc \n\rabc123"

I want to split this string on the delimiters ",", "\n", "\r", and "\s" to get the output

['121', '1238', 'xyz', '123abc', 'abc123']

Whatever I try, each delimiter is matched as a single character and not as a combination of characters. For example:

1. re.split("\n|,|\s|\r", x)

Gave output of

['121', '', '1238', '', 'xyz', '', '', '123abc', '', '', 'abc123']

2. re.split("\n\s|,|\s|\r", x)

Gave output of

['121', '', '1238', '', 'xyz', '', '123abc', '', 'abc123']

The second one is a slight improvement over the first. But if that is the way to go, I would have to list every possible combination manually, something like this (with more combinations):

re.split("\n\s|\s\n|\s\n\s|\n|,\s|\s,|\s,\s|,|\s|\r", x)

Output:

['121', '1238', 'xyz', '', '123abc', '', 'abc123']

Is there any better way to do this?

Upvotes: 3

Views: 11833

Answers (2)

Roman Pavelka

Reputation: 4191

Let re.split treat one or more consecutive delimiter characters as a single delimiter:

>>> import re
>>> re.split(r"[,\s]+", x)
['121', '1238', 'xyz', '123abc', 'abc123']

(The '*', '+', and '?' quantifiers are all greedy; they match as much text as they can.)
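To see what the '+' contributes, here is a small sketch that runs the same character class with and without it on the string from the question. The variable names are only illustrative.

```python
import re

x = "121, 1238,\nxyz,\n 123abc \n\rabc123"

# Without '+', every single delimiter character is its own split point,
# so adjacent delimiters leave empty strings between them.
single_char = re.split(r"[,\s]", x)
print(single_char)
# ['121', '', '1238', '', 'xyz', '', '', '123abc', '', '', 'abc123']

# With '+', a whole run of delimiter characters is consumed as one split point.
runs = re.split(r"[,\s]+", x)
print(runs)
# ['121', '1238', 'xyz', '123abc', 'abc123']
```

The first result is the same as attempt 1 in the question, which suggests the missing piece there was the repetition, not the list of delimiters.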

Upvotes: 1

RatDon

Reputation: 3543

Combining @Johnny Mopp's and @alfinkel24's comments:

re.split(r"[\s,]+", x)

This splits the string as required into

['121', '1238', 'xyz', '123abc', 'abc123']
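One caveat, as far as I can tell: if the string starts or ends with a delimiter, re.split returns an empty string at that end. The sketch below uses a made-up input `y` (not from the question) and shows two possible ways around it: filtering the result, or matching the tokens with re.findall instead of splitting on the delimiters.

```python
import re

# Hypothetical input with delimiters at both ends (not from the question).
y = ", 121, 1238 \n"

parts = re.split(r"[\s,]+", y)
print(parts)    # ['', '121', '1238', '']

# Option 1: drop the empty strings after splitting.
cleaned = [p for p in parts if p]
print(cleaned)  # ['121', '1238']

# Option 2: match runs of non-delimiter characters instead of splitting.
tokens = re.findall(r"[^\s,]+", y)
print(tokens)   # ['121', '1238']
```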

Explanation:

  • [...] matches any one of the characters inside the brackets.
  • + matches one or more repetitions of the preceding element (here, the whole character class).
  • \s matches any whitespace character, including "\n", "\r" and "\t".

    Official documentation:

\s
For Unicode (str) patterns: Matches Unicode whitespace characters (which includes [ \t\n\r\f\v], and also many other characters, for example the non-breaking spaces mandated by typography rules in many languages). If the ASCII flag is used, only [ \t\n\r\f\v] is matched.
For 8-bit (bytes) patterns: Matches characters considered whitespace in the ASCII character set; this is equivalent to [ \t\n\r\f\v].
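A short sketch of the difference the documentation describes, using a made-up string `s` that contains a non-breaking space (U+00A0). By default \s treats it as whitespace in a str pattern; with re.ASCII it does not, so it stays attached to the token.

```python
import re

# Hypothetical input: "\u00a0" is a non-breaking space, which counts as
# Unicode whitespace but is not in the ASCII set [ \t\n\r\f\v].
s = "121,\u00a01238 xyz"

unicode_split = re.split(r"[\s,]+", s)
print(unicode_split)  # ['121', '1238', 'xyz']

ascii_split = re.split(r"[\s,]+", s, flags=re.ASCII)
print(ascii_split)    # ['121', '\xa01238', 'xyz']
```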

Upvotes: 3
