traggatmot

Reputation: 1463

split string in python when characters on either side of separator are not numbers

I have a large list of chemical data that contains entries like the following:

1. 2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP
2. Lead,Paints/Pigments,Zinc

I have a function that is correctly splitting the 1st entry into: ['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']

based on ', ' as a separator. For the second entry, ', ' won't work. But if I could split any string at a ',' that has only non-numeric characters on either side, I would be able to parse all entries like the second one without splitting up the chemicals in entries like the first, which have numbers separated by commas in their names (e.g. 2,4,5-TP).
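To make the problem concrete, here is a minimal sketch of what plain str.split does with the two sample entries. The variable names are hypothetical and only there for illustration.

```python
# Hypothetical variable names; the two strings are the sample entries above.
entry1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
entry2 = "Lead,Paints/Pigments,Zinc"

# ', ' works for the first entry but finds nothing to split in the second.
split1 = entry1.split(", ")
split2 = entry2.split(", ")

# A bare ',' would fix the second entry but breaks names such as '2,4-D'.
naive1 = entry1.split(",")

print(split1)
print(split2)
print(naive1)
```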

Is there an easy pythonic way to do this?

Upvotes: 1

Views: 2142

Answers (3)

Jon

Reputation: 1241

I'll explain a little, building on @eph's answer:

import re

data_list = ['2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP', 'Lead,Paints/Pigments,Zinc']
for d in data_list:
    print(re.split(r'(?<=\D),\s*|\s*,(?=\D)', d))

re.split(pattern, string) splits string at each occurrence of the regex pattern. (Please read Regex Quick Start if you are not familiar with regex.)

The (?<=\D),\s*|\s*,(?=\D) consists of two parts: (?<=\D),\s* and \s*,(?=\D). The meaning of each unit:

  • The middle | is the OR operator.
  • \D matches a single character that is not a digit.
  • \s matches a whitespace character (includes tabs and line breaks).
  • , matches character ",".
  • * attempts to match the preceding token zero or more times. Therefore, \s* means the whitespace can appear zero or more times. (see Repetition with Star and Plus)
  • (?<= ... ) and (?= ...) are the lookbehind and lookahead assertions. For example, q(?=u) matches a q that is followed by a u, without making the u part of the match.
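The lookaround behaviour described in the list above can be checked with a small sketch. The test strings "quit qatar" and "2,4-D,PCP" are made up for illustration.

```python
import re

# Lookahead: match 'q' only when the next character is 'u'; the 'u' is not consumed.
ahead = re.findall(r"q(?=u)", "quit qatar")

# Lookbehind: match ',' only when the previous character is not a digit.
# Record the position of each matching comma.
behind = [m.start() for m in re.finditer(r"(?<=\D),", "2,4-D,PCP")]

print(ahead)    # only the 'q' of 'quit' qualifies
print(behind)   # only the comma after 'D' qualifies
```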

Therefore, \s*,(?=\D) matches a , that is preceded by zero or more whitespace characters and followed by a non-digit character. Similarly, (?<=\D),\s* matches a , that is preceded by a non-digit character and followed by zero or more whitespace characters. The whole regex matches any , that satisfies either case, so a comma is left alone only when it has digits on both sides, as in 2,4,5-TP. That covers your requirement: splitting at ',' between non-numeric characters without breaking up the numbered names.
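As a rough sketch of why both alternatives are needed, the two halves can be compiled separately and tried on their own. The helper name split_entry and the test string 'PCP,2,4,5-TP' (a variant of the first entry written without spaces) are assumptions for illustration, not part of the original data.

```python
import re

# The two alternatives from the pattern above, compiled separately for illustration.
after_non_digit = re.compile(r"(?<=\D),\s*")    # comma preceded by a non-digit
before_non_digit = re.compile(r"\s*,(?=\D)")    # comma followed by a non-digit
combined = re.compile(r"(?<=\D),\s*|\s*,(?=\D)")


def split_entry(entry):
    """Hypothetical helper: split one entry with the combined pattern."""
    return combined.split(entry)


# 'PCP,2,4,5-TP' is a made-up variant of the first entry, written without spaces.
variant = "PCP,2,4,5-TP"
only_lookbehind = after_non_digit.split(variant)
only_lookahead = before_non_digit.split(variant)  # no comma here is followed by a non-digit

print(only_lookbehind)
print(only_lookahead)
print(split_entry("Lead,Paints/Pigments,Zinc"))
```

In this variant only the lookbehind half finds the comma after PCP, so dropping either half would likely miss some entries.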


Upvotes: 2

eph

Reputation: 2028

Use a regex with lookbehind/lookahead assertions:

>>> import re
>>> s = '2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP'
>>> re.split(r'(?<=\D\D),\s*|,\s*(?=\D\D)', s)
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
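For completeness, a short sketch applying the same pattern to both sample entries from the question. The names pattern, s1 and s2 are hypothetical.

```python
import re

# Same pattern as above: split at a comma with two non-digits before it
# or two non-digits after it (optionally with whitespace after the comma).
pattern = re.compile(r"(?<=\D\D),\s*|,\s*(?=\D\D)")

s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
s2 = "Lead,Paints/Pigments,Zinc"

result1 = pattern.split(s1)
result2 = pattern.split(s2)

print(result1)
print(result2)
```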

Upvotes: 1

Mayur Koshti

Reputation: 1862

>>> s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
>>> s2 = "Lead,Paints/Pigments,Zinc"
>>> import re
>>> res1 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s1)
>>> res1
['2,4-D', 'Benzo(a)pyrene', 'Dioxin', 'PCP', '2,4,5-TP']
>>> res2 = re.findall(r"\s*(.*?[A-Za-z])(?:,|$)", s2)
>>> res2
['Lead', 'Paints/Pigments', 'Zinc']
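The findall pattern above can be read piece by piece with re.VERBOSE, as in the sketch below. One caveat, as far as I can tell: it relies on every name ending in a letter. The entry 'Vitamin B12,Zinc' is a made-up example (not from the question's data) showing what happens when a name ends in a digit.

```python
import re

# Same pattern as above, spread out with comments via re.VERBOSE.
pattern = re.compile(r"""
    \s*              # skip whitespace left over after the previous comma
    (.*?[A-Za-z])    # capture as little as possible, ending on a letter
    (?:,|$)          # that letter must be followed by a comma or end of string
""", re.VERBOSE)

s1 = "2,4-D, Benzo(a)pyrene, Dioxin, PCP, 2,4,5-TP"
s2 = "Lead,Paints/Pigments,Zinc"
res1 = pattern.findall(s1)
res2 = pattern.findall(s2)

# Made-up entry: the first name ends in a digit, so its comma is not treated as a separator.
res3 = pattern.findall("Vitamin B12,Zinc")

print(res1)
print(res2)
print(res3)
```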

Upvotes: 0
