String extraction in Python / Pandas with repeated delimiter

Question

I have a data frame with a column that includes any combination of one or many variables, separated by a '/' delimiter, e.g.:

Rd/MLERS
Rd
Rd          
Rd/DLEPC/DLERS
SLERS
MLERS

And so on. I want to extract the primary classifier, i.e. the only variable, or the first variable preceding the first '/' character. I don't have much experience with str.extract, and my attempt -

df["primaryEjecta1"] = df["MORPHOLOGY_EJECTA_1"].str.extract('(.*)/', expand=True)

does not work as anticipated -

Rd
NaN
NaN
Rd/DLEPC
NaN
NaN

Specifically -

Where there is only one variable, I am inadvertently converting this to NaN;
Where there are three (or more) variables, I am extracting the first two (or more), rather than only the first.

I'm sure this is simple to fix if you know how, but most of the examples and tutorials I have been able to find online assume nice, neat delimiters that are not repeated, so I'd appreciate any help you can offer.

MaxU - stand with Ukraine · Accepted Answer

You can use the str.extract() method with a negated character class:

In [31]: df
Out[31]:
              txt
0        Rd/MLERS
1              Rd
2              Rd
3  Rd/DLEPC/DLERS
4           SLERS
5           MLERS

In [32]: df['clsfr'] = df['txt'].str.extract(r'([^\/]+)', expand=True)

In [33]: df
Out[33]:
              txt  clsfr
0        Rd/MLERS     Rd
1              Rd     Rd
2              Rd     Rd
3  Rd/DLEPC/DLERS     Rd
4           SLERS  SLERS
5           MLERS  MLERS

Explanation:

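The regex ([^\/]+) captures one or more characters that are not / into the first group, so the match stops just before the first / (or takes the whole string when there is no /). The backslash before / is optional in Python regexes. The pattern in the question, (.*)/, fails for two reasons: it needs a / to match at all, which gives NaN for single values, and .* is greedy, so it runs up to the last / rather than the first.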
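
For anyone who wants to try this outside an interactive session, here is a minimal self-contained sketch. It rebuilds the sample data from the question, reproduces the behaviour of the original pattern, and applies the pattern from this answer. The column names `txt` and `clsfr` follow the session above; `greedy` and `clsfr_split` are names made up for this illustration, and the `str.split` line is an alternative that is not part of the answer itself.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame(
    {"txt": ["Rd/MLERS", "Rd", "Rd", "Rd/DLEPC/DLERS", "SLERS", "MLERS"]}
)

# Pattern from the question: needs a '/' to match, and '.*' is greedy,
# so it captures everything up to the LAST '/'.
greedy = df["txt"].str.extract(r"(.*)/", expand=False)

# Pattern from the answer: the leading run of characters that are not '/'.
# expand=False returns a Series instead of a one-column DataFrame.
df["clsfr"] = df["txt"].str.extract(r"([^/]+)", expand=False)

# Alternative without a regex (an assumption, not from the answer):
# split once on the first '/' and keep the first piece.
df["clsfr_split"] = df["txt"].str.split("/", n=1).str[0]

print(df)
```

Both new columns should agree on this data. The regex version is the more flexible of the two if the delimiter rules later become more complicated; the split version may be easier to read when the delimiter is a single fixed character.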
