Create a column of classifications based on the contents of a column in pandas

Question

Given the following data

import random
import pandas as pd
random.seed(1)
starts = ['a. ', 'bc. ', '']
v = pd.Series( [f"{s}foo{i}" for i,s in enumerate([
    random.choice(starts) for _ in range(5)])])

which looks like

In [284]: v
Out[284]:
0     a. foo0
1        foo1
2     a. foo2
3    bc. foo3
4     a. foo4
dtype: object

I would like to create a column that classifies each value of v by its prefix, as follows:

original col       classification

 a. foo0      ->   type_a
    foo1      ->   neither
 a. foo2      ->   type_a
bc. foo3      ->   type_bc
 a. foo4      ->   type_a

The solution should apply to a dataframe, for example the following:

import random
import pandas as pd
random.seed(1)
starts = ['a. ', 'bc. ', '']
df = pd.DataFrame( {
    'A' : [f"{s}foo{i}" for i,s in enumerate([
        random.choice(starts) for _ in range(5)])],
    'B' : [random.randint(10,20) for _ in range(5)] })

It could be processed as:

In [292]: df
Out[292]:
          A   B  class
0   a. foo0  17  type_one
1      foo1  17  neither
2   a. foo2  17  type_one
3  bc. foo3  20  type_two
4   a. foo4  16  type_one

edit

This approach is nice:

('type_' + v.str.extract(r'^([^\.]+)\.')).fillna('neither')
# also 
v.str.extract(r'^([^\.]+)\.').radd('type_').fillna('neither')

But it depends on the values in the current data. I would like the solution to be independent of the current values; for example, the result might take the form

In [292]: df
Out[292]:
          A   B  class
0   a. foo0  17  type_one
1      foo1  17  neither
2   a. foo2  17  type_one
3  bc. foo3  20  type_two
4   a. foo4  16  type_one

jkr · Accepted Answer

One option is to mask the dataframe and set values according to the mask.

Using the following data as in the original post:

# sys.version
# '3.7.6 (default, Dec 30 2019, 19:38:28)
# [Clang 11.0.0 (clang-1100.0.33.16)]'
import pandas as pd
import random
random.seed(1)

starts = ['a. ', 'bc. ', '']
df = pd.DataFrame( {
    'A' : [f"{s}foo{i}" for i,s in enumerate([random.choice(starts) for _ in range(5)])],
    'B' : [random.randint(10,20) for _ in range(5)] })

Replacements could be carried out as follows:

df.loc[df['A'].str.startswith("a."), "class"] = "type_a"
df.loc[df['A'].str.startswith("bc."), "class"] = "type_bc"
# assign the result back; in-place fillna on a selection may not update df in recent pandas
df["class"] = df["class"].fillna("neither")
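
To make the masking step concrete, here is a minimal sketch. It assumes a hand-typed copy of column A as printed in the question, rather than regenerating it with random. It shows the boolean mask and that rows the mask does not select are left as NaN in the new column, which is why the final fill with "neither" is needed:

```python
import pandas as pd

# Hand-typed copy of column A as shown in the question (an assumption;
# the original builds it with random.choice)
df = pd.DataFrame({"A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"]})

# Boolean Series with one entry per row
mask = df["A"].str.startswith("a.")

# Assigning through the mask creates "class"; unmatched rows stay missing
df.loc[mask, "class"] = "type_a"

print(mask.tolist())                # [True, False, True, False, True]
print(df["class"].isna().tolist())  # [False, True, False, True, False]
```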

A cleaner approach might be to store prefixes and corresponding "types" in a mapping, then modify the dataframe according to it:

# df is the same data as originally created
mapping = {
    "a.": "type_a",
    "bc.": "type_bc",
}
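for prefix, label in mapping.items():
    # rows whose value in A starts with this prefix
    mask = df["A"].str.startswith(prefix)
    df.loc[mask, "class"] = label
# rows that matched no prefix are still NaN; assign the filled column back
df["class"] = df["class"].fillna("neither")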
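
As a loop-free variation on the same idea (not part of the answer above), numpy.select can build the whole column in one call from the same kind of mapping. This is a sketch under two assumptions: the frame is a hand-typed copy of the one printed in the question, and the labels type_one and type_two are the arbitrary names from the question's desired output:

```python
import numpy as np
import pandas as pd

# Hand-typed copy of the example frame from the question (an assumption)
df = pd.DataFrame({
    "A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"],
    "B": [17, 17, 17, 20, 16],
})

# Prefix -> label; the labels are arbitrary and do not depend on the data
mapping = {"a.": "type_one", "bc.": "type_two"}

# One boolean mask per prefix, in the same order as the labels
conditions = [df["A"].str.startswith(prefix) for prefix in mapping]
choices = list(mapping.values())

# Rows matching no condition receive the default label
df["class"] = np.select(conditions, choices, default="neither")

print(df)
```

If a value could match more than one prefix, numpy.select uses the first matching condition, so the order of the mapping matters in that case.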
