baxx

Reputation: 4755

Create a column of classifications based on the contents of a column in pandas

Given the following data

import pandas as pd
import random
random.seed(1)  # seed after importing random, otherwise this raises NameError
starts = ['a. ', 'bc. ', '']
v = pd.Series( [f"{s}foo{i}" for i,s in enumerate([
    random.choice(starts) for _ in range(5)])])

which looks like

In [284]: v
Out[284]:
0     a. foo0
1        foo1
2     a. foo2
3    bc. foo3
4     a. foo4
dtype: object

I would like to create a column which classifies v based on its prefix, which would be as follows:

original col       classification

 a. foo0      ->   type_a
    foo1      ->   neither
 a. foo2      ->   type_a
bc. foo3      ->   type_bc
 a. foo4      ->   type_a

The solution should apply to a DataFrame, for example the following:

import pandas as pd
import random
random.seed(1)  # seed after importing random, otherwise this raises NameError
starts = ['a. ', 'bc. ', '']
df = pd.DataFrame( {
    'A' : [f"{s}foo{i}" for i,s in enumerate([
        random.choice(starts) for _ in range(5)])],
    'B' : [random.randint(10,20) for _ in range(5)] })

which could be processed into:

In [292]: df
Out[292]:
          A   B  class
0   a. foo0  17  type_one
1      foo1  17  neither
2   a. foo2  17  type_one
3  bc. foo3  20  type_two
4   a. foo4  16  type_one

edit

This approach is nice:

('type_' + v.str.extract(r'^([^\.]+)\.')).fillna('neither')
# also 
v.str.extract(r'^([^\.]+)\.').radd('type_').fillna('neither')

But it depends on the values in the current data. I would like the solution to be independent of the current values, so that the labels are chosen separately from the prefixes. For example, the result might be in the form:

In [292]: df
Out[292]:
          A   B  class
0   a. foo0  17  type_one
1      foo1  17  neither
2   a. foo2  17  type_one
3  bc. foo3  20  type_two
4   a. foo4  16  type_one

Upvotes: 0

Views: 87

Answers (2)

jkr

Reputation: 19320

One option is to mask the dataframe and set values according to the mask.

Using the following data as in the original post:

# sys.version
# '3.7.6 (default, Dec 30 2019, 19:38:28) \n[Clang 11.0.0 (clang-1100.0.33.16)]'
import pandas as pd
import random
random.seed(1)

starts = ['a. ', 'bc. ', '']
df = pd.DataFrame( {
    'A' : [f"{s}foo{i}" for i,s in enumerate([random.choice(starts) for _ in range(5)])],
    'B' : [random.randint(10,20) for _ in range(5)] })

Replacements could be carried out as follows:

df.loc[df['A'].str.startswith("a."), "class"] = "type_a"
df.loc[df['A'].str.startswith("bc."), "class"] = "type_bc"
# Assign the result back: an in-place fillna on a selection may not update df
df["class"] = df["class"].fillna("neither")

A cleaner approach might be to store prefixes and corresponding "types" in a mapping, then modify the dataframe according to it:

# df is the same data as originally created
mapping = {
    "a.": "type_a",
    "bc.": "type_bc",
}
for prefix, label in mapping.items():
    mask = df["A"].str.startswith(prefix)
    df.loc[mask, "class"] = label
# Assign the result back: an in-place fillna on a selection may not update df
df["class"] = df["class"].fillna("neither")
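If the loop over masks feels clunky, the same prefix-to-label mapping can probably be applied in one pass with `numpy.select`. This is only a sketch: the helper name `classify_by_prefix` and its `default` argument are made up for illustration, and the data is typed out by hand from the example above rather than generated with `random`.

```python
import numpy as np
import pandas as pd


def classify_by_prefix(series, mapping, default="neither"):
    """Label each string by the first prefix in `mapping` it starts with (hypothetical helper)."""
    # One boolean mask per prefix, in the mapping's insertion order.
    # na=False treats missing values as "no match".
    conditions = [
        series.str.startswith(prefix, na=False).to_numpy(dtype=bool)
        for prefix in mapping
    ]
    # np.select takes, for each row, the label of the first condition that is True,
    # and falls back to `default` when none is.
    labels = np.select(conditions, list(mapping.values()), default=default)
    return pd.Series(labels, index=series.index)


df = pd.DataFrame({
    "A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"],
    "B": [17, 17, 17, 20, 16],
})
mapping = {"a.": "type_a", "bc.": "type_bc"}
df["class"] = classify_by_prefix(df["A"], mapping)
print(df)
```

Since the labels live only in `mapping`, changing them to `type_one` / `type_two` needs no change to the function. If one prefix is itself the start of another, the order of the keys decides which label wins.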

Upvotes: 2

AMC

Reputation: 2702

This works with the example you provided, although there is definitely a risk that it relies on properties which the actual data does not have. Let me know! :)

import numpy as np
import pandas as pd

df_data = {'A': {0: 'a. foo0', 1: 'foo1', 2: 'a. foo2', 3: 'bc. foo3', 4: 'a. foo4'},
           'B': {0: 17, 1: 17, 2: 17, 3: 20, 4: 16}}
df = pd.DataFrame(data=df_data)

print(df)

# np.nan rather than np.NaN: the capitalized alias was removed in NumPy 2.0
type_map = {'a.': 'type_one', 'bc.': 'type_two', np.nan: 'type_neither'}
df['A_type'] = df['A'].str.extract(r"^(\S+\.)\s", expand=False).map(type_map)

print(df)

Output:

          A   B
0   a. foo0  17
1      foo1  17
2   a. foo2  17
3  bc. foo3  20
4   a. foo4  16
          A   B        A_type
0   a. foo0  17      type_one
1      foo1  17  type_neither
2   a. foo2  17      type_one
3  bc. foo3  20      type_two
4   a. foo4  16      type_one
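One way to reduce the reliance on the shape of the data might be to build the regex from the keys of the mapping instead of matching "anything up to a dot", and to fill the unmatched rows with `fillna` rather than a NaN key. The following is a sketch under those assumptions; the names `alternatives` and `pattern` are mine, and sorting the keys longest-first is a guess at what you would want if one prefix happened to start with another.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"],
    "B": [17, 17, 17, 20, 16],
})

type_map = {"a.": "type_one", "bc.": "type_two"}

# Escape each prefix so the dot is matched literally, and try longer prefixes first.
alternatives = "|".join(
    re.escape(prefix) for prefix in sorted(type_map, key=len, reverse=True)
)
pattern = rf"^({alternatives})"

df["A_type"] = (
    df["A"]
    .str.extract(pattern, expand=False)  # the matched prefix, or NaN when none matches
    .map(type_map)                       # prefix -> label
    .fillna("type_neither")              # rows without a known prefix
)
print(df)
```

Only the prefixes listed in `type_map` are recognised here, so a row such as `x. foo` would come out as `type_neither` instead of silently becoming NaN.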

Upvotes: 1
