Reputation: 4755
Given the following data
import pandas as pd
import random
random.seed(1)  # seed after importing random, for reproducible sample data
starts = ['a. ', 'bc. ', '']
v = pd.Series([f"{s}foo{i}" for i, s in enumerate(
    [random.choice(starts) for _ in range(5)])])
which looks like
In [284]: v
Out[284]:
0 a. foo0
1 foo1
2 a. foo2
3 bc. foo3
4 a. foo4
dtype: object
I would like to create a column that classifies each value of v
by its prefix,
as follows:
original col classification
a. foo0 -> type_a
foo1 -> neither
a. foo2 -> type_a
bc. foo3 -> type_bc
a. foo4 -> type_a
The solution should also apply to a DataFrame, for example:
import pandas as pd
import random
random.seed(1)  # seed after importing random, for reproducible sample data
starts = ['a. ', 'bc. ', '']
df = pd.DataFrame({
    'A': [f"{s}foo{i}" for i, s in enumerate(
        [random.choice(starts) for _ in range(5)])],
    'B': [random.randint(10, 20) for _ in range(5)]})
This could be processed into:
In [292]: df
Out[292]:
A B class
0 a. foo0 17 type_one
1 foo1 17 neither
2 a. foo2 17 type_one
3 bc. foo3 20 type_two
4 a. foo4 16 type_one
This approach works nicely:
('type_' + v.str.extract(r'^([^\.]+)\.')).fillna('neither')
# also
v.str.extract(r'^([^\.]+)\.').radd('type_').fillna('neither')
But it depends on the values in the current data. I would like the solution to be independent of the current values; for example, the result might take the form
In [292]: df
Out[292]:
A B class
0 a. foo0 17 type_one
1 foo1 17 neither
2 a. foo2 17 type_one
3 bc. foo3 20 type_two
4 a. foo4 16 type_one
Upvotes: 0
Views: 87
Reputation: 19320
One option is to mask the dataframe and set values according to the mask.
Using the following data as in the original post:
# sys.version
# '3.7.6 (default, Dec 30 2019, 19:38:28) \n[Clang 11.0.0 (clang-1100.0.33.16)]'
import pandas as pd
import random
random.seed(1)
starts = ['a. ', 'bc. ', '']
df = pd.DataFrame( {
'A' : [f"{s}foo{i}" for i,s in enumerate([random.choice(starts) for _ in range(5)])],
'B' : [random.randint(10,20) for _ in range(5)] })
Replacements could be carried out as follows:
df.loc[df['A'].str.startswith("a."), "class"] = "type_a"
df.loc[df['A'].str.startswith("bc."), "class"] = "type_bc"
# Assign the result back; an in-place fillna on a selection may not update df
df["class"] = df["class"].fillna("neither")
A cleaner approach might be to store prefixes and corresponding "types" in a mapping, then modify the dataframe according to it:
# df is the same data as originally created
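mapping = {
    "a.": "type_a",
    "bc.": "type_bc",
}
for k, v in mapping.items():
    # rows whose value in A starts with the current prefix
    mask = df["A"].str.startswith(k)
    df.loc[mask, "class"] = v
# Assign the result back; an in-place fillna on a selection may not update df
df["class"] = df["class"].fillna("neither")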
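The mapping idea above can also be written without an explicit loop. The following is only a sketch, assuming NumPy is available; the labels type_one and type_two are taken from the question's desired output and are otherwise arbitrary, and the sample values are written out literally instead of being generated with random. np.select picks the label of the first matching condition for each row and falls back to a default.

```python
import numpy as np
import pandas as pd

# Sample values matching the question's frame, written out literally.
df = pd.DataFrame({
    "A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"],
    "B": [17, 17, 17, 20, 16],
})

# Prefix -> label; the labels are chosen by hand, not derived from the data.
mapping = {"a.": "type_one", "bc.": "type_two"}

# One boolean mask per prefix, in the same order as the labels.
conditions = [df["A"].str.startswith(prefix) for prefix in mapping]
labels = list(mapping.values())

# First matching condition wins; rows matching nothing get the default.
df["class"] = np.select(conditions, labels, default="neither")
print(df)
```

Because the first matching condition wins, the order of the mapping matters if one prefix is itself the start of another.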
Upvotes: 2
Reputation: 2702
This works with the example you provided, although there is definitely a risk that it relies on properties which the actual data does not have. Let me know! :)
import numpy as np
import pandas as pd
df_data = {'A': {0: 'a. foo0', 1: 'foo1', 2: 'a. foo2', 3: 'bc. foo3', 4: 'a. foo4'},
'B': {0: 17, 1: 17, 2: 17, 3: 20, 4: 16}}
df = pd.DataFrame(data=df_data)
print(df)
# np.nan instead of np.NaN: the np.NaN alias was removed in NumPy 2.0
type_map = {'a.': 'type_one', 'bc.': 'type_two', np.nan: 'type_neither'}
df['A_type'] = df['A'].str.extract(r"^(\S+\.)\s", expand=False).map(type_map)
print(df)
Output:
A B
0 a. foo0 17
1 foo1 17
2 a. foo2 17
3 bc. foo3 20
4 a. foo4 16
A B A_type
0 a. foo0 17 type_one
1 foo1 17 type_neither
2 a. foo2 17 type_one
3 bc. foo3 20 type_two
4 a. foo4 16 type_one
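If the regex should not rely on the shape of the data (for example, on whitespace following the prefix), one possible variation is to build the pattern from the mapping keys themselves. This is a sketch under that assumption, not a tested drop-in for real data; the name known_prefixes is introduced here for illustration only.

```python
import re

import pandas as pd

df = pd.DataFrame({
    "A": ["a. foo0", "foo1", "a. foo2", "bc. foo3", "a. foo4"],
    "B": [17, 17, 17, 20, 16],
})

type_map = {"a.": "type_one", "bc.": "type_two"}

# Escape each prefix so that "." is matched literally, and try longer
# prefixes first so a longer key wins over a shorter one it starts with.
known_prefixes = sorted(type_map, key=len, reverse=True)
pattern = "^(" + "|".join(re.escape(p) for p in known_prefixes) + ")"

# Extract the matched prefix (NaN when none matches), map it to its label,
# and label everything that did not match as 'type_neither'.
df["A_type"] = (
    df["A"].str.extract(pattern, expand=False)
    .map(type_map)
    .fillna("type_neither")
)
print(df)
```

Only prefixes listed in type_map are recognised, so values with an unlisted prefix are classified as type_neither.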
Upvotes: 1