Reputation: 47
The following is an example of an Excel worksheet I am working with:
Product_Type | Part_Number | Description | Size |
---|---|---|---|
Rod | R-SS-015 | ||
Rod | R-SS-030 | ||
Rod | R-SS-045 | ||
Rod | R-SS-060 | ||
Rod | R-SS-075 | ||
Rod | R-SS-090 | ||
Nut | N-150 | Stainless Steel 1" Nut | |
Nut | N-151 | Stainless Steel 2" Nut | |
Nut | N-152 | Stainless Steel 3" Nut | |
Washer | W-101 | Stainless Steel 1" Washer | |
Washer | W-102 | Stainless Steel 2" Washer | |
Washer | W-103 | Stainless Steel 3" Washer |
What I am trying to do is get the size of the "Rod" from the Part_Number and the size of the "Nut/Washer" from the description. In Excel I would use a combination of IFS, Search, and Left/Mid/Right functions.
I am able to find the size of the product types separately but am having trouble connecting to the Product Type.
As in:
"If Product_Type is equal to Rod then search the Part_Number and output to Size"
or
"If Product_Type is equal to Nut or Washer then search the Description and output to Size"
For the Rod I am using the following code:
conditions = [
df.loc[df['Part_Number'].str.contains('-015'), 'Size'] == '1"',
df.loc[df['Part_Number'].str.contains('-030'), 'Size'] == '2"',
df.loc[df['Part_Number'].str.contains('-045'), 'Size'] == '3"',
df.loc[df['Part_Number'].str.contains('-060'), 'Size'] == '4"',
df.loc[df['Part_Number'].str.contains('-075'), 'Size'] == '5"',
df.loc[df['Part_Number'].str.contains('-090'), 'Size'] == '6"',
]
df['Size'] = np.select(df[Product_Type] == "Rod", conditions, '')
For Nut/Washer I am using the following code:
df_2 = df['Descritpion'].str.extract(r'(?:\s([\d+]?\.?[\d+])")', expand=False)
df['Size'] = np.where(df['Product_Type'] == "Nut" or df['Product_Type'] == "Washer", df_2, '')
For both of these codes I am getting the following error:
ValueError: list of cases must be same length as list of conditions
This is because there is only one value (Rod) and multiple conditions (the sizes) correct?
How can I combine both str.extract and str.contains?
Upvotes: 0
Views: 459
Reputation: 25374
Here are three approaches which will produce the same result.
This will work with a slight change to your np.select()
example. The key thing to do is to make sure both your conditions and replacements are lists of the same length.
conditions = [
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-015')),
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-030')),
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-045')),
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-060')),
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-075')),
(df['Product_Type'] == 'Rod') & (df['Part_Number'].str.contains('-090')),
df['Product_Type'] == 'Nut',
df['Product_Type'] == 'Washer'
]
replacements = [
'1',
'2',
'3',
'4',
'5',
'6',
df['Description'].str.extract(r'\s(\d+)"', expand=False),
df['Description'].str.extract(r'\s(\d+)"', expand=False),
]
df['Size'] = np.select(conditions, replacements, default=None).astype(float)
As a shorter and more flexible alternative, there's also .apply(..., axis=1)
as a way to apply a function to each row.
Here I use if statements to decide which kind of product it is, then use a dictionary to convert the part number ending to the size you want.
def get_size(row):
if row['Product_Type'] == 'Rod':
rod_size = re.search(r'-(\d+)$', row['Part_Number']).group(1)
rod_convert = {
'015': 1,
'030': 2,
'045': 3,
'060': 4,
'075': 5,
'090': 6,
}
return rod_convert[rod_size]
elif row['Product_Type'] == 'Nut' or row['Product_Type'] == 'Washer':
return re.search(r'(\d+)"', row['Description']).group(1)
else:
return None
df['Size'] = df.apply(get_size, axis=1).astype(float)
However, that flexibility comes at a speed penalty.
This is similar to the previous approach, but where the previous solution runs a function on each row of the dataframe, this approach splits the dataframe up into three sections, and runs the function on each dataframe.
After splitting into three dataframes, I check which split this is (using df.name
) and extract the size with a regex. In the case of Rods, I use map() to substitute the part number ending for the size you want.
def get_size(df):
if df.name == 'Rod':
size = df['Part_Number'].str.extract(r'-(\d+)$', expand=False)
df['Size'] = size.map({
'015': 1,
'030': 2,
'045': 3,
'060': 4,
'075': 5,
'090': 6,
})
return df
elif df.name == 'Nut' or df.name == 'Washer':
df['Size'] = df['Description'].str.extract(r'(\d+)"', expand=False).astype(float)
return df
df = df.groupby('Product_Type').apply(get_size)
All three produce the same result, which is:
Product_Type Part_Number Description Size
0 Rod R-SS-015 NaN 1.0
1 Rod R-SS-030 NaN 2.0
2 Rod R-SS-045 NaN 3.0
3 Rod R-SS-060 NaN 4.0
4 Rod R-SS-075 NaN 5.0
5 Rod R-SS-090 NaN 6.0
6 Nut N-150 Stainless Steel 1" Nut 1.0
7 Nut N-151 Stainless Steel 2" Nut 2.0
8 Nut N-152 Stainless Steel 3" Nut 3.0
9 Washer W-101 Stainless Steel 1" Washer 1.0
10 Washer W-102 Stainless Steel 2" Washer 2.0
11 Washer W-103 Stainless Steel 3" Washer 3.0
Upvotes: 1