juliano.net
juliano.net

Reputation: 8177

Create a new column with the value found in another DataFrame

I have two DataFrames:

I need to create a new column in the dataset with the ID of the component according to the df_components.

I tried to do this way:

Creating the df_components and the ID column based on the index

components = dataset["COMPDESC"].unique()
df_components = pd.DataFrame(components, columns=['DESCRIPTION'])
df_components.sort_values(by='DESCRIPTION', ascending=True, inplace=True)
df_components.reset_index(drop=True, inplace=True)
df_components.index += 1
df_components['ID'] = df_components.index

Sample output:

                                           DESCRIPTION   ID
1                                             AIR BAGS    1
2                                     AIR BAGS:FRONTAL    2
3               AIR BAGS:FRONTAL:SENSOR/CONTROL MODULE    3
4                                 AIR BAGS:SIDE/WINDOW    4

Create the COMP_ID in the dataset:

def create_component_id_column(row):
    found = df_components[df_components['DESCRIPTION'] == row['COMPDESC']]
    return found.ID if len(found.index) > 0 else None

dataset['COMP_ID'] = dataset.apply(lambda row: create_component_id_column(row), axis=1)

However this gives me the error ValueError: Wrong number of items passed 248, placement implies 1. Being 248 the number of items on df_components.

How can I create this new column with the ID from the item found on df_components?

Upvotes: 2

Views: 104

Answers (1)

jpp
jpp

Reputation: 164613

Your logic seems overcomplicated. Since you are currently creating df_components from dataset, a better idea would be to use Categorical Data with dataset. This means you do not need to create df_components.

Step 1

Convert dataset['COMPDESC'] to categorical.

dataset['COMPDESC'] = dataset['COMPDESC'].astype('category')

Step 2

Create ID from categorical codes. Since categories are alphabetically sorted by default and indexing starts from 0, add 1 to the codes.

dataset['ID'] = dataset['COMPDESC'].cat.codes + 1

If you wish, you can extract the entire categorical mapping to a dictionary:

cat_map = dict(enumerate(dataset['COMPDESC'].cat.categories))

Remember that there always be a 1-offset if you want your IDs to begin at 1. In addition, you will need to update 'ID' explicitly every time 'DESCRIPTION' changes.

Advantages of using categorical data

  • Memory efficient: strings are only stored once.
  • Structure: you define the categories and have an automatic layer of data validation.
  • Consistent: since category to code mappings are always 1-to-1, they will always be consistent, even when new categories are added.

Upvotes: 2

Related Questions