unpairestgood
unpairestgood

Reputation: 405

Add UUID's to pandas DF

Say I have a pandas DataFrame like so:

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df:
        Name
    0   John Doe
    1   Jane Smith
    2   John Doe
    3   Jane Smith
    4   Jack Dawson
    5   John Doe

And I want to add a column with uuids that are the same if the name is the same. For example, the DataFrame above should become:

df:
            Name        UUID
        0   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52
        1   Jane Smith  a709bd1a-5f98-4d29-81a8-09de6e675b56
        2   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52
        3   Jane Smith  a709bd1a-5f98-4d29-81a8-09de6e675b56
        4   Jack Dawson 6a495c95-dd68-4a7c-8109-43c2e32d5d42
        5   John Doe    6d07cb5f-7faa-4893-9bad-d85d3c192f52

The uuid's should be generated from the uuid.uuid4() function.

My current idea is to use a groupby("Name").cumcount() to identify which rows have the same name and which are different. Then I'd create a dictionary with a key of the cumcount and a value of the uuid and use that to add the uuids to the DF.

While that would work, I'm wondering if there's a more efficient way to do this?

Upvotes: 12

Views: 16579

Answers (2)

MaFF
MaFF

Reputation: 10076

Grouping the data frame and applying uuid.uuid4 will be more efficient than looping through the groups. Since you want to keep the original shape of your data frame you should use pandas function transform.

Using your sample data frame, we'll add a column in order to have a series to apply transform to. Since uuid.uuid4 doesn't take any argument it really doesn't matter what the column is.

df = pd.DataFrame({'Name': ['John Doe', 'Jane Smith', 'John Doe', 'Jane Smith','Jack Dawson','John Doe']})
df.loc[:, "UUID"] = 1

Now to use transform:

import uuid
df.loc[:, "UUID"] = df.groupby("Name").UUID.transform(lambda g: uuid.uuid4())

+----+--------------+--------------------------------------+
|    |    Name      |                 UUID                 |
+----+--------------+--------------------------------------+
| 0  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
| 1  | Jane Smith   | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 2  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
| 3  | Jane Smith   | a5434e69-bd1c-4d29-8b14-3743c06e1941 |
| 4  | Jack Dawson  | 6b843d0f-ba3a-4880-8a84-d98c4af09cc3 |
| 5  | John Doe     | c032c629-b565-4903-be5c-81bf05804717 |
+----+--------------+--------------------------------------+

uuid.uuid4 will be called as many times as there are distinct groups

Upvotes: 14

SO44
SO44

Reputation: 1329

How about this

names = df['Name'].unique()
for name in names:
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()

could shorten it to

for name in df['Name'].unique():
    df.loc[df['Name'] == name, 'UUID'] = uuid.uuid4()

Upvotes: 7

Related Questions