Jack Avante
Jack Avante

Reputation: 1595

Best way to change column data for all rows over multiple dataframes in pandas?

Consider dataframes df1, df2, and df3.

df1 and df2 have an id column, and df3 has a from_id and to_id column.

I need to iterate over all rows of df3, and replace from_id and to_id with new unique randomly generated UUIDs, and then update those in df1 and df2 where (id == from_id) | (id == to_id) (before the change to UUID).

I originally wanted to iterate over all rows of df3 and simply check both df1 and df2 if they contain the original from_id or to_id inside the id column before replacing both, but I found that iterating over pandas rows is a bad idea and slow.

I'm not sure how I can apply the other mentioned methods in that post to this problem since I'm not applying a simple function or calculating anything, and I think the way I had intended to do it would be too slow for big dataframes.

My current method that I believe to be slow and inefficient:

import pandas as pd
import uuid

def rand_uuid():
    return uuid.uuid1()

def update_ids(df_places: pd.DataFrame, df_transitions: pd.DataFrame, df_arcs: pd.DataFrame) -> Tuple[pd.DataFrame, pd.DataFrame, pd.DataFrame]:
    for i in range(len(df_arcs)):
        new_uuid_from = __rand_uuid()
        new_uuid_to = __rand_uuid()
        new_uuid_arc = __rand_uuid()

        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_transitions.loc[df_transitions.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        df_places.loc[df_places.id == df_arcs.iloc[i]['sourceId'], 'id'] = new_uuid_from
        df_places.loc[df_places.id == df_arcs.iloc[i]['destinationId'], 'id'] = new_uuid_to

        df_arcs.iloc[i]['sourceId'] = new_uuid_from
        df_arcs.iloc[i]['destinationId'] = new_uuid_to
        df_arcs.iloc[i]['id'] = new_uuid_arc

    return df_places, df_transitions, df_arcs

Here df_places and df_transitions are above mentioned df1 and df2, and df_arcs is df3

Example df_places

+---+----+
|   | id |
+---+----+
| 1 | a1 |
+---+----+
| 2 | c1 |
+---+----+

Example df_transitions:

+---+----+
|   | id |
+---+----+
| 1 | b1 |
+---+----+

Example df_arcs:

+---+----------+---------------+
|   | sourceId | destinationId |
+---+----------+---------------+
| 1 | a1       | b1            |
+---+----------+---------------+
| 2 | b1       | c1            |
+---+----------+---------------+

Upvotes: 0

Views: 123

Answers (1)

PieCot
PieCot

Reputation: 3639

A very simple approach:

import itertools
import uuid

def rand_uuid():
    return uuid.uuid4()

rep_dict = {i: rand_uuid() for i in itertools.chain(df1.id, df2.id)}

df3.replace(rep_dict, inplace=True)
df3.id = df3.id.map(lambda x: rand_uuid())

df1.replace(rep_dict, inplace=True)
df2.replace(rep_dict, inplace=True)

Upvotes: 2

Related Questions