intStdu

Reputation: 291

Conditional replace rows based on value in other dataframe

I have two dataframes, and I want to reduce the information in the first one, which looks like:

event_timestamp      message_number  an_robot
2015-04-15 12:09:39  10125            robot_7
2015-04-15 12:09:41  10053            robot_4
2015-04-15 12:09:44  10156            robot_7
2015-04-15 12:09:47  20205            robot_108
2015-04-15 12:09:51  10010            robot_38
2015-04-15 12:09:54  10012            robot_65
2015-04-15 12:09:59  10011            robot_39

My other dataframe looks like this:

sequence             support
10053,10156,20205    0.94783
10010,10012          0.93322

I want to collapse every run of rows in dataframe 1 whose message numbers form one of the sequences in dataframe 2. So the new dataframe should be:

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  10053,10156,20205  robot_4,robot_7,robot_108
2015-04-15 12:09:51  10010,10012        robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

Does anybody know how to achieve this? I know how to check whether the values match in exactly one row, but not how to compare multiple rows that have to appear directly after each other.

--- EDIT ---

Maybe to make it a bit simpler: it would also be fine to generate a new message_number for each sequence. The new dataframe could then be:

event_timestamp      message_number    an_robot
2015-04-15 12:09:39  10125              robot_7
2015-04-15 12:09:41  1                  robot_4,robot_7,robot_108
2015-04-15 12:09:51  2                  robot_38,robot_65
2015-04-15 12:09:59  10011              robot_39

Here each sequence found in the sequence dataframe would be written as 0, 1, 2, 3, and so on (up to the last sequence). I could then update the database with the meaning of these new message_number codes. It would be great to also keep the information about which robot did what, but if that is too complex, this is fine as well.
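The id-based grouping described above can be sketched as follows. This is a minimal sketch that assumes each sequence only ever appears as one consecutive run (as in the sample data); all frame and variable names are illustrative:

```python
import pandas as pd

# Illustrative stand-ins for the two dataframes in the question
df1 = pd.DataFrame({
    "event_timestamp": ["2015-04-15 12:09:39", "2015-04-15 12:09:41",
                        "2015-04-15 12:09:44", "2015-04-15 12:09:47",
                        "2015-04-15 12:09:51", "2015-04-15 12:09:54",
                        "2015-04-15 12:09:59"],
    "message_number": [10125, 10053, 10156, 20205, 10010, 10012, 10011],
    "an_robot": ["robot_7", "robot_4", "robot_7", "robot_108",
                 "robot_38", "robot_65", "robot_39"],
})
df2 = pd.DataFrame({"sequence": ["10053,10156,20205", "10010,10012"],
                    "support": [0.94783, 0.93322]})

# Build a message_number -> sequence id lookup (ids 1, 2, ... per sequence)
lookup = {int(m): i + 1
          for i, seq in enumerate(df2["sequence"])
          for m in seq.split(",")}

# Rows belonging to a sequence share that sequence's id as group key;
# all other rows keep their own (unique) message_number.
key = df1["message_number"].map(lambda m: lookup.get(m, m))

out = (df1.groupby(key, sort=False)
          .agg({"event_timestamp": "first",
                "message_number": lambda x: ",".join(map(str, x)),
                "an_robot": ",".join})
          .reset_index(drop=True))
# out now has one row per sequence occurrence plus the unmatched rows
```

Because grouping is by membership rather than adjacency, this only matches the desired output when a sequence's message numbers never occur outside their consecutive run.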

Upvotes: 0

Views: 174

Answers (2)

BENY

Reputation: 323226

I am using unnesting on your df2: split the sequence strings into lists and explode them so each message number keeps the index of its sequence row, map that index back onto df1 to build a group key, and then use groupby with agg:

df2.sequence = df2.sequence.str.split(',')
s = unnesting(df2, ['sequence'])

# map each message number to the index of its sequence row;
# unmatched rows keep their own message_number as group key
groupkey = df1.message_number.map(
    dict(zip(s.sequence.astype(int), s.index))).fillna(df1.message_number)

df1.groupby(groupkey).agg({'event_timestamp': 'first',
                           'message_number': lambda x: ','.join(x.astype(str)),
                           'an_robot': ','.join})
                    event_timestamp            ...                   an_robot
message_number                                 ...
0.0              2015-04-15 12:09:41           ...  robot_4,robot_7,robot_108
1.0              2015-04-15 12:09:51           ...          robot_38,robot_65
10011.0          2015-04-15 12:09:59           ...                   robot_39
10125.0          2015-04-15 12:09:39           ...                    robot_7
[4 rows x 3 columns]

def unnesting(df, explode):
    # repeat each row's index once per element of its list-valued cell
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    # re-attach the remaining columns on the repeated index
    return df1.join(df.drop(columns=explode), how='left')
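To make the helper concrete, here is a self-contained run of unnesting on the question's sequence frame (the variable names below are illustrative):

```python
import numpy as np
import pandas as pd

def unnesting(df, explode):
    # repeat each row's index once per element of its list-valued cell
    idx = df.index.repeat(df[explode[0]].str.len())
    df1 = pd.concat([
        pd.DataFrame({x: np.concatenate(df[x].values)}) for x in explode], axis=1)
    df1.index = idx
    return df1.join(df.drop(columns=explode), how='left')

seq = pd.DataFrame({"sequence": ["10053,10156,20205", "10010,10012"],
                    "support": [0.94783, 0.93322]})
seq["sequence"] = seq["sequence"].str.split(",")
s = unnesting(seq, ["sequence"])
# s has one row per message number, still indexed by the originating
# sequence row: index 0 -> 10053, 10156, 20205 ; index 1 -> 10010, 10012
```

The repeated index is what makes the later `map`/`groupby` step work: it ties every exploded message number back to the sequence it came from.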

Upvotes: 1

DataPsycho

Reputation: 988

If you want to keep it simple, the workflow below is a bit longer, but it reads like a clean data pipeline.

import io

import pandas as pd

df_str_2 = """sequence|support
10053,10156,20205|0.94783
10010,10012|0.93322"""

df_2 = pd.read_csv(io.StringIO(df_str_2), sep='|')

# step 1: transform the df 2
# add a id column
df_2["_id"] = df_2.index + 1 
# split sequence to list
df_2["sequence"] = df_2.sequence.apply(lambda x: x.split(",") if isinstance(x, str) else [])

# put each item from the list to a new row
trns_df_2 = (
    df_2.sequence.apply(pd.Series)
    .merge(df_2, right_index=True, left_index=True)
    .drop(["sequence"], axis=1)
    .melt(id_vars=['support', '_id'], value_name="message_number")
    .drop(["variable", "support"], axis=1)
    .dropna()
    .sort_values("_id", ascending=True)
)
# step 2: merge with df 1
df_str_1 = """event_timestamp|message_number|an_robot
2015-04-15 12:09:39|10125|robot_7
2015-04-15 12:09:41|10053|robot_4
2015-04-15 12:09:44|10156|robot_7
2015-04-15 12:09:47|20205|robot_108
2015-04-15 12:09:51|10010|robot_38
2015-04-15 12:09:54|10012|robot_65
2015-04-15 12:09:59|10011|robot_39"""

df_1 = pd.read_csv(io.StringIO(df_str_1), sep='|')
df_1["message_number"] = df_1.message_number.astype(str)

merged_df = df_1.merge(trns_df_2, on="message_number", how="left")

# keep only the matched rows and collect the other columns into lists per _id
main_df_inner = (
    merged_df[merged_df["_id"].notnull()]
    .groupby("_id")
    .agg({"event_timestamp": lambda x: list(x),
          "message_number": lambda x: list(x),
          "an_robot": lambda x: list(x)})
    .reset_index()
    .drop("_id", axis=1)
)

# take the first timestamp and join the lists into comma-separated strings
main_df_inner["event_timestamp"] = main_df_inner.event_timestamp.apply(lambda x: x[0])
main_df_inner["message_number"] = main_df_inner.message_number.apply(lambda x: ",".join(x))
main_df_inner["an_robot"] = main_df_inner.an_robot.apply(lambda x: ",".join(x))

# keep the rows that matched no sequence
main_df_left = merged_df[merged_df["_id"].isnull()].drop("_id", axis=1)

# concatenate both parts into the final df
main_df = pd.concat([main_df_left, main_df_inner])

What's left is to convert the event_timestamp column to datetime with pd.to_datetime and to sort the dataframe by event_timestamp. I think you can do that yourself.
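That finishing step could look like the sketch below; main_df here is a small stand-in frame with the same columns, since the full pipeline output isn't reproduced:

```python
import pandas as pd

# stand-in for the pipeline's result (concat leaves rows out of time order)
main_df = pd.DataFrame({
    "event_timestamp": ["2015-04-15 12:09:59", "2015-04-15 12:09:39",
                        "2015-04-15 12:09:41"],
    "message_number": ["10011", "10125", "10053,10156,20205"],
    "an_robot": ["robot_39", "robot_7", "robot_4,robot_7,robot_108"],
})

# parse the strings into datetimes, then restore chronological order
main_df["event_timestamp"] = pd.to_datetime(main_df["event_timestamp"])
main_df = main_df.sort_values("event_timestamp").reset_index(drop=True)
```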

Upvotes: 1
