Reputation: 63
Current dataframe is as follows:
df = pd.read_csv('filename.csv', delimiter=',')
print(df)
idx uniqueID String CaseType
0 1 'hello1' 1.0
1 1 'hello2' 1.0
2 1 'goodbye1' 1.0
3 1 'goodbye2' 1.0
4 2 'hello1' 3.0
5 2 'hello2' 3.0
6 2 'hello3' 3.0
7 3 'goodbye1' 1.0
8 3 'goodbye2' 1.0
9 3 'goodbye3' 1.0
10 4 'hello' 2.0
11 4 'goodbye' 2.0
Expected Output: (Please note they are grouped based on uniqueID, and the case Type follows the last string of the uniqueID.)
idx Source Destination
0 'hello1' 'hello2'
1 'hello2' 'goodbye1'
3 'goodbye1' 'goodbye2'
4 'goodbye2' '1.0'
6 'hello1' 'hello2'
7 'hello2' 'hello3'
8 'hello3' '3.0'
10 'goodbye1' 'goodbye2'
11 'goodbye2' 'goodbye3'
12 'goodbye3' '1.0'
13 'hello' 'goodbye'
14 'goodbye' '2.0'
Question: How do I transform the pandas dataframe in this way?
Currently, I am iterating through every row in a for loop, and for each uniqueId, adding each string+CaseType (at the end) to a list, then splitting up that list and adding it to a new dataframe. It is incredibly slow.
Following this, the next step is to get the total counts/occurences for each row of the output. Essentially, if there are duplicate rows of source:destination (ie, we have 3 rows of 'hello' 'goodbye', it would result in 1 row with 'hello' 'goodbye' with their count as the 3rd column)
Example:
Original:
idx Source Destination
0 'hello1' 'hello2'
1 'hello2' 'hello3'
2 'hello1' 'hello2'
3 'hello4' 'goodbye'
Expected Output:
idx Source Destination Count
0 'hello1' 'hello2' 2
1 'hello2' 'hello3' 1
2 'hello4' 'goodbye' 1
I presume the first step is slightly more complex with pandas logic, and the next step is essentially just combining duplicates and getting their count, but I am new to pandas and not entirely sure how to do either. Thank you in advance.
Upvotes: 1
Views: 74
Reputation: 11321
You could try the following:
df_res = df[["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
df.groupby("uniqueID")["String"].transform("shift", -1)
.fillna(df["CaseType"])
.astype("str")
)
Result for your sample is:
Source Destination
0 hello1 hello2
1 hello2 goodbye1
2 goodbye1 goodbye2
3 goodbye2 1.0
4 hello1 hello2
5 hello2 hello3
6 hello3 3.0
7 goodbye1 goodbye2
8 goodbye2 goodbye3
9 goodbye3 1.0
10 hello goodbye
11 goodbye 2.0
Regarding the second part: Try
df_counts = (
df_res.value_counts(["Source", "Destination"])
.to_frame(name="Counts").reset_index()
)
to get
Source Destination Counts
0 goodbye1 goodbye2 2
1 hello1 hello2 2
2 goodbye 2.0 1
3 goodbye2 1.0 1
4 goodbye2 goodbye3 1
5 goodbye3 1.0 1
6 hello goodbye 1
7 hello2 goodbye1 1
8 hello2 hello3 1
9 hello3 3.0 1
If you want to keep the Source
and Destination
columns in the index then remove .reset_index()
at the end.
Regarding the modification in the comment: You could try more or less the same, but restricted to a filtered df
:
approved = ["hello1", "hello"]
m = df.groupby("uniqueID")["String"].transform("first").isin(approved)
df_res = df.loc[m, ["String"]].rename(columns={"String": "Source"})
df_res["Destination"] = (
df[m].groupby("uniqueID")["String"].transform("shift", -1)
.fillna(df.loc[m, "CaseType"])
.astype("str")
)
df_res = df_res.reset_index(drop=True)
The mask m
removes those groups whose first row String
is not in the list of approved strings.
Result:
Source Destination
0 hello1 hello2
1 hello2 goodbye1
2 goodbye1 goodbye2
3 goodbye2 1.0
4 hello1 hello2
5 hello2 hello3
6 hello3 3.0
7 hello goodbye
8 goodbye 2.0
Upvotes: 1
Reputation: 7627
df = pd.DataFrame({'idx': [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], 'uniqueID': [1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4],
'String': ["'hello1'", "'hello2'", "'goodbye1'", "'goodbye2'", "'hello1'", "'hello2'", "'hello3'",
"'goodbye1'", "'goodbye2'", "'goodbye3'", "'hello'", "'goodbye'"],
'CaseType': [1.0, 1.0, 1.0, 1.0, 3.0, 3.0, 3.0, 1.0, 1.0, 1.0, 2.0, 2.0]}).set_index('idx')
df = df.groupby('uniqueID').apply(
lambda x: pd.DataFrame({'Source': x.String, 'Destination': x.String.shift(-1, fill_value=f'\'{x.CaseType.iloc[0]}\'')}))
print(df)
Source Destination
idx
0 'hello1' 'hello2'
1 'hello2' 'goodbye1'
2 'goodbye1' 'goodbye2'
3 'goodbye2' '1.0'
4 'hello1' 'hello2'
5 'hello2' 'hello3'
6 'hello3' '3.0'
7 'goodbye1' 'goodbye2'
8 'goodbye2' 'goodbye3'
9 'goodbye3' '1.0'
10 'hello' 'goodbye'
11 'goodbye' '2.0'
Upvotes: 0