Reputation: 385
I am having trouble thinking pythonically about something, and would love some guidance.
I have a dataframe that contains columns with dates of events at which certain files should be uploaded, and a column with the names of those events. So events can be X, Y, Z, and files can be 1, 2, 3.
Not all files need to be uploaded at all events, i.e. if it's Event X, then files 1, 2, and 3 need to be uploaded, but if it's Event Y, then only file 3 needs to be uploaded. The date columns either have a date in them, or are blank.
What I want to do is, for all the files for events that are not needed, replace blank with "Not Needed".
Initial:
File1 File2 File3
X Aug 1 Sept 1
X Aug 3 Aug 4 Sept 9
Y Sept 10
Z Aug 12
X Aug 13 Aug 15
Z Aug 1
Goal
File1 File2 File3
X Aug 1 Sept 1
X Aug 3 Aug 4 Sept 9
Y NN NN Sept 10
Z Aug 12 NN NN
X Aug 13 Aug 15
Z Aug 1 NN NN
So in other words, for the blanks that SHOULD be blank because a file is not expected, replace that value with "Not Needed", while leaving the other blanks alone.
I have tried doing this with .replace(), .apply() with functions, and I am not having any success.
The code below sort of works, but it works not only when there is a match, but even when there is not a match.
Fill in descriptive text for scales not collected at certain visits (where upload dates would be blank)
df_combined['FAQ-Audio-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")
df_combined['FAQ-Form-upDate'] = np.where(df_combined['VisitName'] == "Screening", "FAQ Not Expected", "")
How can I change the value in one column based on the value in another column, across the entire dataframe? What I want is basically this:
For every row in the dataframe
If the value in the VisitName column == X
Change the value in ColumnA to "Not Expected"
Thank you!!
Upvotes: 0
Views: 6168
Reputation: 131
So, I can answer your base question pretty easily, but there's a couple of style things I think you might want to change that I'd like to get into. I'm fairly certain this question has been addressed in other threads, but you've got a couple of problems wrapped up in 1 so I'm just going to address them here
For every row in the dataframe If the value in the VisitName column == X Change the value in ColumnA to "Not Expected"
You want to be using index slices to set values. Get a boolean mask of the dataframe based on the logic you want, use that to create a new dataframe containing only the rows that you want to update, get the index of this new dataframe, and use this index with the original dataframe to change the values over.
import pandas as pd
df = pd.DataFrame(data=None, index=["X", "Y", "Z"], columns=["VisitName",
"ColumnA", "ColumnB"])
not_expected_index = df[df.loc[:, "VisitName"] == "X"].index
df.loc[not_expected_index, "ColumnA"] = "Not Expected"
This is the preferred way in pandas to change values in a DataFrame based on other values in another column.
Now, there's a couple of things about the original DataFrame you posted that I'd like to mention. First, if you already have Null values in the dataframe cells, then you can use the pandas dataframe fillna method to populate these values.
df.fillna("Not Expected")
Second, why do you want to use the string "NN" or "Not Needed" over the default Null value? For any operations within pandas, I prefer to stick with the actual null values, so that you can use aggregation functions like sum or count freely on dataframes with null values.
Second, the index contains repeated values:
df.index = ["X", "X", "Y", "Z", "X", "Z"]
Dataframes will allow duplicate index values, but they can behave in funny ways that you need to be aware of.
For example:
print(df)
returns
VisitName ColumnA ColumnB
X NaN NaN NaN
X NaN NaN NaN
Y NaN NaN NaN
Z NaN NaN NaN
X NaN NaN NaN
Z NaN NaN NaN
setting a value in VisitName for X
df.loc["X", "VisitName"] = "test"
returns
VisitName ColumnA ColumnB
X "test" NaN NaN
X "test" NaN NaN
Y NaN NaN NaN
Z NaN NaN NaN
X "test" NaN NaN
Z NaN NaN NaN
If I were tackling this problem, I'd either use the date as the index, with a True or False value in a file's column depending on whether it needs to be sent or not on that date,
index File1 File2 File3
8/01/17 True False True
8/08/17 False True True
8/15/17 True True False
8/24/17 False True False
9/01/17 False False False
9/12/17 True False True
or I'd just use an integer index, with a column for the date and a column for what file is being sent.
index date file
0 8/01/17 1
1 8/01/17 2
2 8/08/17 2
3 8/15/17 1
4 8/15/17 2
5 8/15/17 3
I mean, if you're locked into using the other setup, that's fine, but I think these would be easier dataframe setups to work with, since they'd support groupby methods more easily.
Also, keep in mind that if you're using a for loop, then you might as well not be using pandas. The whole point of pandas is that it uses C to speed up index operations. Never use
for row in df.index:
df.loc[row, 'columna'] += 2.
Always use
df.loc[:, 'columna'] += 2.
Upvotes: 6