Python/Pandas - New column based on multiple conditions

Question

I want to create a new column, which has to adhere to multiple conditions in multiple places. I have three dataframes: - The one I'd like to add the column to: data

ID Timestamp
25 201605252300 #Y/M/D/H/M
29 201605252315
94 201605252015

- The one that contains info on data availability: dataavail

Timestamp    Availability
201605252300 True
201605252315 False
201605252015 True

- The one that contains the actual data weatherdata

ID Timestamp    Precipitation
25 201605252300 1

The thing is, that if there is no rain, that is not marked by weatherdata.Precipitation being 0, but by simply missing from weatherdata. However, not all data that is missing means no precipitation, it can also mean technical failure. That's where dataavail comes in. If for a Timestamp in data the Availability is False, that means there was a technical problem and there's no data. If for a Timestamp in data the Availability is True, that means the precipitation was actually 0.

Now, I'd like to combine this info into the data field. So: IF the Timestamp/ID pair from data exists in weatherdata, it should simply pull the info from there. IF the Timestamp/ID pair from data DOES NOT exists in weatherdata, it should check the dataavail dataframe AND IF the Availability is True, the Precipitation should be 0, otherwise is should be NA/NaN so I can impute it later.

So for the above example, the desired output would be

ID Timestamp    Precipitation
25 201605252300 1
29 201605252315 NaN
94 201605252015 0

What's the syntax to use in such a setup, where mutiple conditions have to be satisfied, based on sub-conditions? Thank you!

UPDATE

Tried jezrael's solution. I have the following dataframes:

So then I do mergedAll = pd.merge(imp_df, precip, on=['SId','Time'], how='left') on them. The result:

I've checked programatically, and ALL the data in Intensity is NaN. dtypes match:

UPDATE 2

After sorting my data out, the ˙mask˙ part still doesn't work:

Even though 201311292300,TRUE is in the availability dataframe, the Intensity gets translated to NaN instead of 0.

jezrael · Accepted Answer

I think you need:

#add weatherdata data if exist
df = pd.merge(data, weatherdata, on=['ID','Timestamp'], how='left')

#get Timestamp values where True in column Availability
vals = dataavail.loc[dataavail.Availability, 'Timestamp']
print (vals)
0    201605252300
2    201605252015
Name: Timestamp, dtype: int64

#True if exist Timistamps in vals and if not exist in df (NaN)
mask = df.Timestamp.isin(vals) & df.Precipitation.isnull()
#convert True to 0
df.Precipitation = df.Precipitation.mask(mask, 0)
print (df)
   ID     Timestamp  Precipitation
0  25  201605252300            1.0
1  29  201605252315            NaN
2  94  201605252015            0.0

Python/Pandas - New column based on multiple conditions

Answers (1)

Related Questions