Sadcow
Sadcow

Reputation: 730

Find the values from another data frame with repetitive ids using another frame with unique id in python

I really stuck in this problem and dont have any idea how to solve that. I have two data frame, one is for the humidity and its data are reported every 15 minutes. I have three different sensors for reporting. So, the table includes the id, the date, and hour of the reporting. Here is:

df_h = pd.DataFrame({'id_h': {0: 1, 1: 1, 2: 2, 3: 2, 4: 3, 5: 3}, 'date': {0: '2021-01-01', 1: '2021-01-01', 2: '2021-01-01', 3: '2021-01-01', 4: '2021-01-01', 5: '2021-01-01'}, 'time_hour': {0: '6:00:00', 1: '6:15:00', 2: '6:00:00', 3: '6:15:00', 4: '6:00:00', 5: '6:15:00'}, 'VALUE': {0: 10, 1: 12, 2: 20, 3: 22, 4: 30, 5: 32}})

   id_h        date time_hour  VALUE
0     1  2021-01-01   6:00:00     10
1     1  2021-01-01   6:15:00     12
2     2  2021-01-01   6:00:00     20
3     2  2021-01-01   6:15:00     22
4     3  2021-01-01   6:00:00     30
5     3  2021-01-01   6:15:00     32

with the following code, I can stick its data together and for each id, in each day, I have the humidity.

humidity_sticked = df_h.pivot(index=["id_h", "date"], columns="time_hour", values="VALUE")
humidity_sticked.columns = [f"value_{i+1}" for i in range(humidity_sticked.shape[1])]
humidity_sticked  =humidity_sticked.reset_index()
As we can see, we have a table with three rows and two columns.

Also, I have another table which shows the temperature. But, the id for the weather center is different. For example, for id_h (id of humidity) = 1, 2 we only have the id_t (id of temperature) = 5. So, we have exact same table for the temperature, but since the ids are different, I can not create the same stick table as humidity. Here is the table for the temperature:

df_t = pd.DataFrame({'id_t': {0: 5, 1: 5, 2: 5, 3: 5, 4: 7}, 'date': {0: '2021-01-01', 1: '2021-01-01', 2: '2021-01-01', 3: '2021-01-01', 4: '2021-01-01'}, 'time_hour': {0: '6:00:00', 1: '6:15:00', 2: '6:00:00', 3: '6:15:00', 4: '6:00:00'}, 'VALUE': {0: -1, 1: -8, 2: -2, 3: -9, 4: -3}})

   id_t        date time_hour  VALUE
0     5  2021-01-01   6:00:00     -1
1     5  2021-01-01   6:15:00     -8
2     5  2021-01-01   6:00:00     -2
3     5  2021-01-01   6:15:00     -9
4     7  2021-01-01   6:00:00     -3

When I want to stick the values for id_t=5, I got an error. The desired output which I want is: enter image description here

Explanation: for id_h=1,2 we have two 5. So, for the first two rows we consider as 1, the second two rows as id=2 and the last two rows are for id=3 which are for id_t=7.

Any help can save me!Thanks

update: I've used the merge by the index, however, when I have missing values in one of the data frame, (for example for a specific date, at time 6:00 I have the humidity, but I don't have the temperature). The results are wrong. Here is the the results of the merge by the index, we can see that the time is not same, but it put all them in one row. enter image description here

Upvotes: 0

Views: 68

Answers (2)

jeongwhanchoi
jeongwhanchoi

Reputation: 136

You can use the pd.merge by index. This way is the shortcut to make your 'sticked dataframe'.

pd.merge(df_t, df_h, left_index=True, right_index=True, suffixes=['_t', '_h'])

Output:

 id_t      date_t time_hour_t  VALUE_t  id_h      date_h time_hour_h  \
0     5  2021-01-01     6:00:00       -1     1  2021-01-01     6:00:00   
1     5  2021-01-01     6:15:00       -8     1  2021-01-01     6:15:00   
2     5  2021-01-01     6:00:00       -2     2  2021-01-01     6:00:00   
3     5  2021-01-01     6:15:00       -9     2  2021-01-01     6:15:00   
4     7  2021-01-01     6:00:00       -3     3  2021-01-01     6:00:00   

   VALUE_h  
0       10  
1       12  
2       20  
3       22  
4       30  

The output above contains useless columns, so you can merge df_t and df_h[only you need to merge] like below:

pd.merge(df_t, df_h[['id_h','VALUE']], left_index=True, right_index=True, suffixes=['_t', '_h'])

Output:

   id_t        date time_hour  VALUE_t  id_h  VALUE_h
0     5  2021-01-01   6:00:00       -1     1       10
1     5  2021-01-01   6:15:00       -8     1       12
2     5  2021-01-01   6:00:00       -2     2       20
3     5  2021-01-01   6:15:00       -9     2       22
4     7  2021-01-01   6:00:00       -3     3       30

This is the simplest way you want.

Upvotes: 1

BeRT2me
BeRT2me

Reputation: 13242

df_t['rank'] = df_t.id_t.rank(method='dense')
df_h['rank'] = df_h.id_h.rank(method='dense')
df = df_t.merge(df_h, on=['rank', 'date', 'time_hour'], suffixes=['_1', '_2'])
print(df)

Output:

   id_t        date time_hour  VALUE_1  rank  id_h  VALUE_2
0     5  2021-01-01   6:00:00       -1   1.0     1       10
1     5  2021-01-01   6:00:00       -2   1.0     1       10
2     5  2021-01-01   6:15:00       -8   1.0     1       12
3     5  2021-01-01   6:15:00       -9   1.0     1       12
4     7  2021-01-01   6:00:00       -3   2.0     2       20

Upvotes: 2

Related Questions