Reputation: 12406
I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np

columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46)  # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index=index)

# .loc slices are label-based and inclusive, hence the off-by-one
# versus positional slices like [:15]
df.loc[:14, 's_id'] = 144
df.loc[15:26, 's_id'] = 105
df.loc[27:45, 's_id'] = 52
df.loc[:4, 'c_id'] = 1
df.loc[5:9, 'c_id'] = 2
df.loc[10:14, 'c_id'] = 3
df.loc[15:18, 'c_id'] = 1
df.loc[19:26, 'c_id'] = 2
df.loc[27:33, 'c_id'] = 1
df.loc[34:38, 'c_id'] = 2
df.loc[39:45, 'c_id'] = 3
df.loc[:4, 'c_col1'] = ['H', 'C', 'N', 'O', 'S']
df.loc[5:9, 'c_col1'] = ['C', 'O', 'S', 'K', 'Ca']
df.loc[10:14, 'c_col1'] = ['H', 'O', 'F', 'Ne', 'Si']
df.loc[15:18, 'c_col1'] = ['C', 'O', 'F', 'Zn']
df.loc[19:26, 'c_col1'] = ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']
df.loc[27:33, 'c_col1'] = ['H', 'He', 'Li', 'B', 'N', 'Al', 'Si']
df.loc[34:38, 'c_col1'] = ['N', 'F', 'Ne', 'Na', 'P']
df.loc[39:45, 'c_col1'] = ['C', 'N', 'O', 'F', 'K', 'Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
Inner join the separate dataframes produced in a.) on the elements column (c_col1). This is a little difficult to describe, so here is the dataframe I would like to get from this step:
index s_id c_id c_col1
0      144    1      O
1      144    2      O
2      144    3      O
3      105    1      O
4      105    1      F
5      105    1      Zn
6      105    2      O
7      105    2      F
8      105    2      Zn
9       52    1      N
10      52    2      N
11      52    3      N
As you can see, what I am looking for in part 2.) is the following: within each s_id, I want those c_col1 values that occur for all of that s_id's c_id values. For example, for s_id = 144, only O (oxygen) occurs for c_id = 1, 2 and 3; I have marked these entries with "<--" in the raw data. So I would like the dataframe to show O three times in the c_col1 column, with corresponding c_id entries of 1, 2, 3.
How can this be done with Pandas?
Upvotes: 1
Views: 2716
Reputation: 13965
Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id and the inner keys are the c_id and the inner values are the data. That you can do with a fairly long but straightforward dict comprehension:
DF_dict = {s_id:
               {c_id: df[(df.s_id == s_id) & (df.c_id == c_id)]
                for c_id in df[df.s_id == s_id]['c_id'].unique()}
           for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. You want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id, you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify, I might be able to show a better option that skips the first part of the question...
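For reference, groupby does the per-s_id splitting mentioned above without any boolean-mask bookkeeping. A minimal sketch on a toy frame (the data here are made up purely for illustration):

```python
import pandas as pd

# toy frame, same column layout as the question's data
toy = pd.DataFrame({'s_id':   [144, 144, 105],
                    'c_id':   [1, 2, 1],
                    'c_col1': ['H', 'C', 'O']})

# one sub-frame per s_id, keyed by the group label
parts = {s_id: group for s_id, group in toy.groupby('s_id')}
```

Each value in `parts` is the slice of the frame for that s_id, with its original index intact.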
EDIT:

It seems to me that you should go straight to problem 2 if problem 1 is simply a step you believe to be necessary for a problem-2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform each group according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear under every value of c_id.
You might write a function like this:
def c_id_overlap(df):
    common_vals = []  # values of c_col1 that occur in every c_id subgroup
    c_ids = df.c_id.unique()  # unique values of c_id
    c_col1_values = set(df.c_col1)  # set of c_col1 values
    # nested list of values: each inner list holds the c_col1 values for one c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    # keep the c_col1 values that appear in every nested list
    for val in c_col1_values:
        if all([val in elem for elem in nested_c_col_vals]):
            common_vals.append(val)
    # return the slice of the dataframe whose c_col1 values occur in every c_id
    return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index=False).apply(c_id_overlap)
which gives me the following output:
       s_id  c_id c_col1
0 31     52     1      N
  34     52     2      N
  40     52     3      N
1 16    105     1      O
  17    105     1      F
  18    105     1      Zn
  20    105     2      O
  21    105     2      F
  23    105     2      Zn
2 3     144     1      O
  6     144     2      O
  11    144     3      O
Which seems to be what you are looking for.
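As an aside, the membership bookkeeping can also be expressed without a custom function at all: count how many distinct c_id values each s_id has, count under how many of them each element actually appears, and keep the rows where the two counts match. A vectorized sketch (this is an alternative of mine, not the approach above; the frame is rebuilt inline so the snippet stands alone):

```python
import pandas as pd

# rebuild the question's 46-row frame in one constructor call
df = pd.DataFrame({
    's_id': [144]*15 + [105]*12 + [52]*19,
    'c_id': [1]*5 + [2]*5 + [3]*5 + [1]*4 + [2]*8 + [1]*7 + [2]*5 + [3]*7,
    'c_col1': ['H','C','N','O','S','C','O','S','K','Ca','H','O','F','Ne','Si',
               'C','O','F','Zn','N','O','F','Fe','Zn','Gd','Hg','Pb',
               'H','He','Li','B','N','Al','Si','N','F','Ne','Na','P',
               'C','N','O','F','K','Ca','Fe'],
})

# how many distinct c_id values exist within each s_id ...
n_cids = df.groupby('s_id')['c_id'].transform('nunique')
# ... and under how many of them each (s_id, c_col1) pair appears
n_seen = df.groupby(['s_id', 'c_col1'])['c_id'].transform('nunique')

# keep only elements present under every c_id of their s_id
result = df[n_seen == n_cids]
```

This yields the same twelve rows as the groupby/apply version, without Python-level loops.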
EDIT (additional explanation):

So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.

So think about the first group passed, where s_id == 105. The first line of the function creates an empty list, common_vals, which will contain those periodic elements that appear in every subgroup of the data (i.e. under each value of c_id).
The second line gets the unique values of c_id, in this case [1, 2], and stores them in an array called c_ids. The third line creates a set of the c_col1 values, which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure, nested_c_col_vals, where every inner list holds the unique c_col1 values for one of the entries in the c_ids array. In this case it looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
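Incidentally, the "appears in every inner list" check described next is equivalent to a set intersection over the inner lists; a minimal sketch using the lists above:

```python
# the nested lists from the s_id == 105 group, as shown above
nested_c_col_vals = [['C', 'O', 'F', 'Zn'],
                     ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]

# elements common to every inner list
common = set.intersection(*map(set, nested_c_col_vals))
```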
Now each element of the c_col1_values set is iterated over, and for each one the program determines whether it appears in every inner list of the nested_c_col_vals object. The built-in all() function returns True when every item in the sequence between the brackets is truthy, i.e. True or non-zero. So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets looks inside each inner list and checks whether the element is there: True if it is, False if it is not. For 'C' this resolves to:
all([True, False])
which is of course False. Now, when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete, the values inside common_vals are:
['O', 'F', 'Zn']
The return statement simply slices the data chunk according to whether the values of c_col1 are in the common_vals list, as described above.
This is then repeated for each of the remaining groups and the data are glued back together.
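Since the original question framed this as inner joins: the per-group intersection can indeed be written as a chain of inner merges on c_col1. A sketch using the s_id == 144 element lists (the frames here are built by hand for illustration):

```python
from functools import reduce
import pandas as pd

# element lists for c_id == 1, 2, 3 within s_id == 144
frames = [pd.DataFrame({'c_col1': ['H', 'C', 'N', 'O', 'S']}),
          pd.DataFrame({'c_col1': ['C', 'O', 'S', 'K', 'Ca']}),
          pd.DataFrame({'c_col1': ['H', 'O', 'F', 'Ne', 'Si']})]

# successive inner joins keep only elements present in every frame
common = reduce(lambda left, right: left.merge(right, on='c_col1', how='inner'),
                frames)
```

Only 'O' survives all three joins, matching the groupby result for s_id == 144.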
Hope this helps
Upvotes: 2