Reputation: 12406
I am looking to perform some Inner Joins in Pandas, using Python 2.7. Here is the dataset that I am working with:
import pandas as pd
import numpy as np

columns = ['s_id', 'c_id', 'c_col1']
index = np.arange(46)  # array of numbers for the number of samples
df = pd.DataFrame(columns=columns, index=index)

# .loc slices are label-based and inclusive, hence the off-by-one
# versus positional slices like [:15]
df.loc[:14, 's_id'] = 144
df.loc[15:26, 's_id'] = 105
df.loc[27:45, 's_id'] = 52
df.loc[:4, 'c_id'] = 1
df.loc[5:9, 'c_id'] = 2
df.loc[10:14, 'c_id'] = 3
df.loc[15:18, 'c_id'] = 1
df.loc[19:26, 'c_id'] = 2
df.loc[27:33, 'c_id'] = 1
df.loc[34:38, 'c_id'] = 2
df.loc[39:45, 'c_id'] = 3
df.loc[:4, 'c_col1'] = ['H', 'C', 'N', 'O', 'S']
df.loc[5:9, 'c_col1'] = ['C', 'O', 'S', 'K', 'Ca']
df.loc[10:14, 'c_col1'] = ['H', 'O', 'F', 'Ne', 'Si']
df.loc[15:18, 'c_col1'] = ['C', 'O', 'F', 'Zn']
df.loc[19:26, 'c_col1'] = ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']
df.loc[27:33, 'c_col1'] = ['H', 'He', 'Li', 'B', 'N', 'Al', 'Si']
df.loc[34:38, 'c_col1'] = ['N', 'F', 'Ne', 'Na', 'P']
df.loc[39:45, 'c_col1'] = ['C', 'N', 'O', 'F', 'K', 'Ca', 'Fe']
Here is the dataframe:
s_id c_id c_col1
0 144 1 H
1 144 1 C
2 144 1 N
3 144 1 O <--
4 144 1 S
5 144 2 C
6 144 2 O <--
7 144 2 S
8 144 2 K
9 144 2 Ca
10 144 3 H
11 144 3 O <--
12 144 3 F
13 144 3 Ne
14 144 3 Si
15 105 1 C
16 105 1 O
17 105 1 F
18 105 1 Zn
19 105 2 N
20 105 2 O
21 105 2 F
22 105 2 Fe
23 105 2 Zn
24 105 2 Gd
25 105 2 Hg
26 105 2 Pb
27 52 1 H
28 52 1 He
29 52 1 Li
30 52 1 B
31 52 1 N
32 52 1 Al
33 52 1 Si
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
39 52 3 C
40 52 3 N
41 52 3 O
42 52 3 F
43 52 3 K
44 52 3 Ca
45 52 3 Fe
I need to do the following in Pandas:
Inner join the separate dataframes produced in a.) on the elements column (c_col1). This is a little difficult to describe, so here is the dataframe I would like to get from this step:
index s_id c_id c_col1
0      144    1      O
1      144    2      O
2      144    3      O
3      105    1      O
4      105    1      F
5      105    1      Zn
6      105    2      O
7      105    2      F
8      105    2      Zn
9       52    1      N
10      52    2      N
11      52    3      N
As you can see, what I am looking for in part 2.) is the following: within each s_id, I want those c_col1 values that occur for all of that s_id's c_id values. For example, for s_id = 144, only O (oxygen) occurs for c_id = 1, 2 and 3; I have marked these entries with "<--" in the raw data. So I would like the dataframe to show O three times in the c_col1 column, with corresponding c_id entries of 1, 2, 3.
How can this be done with Pandas?
Upvotes: 1
Views: 2716
Reputation: 13965
Producing the separate dataframes is easy enough. How would you want to store them? One way would be in a nested dict where the outer keys are the s_id and the inner keys are the c_id and the inner values are the data. That you can do with a fairly long but straightforward dict comprehension:
DF_dict = {s_id:
               {c_id: df[(df.s_id == s_id) & (df.c_id == c_id)]
                for c_id in df[df.s_id == s_id]['c_id'].unique()}
           for s_id in df.s_id.unique()}
Then for example:
In [12]: DF_dict[52][2]
Out[12]:
s_id c_id c_col1
34 52 2 N
35 52 2 F
36 52 2 Ne
37 52 2 Na
38 52 2 P
I do not understand part two of your question. You want to join the data within each s_id? Could you show what the expected output would be? If you want to do something within each s_id, you might be better off exploring groupby options. Perhaps someone else understands what you want, but if you can clarify, I might be able to show a better option that skips the first part of the question...
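For reference, groupby does the per-s_id splitting mentioned above without any boolean-mask bookkeeping. A minimal sketch on a toy frame (the data here are made up purely for illustration):

```python
import pandas as pd

# toy frame, same column layout as the question's data
toy = pd.DataFrame({'s_id':   [144, 144, 105],
                    'c_id':   [1, 2, 1],
                    'c_col1': ['H', 'C', 'O']})

# one sub-frame per s_id, keyed by the group label
parts = {s_id: group for s_id, group in toy.groupby('s_id')}
```

Each value in `parts` is the slice of the frame for that s_id, with its original index intact.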
EDIT:

It seems to me that you should go straight to problem 2 if problem 1 is simply a step you believe to be necessary for a problem-2 solution. In fact it is entirely unnecessary. To solve your second problem you need to group the data by s_id and transform each group according to your requirements. To sum up your requirements as I see them, the rule is as follows: for each group of data grouped by s_id, return only those c_col1 values that appear under every value of c_id.
You might write a function like this:
def c_id_overlap(df):
    common_vals = []  # values of c_col1 that occur in every c_id subgroup
    c_ids = df.c_id.unique()  # unique values of c_id
    c_col1_values = set(df.c_col1)  # set of c_col1 values
    # nested list of values: each inner list holds the c_col1 values for one c_id
    nested_c_col_vals = [list(df[df.c_id == ID]['c_col1'].unique()) for ID in c_ids]
    # keep the c_col1 values that appear in every nested list
    for val in c_col1_values:
        if all([val in elem for elem in nested_c_col_vals]):
            common_vals.append(val)
    # return the slice of the dataframe whose c_col1 values occur in every c_id
    return df[df.c_col1.isin(common_vals)]
and then pass it to apply on data grouped by s_id:
df.groupby('s_id', as_index=False).apply(c_id_overlap)
which gives me the following output:
       s_id  c_id c_col1
0 31     52     1      N
  34     52     2      N
  40     52     3      N
1 16    105     1      O
  17    105     1      F
  18    105     1      Zn
  20    105     2      O
  21    105     2      F
  23    105     2      Zn
2 3     144     1      O
  6     144     2      O
  11    144     3      O
Which seems to be what you are looking for.
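As an aside, the membership bookkeeping can also be expressed without a custom function at all: count how many distinct c_id values each s_id has, count under how many of them each element actually appears, and keep the rows where the two counts match. A vectorized sketch (this is an alternative of mine, not the approach above; the frame is rebuilt inline so the snippet stands alone):

```python
import pandas as pd

# rebuild the question's 46-row frame in one constructor call
df = pd.DataFrame({
    's_id': [144]*15 + [105]*12 + [52]*19,
    'c_id': [1]*5 + [2]*5 + [3]*5 + [1]*4 + [2]*8 + [1]*7 + [2]*5 + [3]*7,
    'c_col1': ['H','C','N','O','S','C','O','S','K','Ca','H','O','F','Ne','Si',
               'C','O','F','Zn','N','O','F','Fe','Zn','Gd','Hg','Pb',
               'H','He','Li','B','N','Al','Si','N','F','Ne','Na','P',
               'C','N','O','F','K','Ca','Fe'],
})

# how many distinct c_id values exist within each s_id ...
n_cids = df.groupby('s_id')['c_id'].transform('nunique')
# ... and under how many of them each (s_id, c_col1) pair appears
n_seen = df.groupby(['s_id', 'c_col1'])['c_id'].transform('nunique')

# keep only elements present under every c_id of their s_id
result = df[n_seen == n_cids]
```

This yields the same twelve rows as the groupby/apply version, without Python-level loops.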
EDIT (additional explanation):

So apply passes each chunk of grouped data to the function, and the pieces are glued back together once this has been done for each group of data.

So think about the first group passed, where s_id == 105. The first line of the function creates an empty list, common_vals, which will contain those periodic elements that appear in every subgroup of the data (i.e. under each value of c_id).
The second line gets the unique values of c_id, in this case [1, 2], and stores them in an array called c_ids. The third line creates a set of the c_col1 values, which in this case produces:
{'C', 'F', 'Fe', 'Gd', 'Hg', 'N', 'O', 'Pb', 'Zn'}
The fourth line creates a nested list structure, nested_c_col_vals, where every inner list holds the unique c_col1 values for one of the entries in the c_ids array. In this case it looks like this:
[['C', 'O', 'F', 'Zn'], ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]
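Incidentally, the "appears in every inner list" check described next is equivalent to a set intersection over the inner lists; a minimal sketch using the lists above:

```python
# the nested lists from the s_id == 105 group, as shown above
nested_c_col_vals = [['C', 'O', 'F', 'Zn'],
                     ['N', 'O', 'F', 'Fe', 'Zn', 'Gd', 'Hg', 'Pb']]

# elements common to every inner list
common = set.intersection(*map(set, nested_c_col_vals))
```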
Now each element of the c_col1_values set is iterated over, and for each one the program determines whether it appears in every inner list of the nested_c_col_vals object. The built-in all() function returns True when every item in the sequence between the brackets is truthy, i.e. True or non-zero. So:
In [10]: all([True, True, True])
Out[10]: True
In [11]: all([True, True, True, False])
Out[11]: False
In [12]: all([True, True, True, 1])
Out[12]: True
In [13]: all([True, True, True, 0])
Out[13]: False
In [14]: all([True, 1, True, 0])
Out[14]: False
So in this case, let's say 'C' is the first element iterated over. The list comprehension inside the all() brackets looks inside each inner list and checks whether the element is there: True if it is, False if it is not. For 'C' this resolves to:
all([True, False])
which is of course False. Now, when the element is 'Zn', the result of this operation is
all([True, True])
which resolves to True. Therefore 'Zn' is appended to the common_vals list.
Once the process is complete, the values inside common_vals are:
['O', 'F', 'Zn']
The return statement simply slices the data chunk according to whether the values of c_col1 are in the common_vals list, as described above.
This is then repeated for each of the remaining groups and the data are glued back together.
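Since the original question framed this as inner joins: the per-group intersection can indeed be written as a chain of inner merges on c_col1. A sketch using the s_id == 144 element lists (the frames here are built by hand for illustration):

```python
from functools import reduce
import pandas as pd

# element lists for c_id == 1, 2, 3 within s_id == 144
frames = [pd.DataFrame({'c_col1': ['H', 'C', 'N', 'O', 'S']}),
          pd.DataFrame({'c_col1': ['C', 'O', 'S', 'K', 'Ca']}),
          pd.DataFrame({'c_col1': ['H', 'O', 'F', 'Ne', 'Si']})]

# successive inner joins keep only elements present in every frame
common = reduce(lambda left, right: left.merge(right, on='c_col1', how='inner'),
                frames)
```

Only 'O' survives all three joins, matching the groupby result for s_id == 144.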
Hope this helps
Upvotes: 2