Reputation: 35
I need a count of non-zero variables in pairs of rows.
I have a dataframe that lists density of species found at several sampling points. I need to know the total number of species found at each pair of sampling points. Here is an example of my data:
>>> import pandas
>>> df = pd.DataFrame({'ID':[111,222,333,444],'minnow':[1,3,5,4],'trout':[2,0,0,3],'bass':[0,1,3,0],'gar':[0,1,0,0]})
>>> df
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
I will pair the rows by ID number, so the pair (111,222) should return a total of 4, while the pair (111,333) should return a total of 3. I know I can get a sum of non-zeros for each row, but if I add those totals for each pair I will be double counting some of the species.
Upvotes: 2
Views: 1041
Reputation: 221564
Here's an approach with NumPy -
In [35]: df
Out[35]:
ID bass gar minnow trout
0 111 0 0 1 2
1 222 1 1 3 0
2 333 3 0 5 0
3 444 0 0 4 3
In [36]: a = df.iloc[:,1:].values!=0
In [37]: r,c = np.triu_indices(df.shape[0],1)
In [38]: l = df.ID
In [39]: pd.DataFrame(np.column_stack((l[r], l[c], (a[r] | a[c]).sum(1))))
Out[39]:
0 1 2
0 111 222 4
1 111 333 3
2 111 444 2
3 222 333 3
4 222 444 4
5 333 444 3
Upvotes: 1
Reputation: 38415
You can do this using iloc for slicing and numpy
np.sum((df.iloc[[0, 1], 1:]!=0).any(axis=0))
Here df.iloc[[0, 1], 1:] gives you first two rows and numpy sum is counting the total number of non zero pairs in the selected row. You can use df.iloc[[0, 1], 1:] to select any combination of rows.
Upvotes: 0
Reputation: 7404
If the rows are sorted so that the two groups occur one after another, you could do
import pandas as pd
import numpy as np
x = np.random.randint(0,2,(10,3))
df = pd.DataFrame(x)
pair_a = df.loc[::2].reset_index(drop = True)
pair_b = df.loc[1::2].reset_index(drop = True)
paired = pd.concat([pair_a,pair_b],axis = 1)
Then find where paired
is non-zero.
Upvotes: 0