Reputation: 1017
I have a dtaframe and I'd like add at it's end 2 rows that will indicate how many cells were between a range of numbers. I'd like to do it to all columns, besides the first and last (I have a big dataframe with a lot of columns). For example, I have the following small scale dataframe:
start position A b rnd
0 149449305 4 99 88
1 47630133 6 50 10
2 128200594 12 7 600
3 49423101 100 10 8
I'd like to count how many times a number is between a range of 0-9, and 10-100 in columns A and B only, and to add it to the end of the dataframe, with the range mentioned at the and of the 'start position' column, like so:
start position A b rnd
0 149449305 4 99 88
1 47630133 6 50 10
2 128200594 12 7 600
3 49423101 100 10 8
4 0-9 2 1
5 10-100 2 3
How can I do that (without indicating the names of the columns, rather a range of their index, since I have a much larger dataframe with many columns)?
Upvotes: 1
Views: 1301
Reputation: 917
You just have to jot down your intervals in the list only once.
intervals=[(0,9), (10,100)]
outside = []
for a,b in intervals:
inside = []
for col in df.columns[1:-1]:
inside.append(df[col].between(a, b).sum())
outside.append(inside)
new = []
for i, interval in enumerate(intervals):
new.append(['-'.join(map(str,interval))] + outside[i] + [''])
new_df = pd.concat([df, pd.DataFrame(new, columns=df.columns)], axis=0, ignore_index=True)
start position A b rnd
0 149449305 4 99 88
1 47630133 6 50 10
2 128200594 12 7 600
3 49423101 100 10 8
4 0-9 2 1
5 10-100 2 3
and it takes,
5.14 ms ± 713 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Upvotes: 0
Reputation: 1845
You can calculate the last 2 rows as a separate dataframe, and than concatenate the 2 dataframes:
df1 = ...
df2 = pd.DataFrame()
df2['start position'] = ['0-9', '10-99']
df2['A'] = [df1['A'][df1['A'] >= 0][df1['A'] <= 9].count(), df1['A'][df1['A'] >= 10][df1['A'] <= 100].count()]
df2['B'] = [df1['B'][df1['B'] >= 0][df1['B'] <= 9].count(), df1['B'][df1['B'] >= 10][df1['B'] <= 100].count()]
result_df = pd.concat([df1, df2])[['start position','A','B','rnd']].reset_index()
Upvotes: 1
Reputation: 323226
IIUC, using pd.cut
with value_counts
get the range count , then we using append
newdf=df.iloc[:,1:-1].apply(lambda x : pd.cut(x,[0,9,100],labels=['0-9','10-100']).value_counts())
df.append(newdf.rename_axis('startposition',axis=0).reset_index())
Out[216]:
A b rnd startposition
0 4 99 88.0 149449305
1 6 50 10.0 47630133
2 12 7 600.0 128200594
3 100 10 8.0 49423101
0 2 3 NaN 10-100
1 2 1 NaN 0-9
df=df.append(newdf.rename_axis('startposition',axis=0).reset_index()).reindex(df.columns,axis=1)
df
Out[217]:
startposition A b rnd
0 149449305 4 99 88.0
1 47630133 6 50 10.0
2 128200594 12 7 600.0
3 49423101 100 10 8.0
0 10-100 2 3 NaN
1 0-9 2 1 NaN
Upvotes: 3