user3062260
user3062260

Reputation: 1644

delete column in numpy if a condition is true

I have a large csv file 5300 columns by 150 rows. Each column has a title which is a number (except the first column which is a string). Each row has a sample name associated with it. The following is an example of the format:

 s,1,2,3,4,5,6
 a,0,8,7,6,0,5
 b,0,0,5,4,0,3
 c,0,2,3,4,0,3
 d,0,6,7,8,0,9

I wish to remove all columns who's sum (excluding the column title) is zero. What is the fastest way to do this? I tried iterating over the whole thing by spliting each line, adding the 'ith' element to a list and summing. If the sum == 0 then iterating over the data again and deleting the 'ith' element from each row. Problem with this method is it takes an unpractical amount of time and I may need to repeat the process for several similar files.

I have some python experience but am completely new to numpy. I am unsure how to delete by columns according to a condition (such as sum == 0). So far I have:

 data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))

I am thinking to use something like:

 data = delete(data[:,i:])

but don't know how to check that the sum of the column == 0 and how to apply across the array. If anyone can expand on this or has a better way then please let me know. Thanks

Here is my exact code:

 import numpy as np

 with open('test.csv','r') as r:
     all_data = [line.strip() for line in r]

 titles = all_data[0].split(',')
 samples = []
 for i in all_data:
     samples.append(i.split(',')[0])

 print titles
 print samples

 data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))

 print data

 numeric_data = data[1:, 1:]
 columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
 print columns_to_keep
 new_data = data[:, columns_to_keep]

 print new_data 

Here is the output:

 ['s', '1', '2', '3', '4', '5', '6']
 ['s', 'a', 'b', 'c', 'd']
 [[ 0.  8.  7.  6.]
  [ 0.  0.  5.  4.]
  [ 0.  2.  3.  4.]
  [ 0.  6.  7.  8.]]
 [ True  True  True  True]
 [[ 0.  8.  7.  6.]
  [ 0.  0.  5.  4.]
  [ 0.  2.  3.  4.]
  [ 0.  6.  7.  8.]]  

I need it so the 1st column is removed because its sum is '0'

Upvotes: 2

Views: 1569

Answers (2)

fastmath
fastmath

Reputation: 1

In the event someone is interested in how to extend this to an array with more than two dimensions, a somewhat more streamlined way to accomplish this task would be as follows.

Assume the array is a 3-D array (e.g. time, lat, lon) and that some of the time steps have values of zero for all latitudes and longitudes, and you would like to delete those time steps.

new_data  = data[data.sum(axis=(1,2))>0]

Upvotes: 0

Peter
Peter

Reputation: 13485

Should be something like:

numeric_data = data[1:, 1:]
columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
# Prepending True means you keep the first column
new_data = data[:, columns_to_keep])

Upvotes: 5

Related Questions