delete column in numpy if a condition is true

Question

I have a large CSV file, 5300 columns by 150 rows. Each column has a title which is a number (except the first column, whose title is a string). Each row has a sample name associated with it. The following is an example of the format:

 s,1,2,3,4,5,6
 a,0,8,7,6,0,5
 b,0,0,5,4,0,3
 c,0,2,3,4,0,3
 d,0,6,7,8,0,9

I wish to remove all columns whose sum (excluding the column title) is zero. What is the fastest way to do this? I tried iterating over the whole file: splitting each line, adding the i-th element to a list, and summing. If the sum == 0, I iterate over the data again and delete the i-th element from each row. The problem with this method is that it takes an impractical amount of time, and I may need to repeat the process for several similar files.

I have some Python experience but am completely new to NumPy. I am unsure how to delete columns according to a condition (such as sum == 0). So far I have:

 data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))

I am thinking to use something like:

 data = delete(data[:,i:])

but I don't know how to check that the sum of a column == 0, or how to apply that across the whole array. If anyone can expand on this or has a better way, please let me know. Thanks.

Here is my exact code:

 import numpy as np

 with open('test.csv','r') as r:
     all_data = [line.strip() for line in r]

 titles = all_data[0].split(',')
 samples = []
 for i in all_data:
     samples.append(i.split(',')[0])

 print titles
 print samples

 data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))

 print data

 numeric_data = data[1:, 1:]
 columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
 print columns_to_keep
 new_data = data[:, columns_to_keep]

 print new_data

Here is the output:

 ['s', '1', '2', '3', '4', '5', '6']
 ['s', 'a', 'b', 'c', 'd']
 [[ 0.  8.  7.  6.]
  [ 0.  0.  5.  4.]
  [ 0.  2.  3.  4.]
  [ 0.  6.  7.  8.]]
 [ True  True  True  True]
 [[ 0.  8.  7.  6.]
  [ 0.  0.  5.  4.]
  [ 0.  2.  3.  4.]
  [ 0.  6.  7.  8.]]

I need the 1st column to be removed, because its sum is 0.

Peter · Accepted Answer

Should be something like:

 numeric_data = data[1:, 1:]
 columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
 # Prepending True means you keep the first column
 new_data = data[:, columns_to_keep]
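
Note that the snippet above assumes `data` still contains the header row and the sample-name column. In the question's code, `np.loadtxt(..., skiprows=1, usecols=range(1,5))` has already dropped both, so `data[1:, 1:]` cuts off a real row and a real column, and the prepended `True` keeps the all-zero first column. That would explain the output shown in the question. `usecols=range(1,5)` also loads only four of the six numeric columns. Below is a minimal sketch of the same boolean-mask idea applied directly to the numeric block. The sample data is inlined through `io.StringIO` so the example runs on its own, and the names `csv_text`, `keep` and `new_titles` are made up for illustration.

```python
import io
import numpy as np

# Sample data from the question, inlined so the sketch is self-contained.
# With a real file, pass the filename to np.loadtxt instead.
csv_text = """s,1,2,3,4,5,6
a,0,8,7,6,0,5
b,0,0,5,4,0,3
c,0,2,3,4,0,3
d,0,6,7,8,0,9
"""

# Column titles from the header row, without the leading 's'.
titles = np.array(csv_text.splitlines()[0].split(',')[1:])

# Numeric block only: skip the header row and the sample-name column.
data = np.loadtxt(io.StringIO(csv_text), delimiter=',', skiprows=1,
                  usecols=range(1, 7))

# Boolean mask with one entry per column: True where the column sum is non-zero.
keep = data.sum(axis=0) != 0

# Apply the same mask to the data and to the titles so they stay aligned.
new_data = data[:, keep]
new_titles = titles[keep]

print(new_titles)
print(new_data)
```

If the values can be negative, a column may sum to zero without being all zeros; in that case `(data != 0).any(axis=0)` is probably the safer mask.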
