Reputation: 1644
I have a large csv file 5300 columns by 150 rows. Each column has a title which is a number (except the first column which is a string). Each row has a sample name associated with it. The following is an example of the format:
s,1,2,3,4,5,6
a,0,8,7,6,0,5
b,0,0,5,4,0,3
c,0,2,3,4,0,3
d,0,6,7,8,0,9
I wish to remove all columns who's sum (excluding the column title) is zero. What is the fastest way to do this? I tried iterating over the whole thing by spliting each line, adding the 'ith' element to a list and summing. If the sum == 0 then iterating over the data again and deleting the 'ith' element from each row. Problem with this method is it takes an unpractical amount of time and I may need to repeat the process for several similar files.
I have some python experience but am completely new to numpy. I am unsure how to delete by columns according to a condition (such as sum == 0). So far I have:
data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))
I am thinking to use something like:
data = delete(data[:,i:])
but don't know how to check that the sum of the column == 0 and how to apply across the array. If anyone can expand on this or has a better way then please let me know. Thanks
Here is my exact code:
import numpy as np
with open('test.csv','r') as r:
all_data = [line.strip() for line in r]
titles = all_data[0].split(',')
samples = []
for i in all_data:
samples.append(i.split(',')[0])
print titles
print samples
data = np.loadtxt('test.csv', delimiter=',', skiprows=1, usecols=range(1,5))
print data
numeric_data = data[1:, 1:]
columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
print columns_to_keep
new_data = data[:, columns_to_keep]
print new_data
Here is the output:
['s', '1', '2', '3', '4', '5', '6']
['s', 'a', 'b', 'c', 'd']
[[ 0. 8. 7. 6.]
[ 0. 0. 5. 4.]
[ 0. 2. 3. 4.]
[ 0. 6. 7. 8.]]
[ True True True True]
[[ 0. 8. 7. 6.]
[ 0. 0. 5. 4.]
[ 0. 2. 3. 4.]
[ 0. 6. 7. 8.]]
I need it so the 1st column is removed because its sum is '0'
Upvotes: 2
Views: 1569
Reputation: 1
In the event someone is interested in how to extend this to an array with more than two dimensions, a somewhat more streamlined way to accomplish this task would be as follows.
Assume the array is a 3-D array (e.g. time, lat, lon) and that some of the time steps have values of zero for all latitudes and longitudes, and you would like to delete those time steps.
new_data = data[data.sum(axis=(1,2))>0]
Upvotes: 0
Reputation: 13485
Should be something like:
numeric_data = data[1:, 1:]
columns_to_keep = np.concatenate([[True], np.sum(numeric_data, axis = 0) != 0])
# Prepending True means you keep the first column
new_data = data[:, columns_to_keep])
Upvotes: 5