Nat

Reputation: 345

select specific rows with numpy

I have an ASCII file with 3 columns and multiple rows, as follows:

col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535

I want to write out an ASCII file with the same 3 columns but only the rows where col2 = "cb". I am trying to do this in Python using numpy:

import numpy as np
file="input.txt"
outfile="out.txt"
col1 = np.loadtxt(file,delimiter=',',skiprows=1,usecols = 0)
col2 = np.loadtxt(file,delimiter=',',skiprows=1,usecols = 1, dtype=str)  # np.str is removed in NumPy 1.24+
col3 = np.loadtxt(file,delimiter=',',skiprows=1,usecols = 2)

When I use np.core.defchararray.find, I only get the position of the substring within each element (-1 where it is absent), not the rows themselves:

cb_rows=np.core.defchararray.find(col2.astype(str),'cb')

How can I save a new file with the 3 columns, keeping only the rows where col2 = "cb" (in effect, the rows flagged by cb_rows)?

Upvotes: 0

Views: 700

Answers (3)

Ombrophile

Reputation: 663

If you want to achieve this by using numpy only, then try this:

import numpy as np


a = np.loadtxt('input.txt', dtype=str, delimiter=',')
msk = a[:, 1] == 'cb'
b = a[msk, :]
np.savetxt('out.txt', b, delimiter=',', fmt='%s', header='col1,col2,col3')
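One thing to be aware of: np.savetxt prefixes the header line with '# ' by default, so the output would start with '# col1,col2,col3'. If you want a plain header, pass comments='' as well. A minimal sketch (using an already-filtered in-memory array and a StringIO buffer in place of the output file):

```python
import io
import numpy as np

# Rows already filtered to col2 == 'cb', as a string array like b above
b = np.array([['-3', 'cb', '7.9356'],
              ['-2', 'cb', '12.9229']])

buf = io.StringIO()
# comments='' stops savetxt prefixing the header with '# '
np.savetxt(buf, b, delimiter=',', fmt='%s',
           header='col1,col2,col3', comments='')
print(buf.getvalue().splitlines()[0])  # → col1,col2,col3
```

With a real file, just swap buf for 'out.txt'.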




Edit

Since the second column of b is now redundant, you can avoid writing it by:

np.savetxt('out.txt', b[:, 0:3:2], delimiter=',', fmt='%s', header='col1,col3')

As is obvious, a and b are string arrays. If, however, you need to do some further processing of the 1st and last columns using their native types, then use the following variables:

cb_col1 = b[:, 0].astype(int)
cb_col3 = b[:, 2].astype(float)

Upvotes: 1

Balaji Ambresh

Reputation: 5037

Seems like there's something wrong with this row. There are 4 columns instead of 3:

-2,cb,1,13.2063

You can use pandas for this:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'] == 'cb']
print(df)

Output:

   col1 col2     col3
0    -3   cb   7.9356
2    -2   cb  12.9229
3    -2   cb  13.2063

To match rows where col2 merely contains 'cb' as a substring, use this:

import pandas as pd

df = pd.read_csv('input.csv')
df = df[df['col2'].str.contains('cb')]
print(df)

If the input.csv looks like this:

col1,col2,col3
-3,cb_01,7.9356
-3,mocb,13.0402
-2,1cb2,12.9229
-2,cb,13.2063
-1,va,12.7535

Output

   col1   col2     col3
0    -3  cb_01   7.9356
1    -3   mocb  13.0402
2    -2   1cb2  12.9229
3    -2     cb  13.2063
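Since the question also asks to save the result, the filtered frame can be written back out with to_csv; pass index=False to keep the row index out of the file. A sketch on the in-memory sample (swap the StringIO for 'input.csv' and call df.to_csv('out.txt', index=False) with real files):

```python
import io
import pandas as pd

csv = io.StringIO(
    "col1,col2,col3\n"
    "-3,cb,7.9356\n"
    "-3,mo,13.0402\n"
    "-2,cb,12.9229\n"
    "-2,cb,13.2063\n"
    "-1,va,12.7535\n")

df = pd.read_csv(csv)
df = df[df['col2'] == 'cb']
# index=False drops the 0,2,3 row labels from the output
out = df.to_csv(index=False)
print(out)
```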

Upvotes: 1

hpaulj

Reputation: 231355

Your sample, corrected for the extra column in line 5:

In [4]: txt = """col1, col2, col3 
   ...: -3,cb,7.9356 
   ...: -3,mo,13.0402 
   ...: -2,cb,12.9229 
   ...: -2,cb,13.2063 
   ...: -1,va,12.7535"""

Using the full power of genfromtxt I can load it as a structured array:

In [5]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, names=True, encoding=None)                                                    
In [6]: data                                                                    
Out[6]: 
array([(-3, 'cb',  7.9356), (-3, 'mo', 13.0402), (-2, 'cb', 12.9229),
       (-2, 'cb', 13.2063), (-1, 'va', 12.7535)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])

The 'col2' field is already string dtype, so it can be tested directly against a string like 'cb':

In [7]: data['col2']                                                            
Out[7]: array(['cb', 'mo', 'cb', 'cb', 'va'], dtype='<U2')
In [8]: idx=data['col2']=='cb'                                                  
In [9]: idx                                                                     
Out[9]: array([ True, False,  True,  True, False])

We can then easily select just the 'cb' records from data:

In [10]: data[idx]                                                              
Out[10]: 
array([(-3, 'cb',  7.9356), (-2, 'cb', 12.9229), (-2, 'cb', 13.2063)],
      dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])
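For completeness, the selected records can be written back out with savetxt as well; a structured array takes a multi-field format string, one specifier per field. A sketch using the sample above (StringIO stands in for 'out.txt'):

```python
import io
import numpy as np

txt = """col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535"""

# Load as a structured array, as in the answer above
data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None,
                     names=True, encoding=None)
cb = data[data['col2'] == 'cb']

buf = io.StringIO()
# one % specifier per field: int, string, float
np.savetxt(buf, cb, fmt='%d,%s,%.4f')
print(buf.getvalue())
```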

But if all you want to do is select lines with 'cb' and write them back out, it might be easiest to use plain Python file read/write.

Use readlines to fetch all lines:

In [11]: txt1 = txt.splitlines()                                                
In [12]: txt1                                                                   
Out[12]: 
['col1, col2, col3',
 '-3,cb,7.9356',
 '-3,mo,13.0402',
 '-2,cb,12.9229',
 '-2,cb,13.2063',
 '-1,va,12.7535']

and use a simple string find to identify lines that contain 'cb':

In [13]: txt1[1].find('cb')                                                     
Out[13]: 3
In [14]: txt1[0].find('cb')                                                     
Out[14]: -1
In [15]: [row for row in txt1 if row.find('cb')!=-1]                            
Out[15]: ['-3,cb,7.9356', '-2,cb,12.9229', '-2,cb,13.2063']

Finally, write these lines back out to a file.
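Putting the whole loop together (here on the in-memory sample; with real files, replace the triple-quoted string with open('input.txt').read() and print with open('out.txt', 'w').writelines). Note this version splits each row and compares the second field exactly, rather than using a substring find, so a value like 'mocb' would not slip through:

```python
lines = """col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535""".splitlines()

# keep the header plus every row whose second field is exactly 'cb'
kept = [lines[0]] + [row for row in lines[1:]
                     if row.split(',')[1].strip() == 'cb']
print('\n'.join(kept))
```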

Upvotes: 1
