Reputation: 345
I have an ASCII file with 3 columns and multiple rows, as follows:
col1, col2, col3
-3,cb,7.9356
-3,mo,13.0402
-2,cb,12.9229
-2,cb,13.2063
-1,va,12.7535
I want to extract an ASCII file with the same 3 columns but only the rows where col2="cb". I am trying to do this in Python using numpy:
import numpy as np
file="input.txt"
outfile="out.txt"
col1 = np.loadtxt(file, delimiter=',', skiprows=1, usecols=0)
col2 = np.loadtxt(file, delimiter=',', skiprows=1, usecols=1, dtype=str)
col3 = np.loadtxt(file, delimiter=',', skiprows=1, usecols=2)
When I use np.core.defchararray.find, I get only indexes:
cb_rows=np.core.defchararray.find(col2.astype(str),'cb')
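These are the positions of 'cb' within each string (or -1 where it is absent), not row numbers; for the sample above:
print(cb_rows)  # [ 0 -1  0  0 -1]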
How could I save the new file with the 3 columns, but only the rows where col2 = "cb" (in fact, the rows flagged by cb_rows)?
Upvotes: 0
Views: 700
Reputation: 663
If you want to achieve this by using numpy only, then try this:
import numpy as np
a = np.loadtxt('input.txt', dtype=str, delimiter=',')
msk = a[:, 1] == 'cb'
b = a[msk, :]
np.savetxt('out.txt', b, delimiter=',', fmt='%s', header='col1,col2,col3')
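One caveat: np.savetxt prefixes the header with the comment character '# ' by default, so the first line of out.txt will read "# col1,col2,col3". Pass comments='' if you want the header written verbatim:
np.savetxt('out.txt', b, delimiter=',', fmt='%s', header='col1,col2,col3', comments='')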
Edit
Since the second column of b is now redundant, you can avoid writing it by:
np.savetxt('out.txt', b[:, 0:3:2], delimiter=',', fmt='%s', header='col1,col3')
As is obvious, a and b are string arrays. If, however, you need to do some further processing of the first and last columns using their native types, then use the following variables:
cb_col1 = b[:, 0].astype(int)
cb_col3 = b[:, 2].astype(float)
Upvotes: 1
Reputation: 5037
Seems like there's something wrong with this row. There are 4 columns instead of 3:
-2,cb,1,13.2063
You can use pandas for this:
import pandas as pd
df = pd.read_csv('input.csv', skipinitialspace=True)  # the sample header has spaces after the commas
df = df[df['col2'] == 'cb']
print(df)
Output:
col1 col2 col3
0 -3 cb 7.9356
2 -2 cb 12.9229
3 -2 cb 13.2063
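Since the goal is a file rather than printed output, the filtered frame can be written back out (out.txt is an assumption, matching the question):
df.to_csv('out.txt', index=False)  # write without the DataFrame index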
To enforce string contains (substring) behavior, use this:
import pandas as pd
df = pd.read_csv('input.csv', skipinitialspace=True)
df = df[df['col2'].str.contains('cb')]
print(df)
If the input.csv looks like this:
col1,col2,col3
-3,cb_01,7.9356
-3,mocb,13.0402
-2,1cb2,12.9229
-2,cb,13.2063
-1,va,12.7535
Output:
col1 col2 col3
0 -3 cb_01 7.9356
1 -3 mocb 13.0402
2 -2 1cb2 12.9229
3 -2 cb 13.2063
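Note that str.contains treats its pattern as a regular expression by default; for a plain substring test you can pass regex=False (it makes no difference for 'cb', which has no regex metacharacters):
df = df[df['col2'].str.contains('cb', regex=False)]  # literal substring match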
Upvotes: 1
Reputation: 231355
Your sample, corrected for the extra column in line 5:
In [4]: txt = """col1, col2, col3
...: -3,cb,7.9356
...: -3,mo,13.0402
...: -2,cb,12.9229
...: -2,cb,13.2063
...: -1,va,12.7535"""
Using the full power of genfromtxt, I can load it as a structured array:
In [5]: data = np.genfromtxt(txt.splitlines(), delimiter=',', dtype=None, names=True, encoding=None)
In [6]: data
Out[6]:
array([(-3, 'cb', 7.9356), (-3, 'mo', 13.0402), (-2, 'cb', 12.9229),
(-2, 'cb', 13.2063), (-1, 'va', 12.7535)],
dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])
The 'col2' field is already string dtype, so it can be tested against a string like 'cb':
In [7]: data['col2']
Out[7]: array(['cb', 'mo', 'cb', 'cb', 'va'], dtype='<U2')
In [8]: idx=data['col2']=='cb'
In [9]: idx
Out[9]: array([ True, False, True, True, False])
We can easily select just the 'cb' records from data:
In [10]: data[idx]
Out[10]:
array([(-3, 'cb', 7.9356), (-2, 'cb', 12.9229), (-2, 'cb', 13.2063)],
dtype=[('col1', '<i8'), ('col2', '<U2'), ('col3', '<f8')])
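If you do want to write the selected records back out with numpy, savetxt also accepts a structured array together with a per-field format string (a sketch, reusing the file name from the question; comments='' keeps the '#' prefix off the header):
np.savetxt('out.txt', data[idx], fmt='%d,%s,%.4f', header='col1,col2,col3', comments='')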
But if all you want to do is select lines with 'cb' and write them back out, it might be easiest to use plain Python file read/write.
Use readlines to fetch all lines:
In [11]: txt1 = txt.splitlines()
In [12]: txt1
Out[12]:
['col1, col2, col3',
'-3,cb,7.9356',
'-3,mo,13.0402',
'-2,cb,12.9229',
'-2,cb,13.2063',
'-1,va,12.7535']
and use a simple string find to identify lines that contain 'cb':
In [13]: txt1[1].find('cb')
Out[13]: 3
In [14]: txt1[0].find('cb')
Out[14]: -1
In [15]: [row for row in txt1 if row.find('cb')!=-1]
Out[15]: ['-3,cb,7.9356', '-2,cb,12.9229', '-2,cb,13.2063']
Finally, write these lines back out to a file.
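Put together, a minimal sketch using the file names from the question (the same substring test as above, which also skips the header line since it does not contain 'cb'):
with open('input.txt') as fin, open('out.txt', 'w') as fout:
    for line in fin:
        if line.find('cb') != -1:  # keep lines containing 'cb' anywhere
            fout.write(line)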
Upvotes: 1