Reputation: 87
I'm new in Python and I have a problem of removing unwanted rows in a csv file. For instance I have 3 columns and a lot of rows:
A B C
hi 1.0 5
hello 2.0 6
ok 3.0 7
I loaded the data using numpy (instead of csv)
import numpy as np
a= np.loadtxt('data.csv' , delimiter= ',' , skiprows= 1)
I want to introduce a range for the 2nd column
b=np.arange(0, 2.1,0.1)
I don't have any idea how I should use that piece of code.
What I want as a final output is the following:
A B C
hi 1.0 5
hello 2.0 6
The last row would be remove since I chose a range for the 2nd column up to 2.0 only. I don't have any idea how can I accomplish this.
Upvotes: 1
Views: 258
Reputation: 87074
You can just use the csv module. N.B the following expects that the CSV fields are comma separated, not tab separated (as your sample suggests).
import csv
with open('data.csv') as data:
reader = csv.reader(data) # or csv.reader(data, delimiter='\t') for tabs
field_names = next(reader)
filtered_rows = [row for row in reader if 0 <= float(row[1]) <= 2.0]
>>> field_names
['A', 'B', 'C']
>>> filtered_rows
[['hi', '1.0', '5'], ['hello', '2.0', '6']]
>>> filtered_rows.insert(0, field_names)
>>> filtered_rows
[['A', 'B', 'C'], ['hi', '1.0', '5'], ['hello', '2.0', '6']]
If you require that values be exact tenths within the required range, then you can do this:
import csv
import numpy as np
allowed_values = np.arange(0, 2.1, 0.1)
with open('data.csv') as data:
reader = csv.reader(data)
field_names = next(reader)
filtered_rows = [row for row in reader if float(row[1]) in allowed_values]
Edit after updated requirements
With extra constraints on column "C", e.g. value must be >= 6.
import csv
import numpy as np
allowed_values_B = np.arange(0, 2.1, 0.1)
def accept_row(row):
return (float(row[1]) in allowed_values_B) and (int(row[2]) >= 6)
with open('data.csv') as data:
reader = csv.reader(data)
field_names = next(reader)
filtered_rows = [row for row in reader if accept_row(row)]
>>> filtered_rows
[['hello', '2.0', '6']]
Upvotes: 1
Reputation: 21873
Try with Pandas:
import pandas as pd
a = pd.read_csv('data.csv', index_col=0) # column A will be the index.
a
B C
A
hi 1 5
hello 2 6
ok 3 7
For every value of B up to 2 :
a[a.B <= 2]
B C
A
hi 1 5
hello 2 6
Details :
a.B
A
hi 1
hello 2
ok 3
Name: B, dtype: float64
a.B <= 2
A
hi True
hello True
ok False
Name: B, dtype: bool
Upvotes: 2
Reputation: 29081
You can do it using logical indexing
index = (x[:, 1] <= 2.0)
Then
x = x[index]
selecting only the lines that satisfy this condition
Upvotes: 1