Reputation: 31
I am fairly new to python. I am working with a csv file that contains the following
Name, Description, Price
Comp1, comp desc, 60
Comp1, comp desc, 100
Comp1, comp desc, 250
Comp1, comp desc, 90
Comp1, comp desc, 125
Comp1, comp desc, 600
Comp1, comp desc, 395
...
I am trying to read the file, if there are any duplicates (based on name), remove those. Then if the price < 50 remove those. Then add a new column called 'Quality'. then if the price is < 80 set the "quality for that row to "Average" , if the price is < 125 set price to "High".
I am trying to use Pandas for this (I tried using the csv module but could not figure it out after 3 days...)
so far I am able to add a new column. I am not sure how to delete a row if it meets a condition (surprisingly, I could not find anything that matched my criteria in the posts)
I tried the following but I get an error saying "the truth value of a series is ambiguous...
code:
#python 3.6
import pandas as pd
csv_input = pd.read_csv("d:\python programs\chairs.csv")
csv_input["Quality"] = ""
csv_input.loc[csv_input["Price"] > 50 and csv_input["Price"] < 125, "Quality"] = "Average"
csv_input.loc[csv_input["Price"] > 125, "Quality" = "High"]
csv_input.to_csv("d:\python programs\output2.csv", index=False)
print (csv_input.iloc[:, 2])
the last line is just a check that I am working with any help is appreciated.
Thanks
Update: I was able to update the code to get it to do some of what I want, however I still need to understand how to delete rows where price >= 50 (example).
Here is the updated code:
import pandas as pd
csv_input = pd.read_csv("d:\python programs\chairs.csv")
csv_input["Quality"] = ""
'This line set the Quality to average if the value is ,= 125'
csv_input.loc[csv_input["Price"] <= 125, "Quality"] = "Average"
'This line sets the Quality to high if the Price is above 125'
csv_input.loc[csv_input["Price"] > 125, "Quality"] = "High"
'This line writes to an output file'
csv_input.to_csv("d:\python programs\output2.csv", index=False)
print (csv_input.iloc[:, 2])
Thanks again for all the comments and the help. I appreciate it.
Upvotes: 2
Views: 688
Reputation: 470
First let's recreate your DF
import pandas as pd
import numpy as np
# Create DF
csv_input = pd.DataFrame({'Name': ['Comp1', 'Comp2', 'Comp3', 'Comp1', 'Comp2', 'Comp3', 'Comp1', 'Comp2', 'Comp3'],
'Description': ['comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc'],
'Price': [60, 30, 180, 100, 150, 160, 170, 200, 80]})
Now here's code you need
# Read File
csv_input = pd.read_csv("d:\python programs\chairs.csv")
# Remove duplicates
csv_input = csv_input.drop_duplicates(subset=['Name'])
# Remove records with price under 50
csv_input = csv_input[csv_input['Price'] >= 50]
# Create Quality Column
csv_input["Quality"] = np.where(csv_input['Price']>125, 'High', 'Average')
# Save to CSV
csv_input.to_csv("d:\python programs\output2.csv", index=False)
Upvotes: 0
Reputation: 3790
Use &
operator instead of and
in indexes, also wrap conditions up in parenthesesif they are separated by some logical operator:
import pandas as pd
csv_input = pd.read_csv("d:\python programs\chairs.csv")
csv_input["Quality"] = ""
csv_input.loc[(csv_input["Price"] > 50) & (csv_input["Price"] < 125), "Quality"] = "Average"
# in the next string '[' just moved to the right place
csv_input.loc[csv_input["Price"] > 125, "Quality"] = "High"
csv_input
Out:
Name Description Price Quality
0 Comp1 comp desc 60 Average
1 Comp1 comp desc 100 Average
2 Comp1 comp desc 250 High
3 Comp1 comp desc 90 Average
4 Comp1 comp desc 125
5 Comp1 comp desc 600 High
6 Comp1 comp desc 395 High
Explanation:
csv_input["Price"] < 125
returns following pd.Series with boolean values:
0 True
1 True
2 False
3 True
4 False
5 False
6 False
In python, and
kwd can't be overloaded, but the &
or other (bitwise) logical operators operator can.
So, (csv_input["Price"] > 50) & (csv_input["Price"] < 125)
returns pd.Series again (more info about indexing):
0 True
1 True
2 False
3 True
4 False
5 False
6 False
This series will be used as boolean mask for the .loc[]
. (Looks the same as the former series because (csv_input["Price"] > 50)
all True
)
If you wish to drop duplicated lines, try pd.DataFrame.drop_duplicates().
Upvotes: 1
Reputation: 917
You can remove the rows by doing,
csv_input = csv_input.loc[csv_input['Price'] > 50]
Upvotes: 0