PythonNewb
PythonNewb

Reputation: 31

Need to remove rows and update other rows when conditions are met in a csv using pandas

I am fairly new to python. I am working with a csv file that contains the following

Name, Description, Price
Comp1, comp desc, 60
Comp1, comp desc, 100
Comp1, comp desc, 250
Comp1, comp desc, 90
Comp1, comp desc, 125
Comp1, comp desc, 600
Comp1, comp desc, 395

...

I am trying to read the file, if there are any duplicates (based on name), remove those. Then if the price < 50 remove those. Then add a new column called 'Quality'. then if the price is < 80 set the "quality for that row to "Average" , if the price is < 125 set price to "High".

I am trying to use Pandas for this (I tried using the csv module but could not figure it out after 3 days...)

so far I am able to add a new column. I am not sure how to delete a row if it meets a condition (surprisingly, I could not find anything that matched my criteria in the posts)

I tried the following but I get an error saying "the truth value of a series is ambiguous...

code:

    #python 3.6
    import pandas as pd

    csv_input = pd.read_csv("d:\python programs\chairs.csv")
    csv_input["Quality"] = ""

    csv_input.loc[csv_input["Price"] > 50 and csv_input["Price"] < 125, "Quality"] = "Average"
    csv_input.loc[csv_input["Price"] > 125, "Quality" = "High"]

    csv_input.to_csv("d:\python programs\output2.csv", index=False)

    print (csv_input.iloc[:, 2])

the last line is just a check that I am working with any help is appreciated.

Thanks

Update: I was able to update the code to get it to do some of what I want, however I still need to understand how to delete rows where price >= 50 (example).

Here is the updated code:

    import pandas as pd

    csv_input = pd.read_csv("d:\python programs\chairs.csv")
    csv_input["Quality"] = ""

    'This line set the Quality to average if the value is ,= 125'

    csv_input.loc[csv_input["Price"] <= 125, "Quality"] = "Average" 

    'This line sets the Quality to high if the Price is above 125'

    csv_input.loc[csv_input["Price"] > 125, "Quality"] = "High"

    'This line writes to an output file'

    csv_input.to_csv("d:\python programs\output2.csv", index=False)



    print (csv_input.iloc[:, 2])

Thanks again for all the comments and the help. I appreciate it.

Upvotes: 2

Views: 688

Answers (3)

Sergei
Sergei

Reputation: 470

First let's recreate your DF

import pandas as pd
import numpy as np

# Create DF
csv_input = pd.DataFrame({'Name': ['Comp1', 'Comp2', 'Comp3', 'Comp1', 'Comp2', 'Comp3', 'Comp1', 'Comp2', 'Comp3'],
                          'Description': ['comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc', 'comp desc'],
                          'Price': [60, 30, 180, 100, 150, 160, 170, 200, 80]})

Now here's code you need

# Read File
csv_input = pd.read_csv("d:\python programs\chairs.csv")

# Remove duplicates
csv_input = csv_input.drop_duplicates(subset=['Name'])

# Remove records with price under 50
csv_input =  csv_input[csv_input['Price'] >= 50]

# Create Quality Column
csv_input["Quality"] = np.where(csv_input['Price']>125, 'High', 'Average')

# Save to CSV    
csv_input.to_csv("d:\python programs\output2.csv", index=False)

Upvotes: 0

Mikhail Stepanov
Mikhail Stepanov

Reputation: 3790

Use & operator instead of and in indexes, also wrap conditions up in parenthesesif they are separated by some logical operator:

import pandas as pd

csv_input = pd.read_csv("d:\python programs\chairs.csv")
csv_input["Quality"] = ""

csv_input.loc[(csv_input["Price"] > 50) & (csv_input["Price"] < 125), "Quality"] = "Average"
# in the next string '[' just moved to the right place
csv_input.loc[csv_input["Price"] > 125, "Quality"] = "High"

csv_input
Out:
       Name     Description     Price   Quality
0     Comp1       comp desc     60      Average
1     Comp1       comp desc     100     Average
2     Comp1       comp desc     250        High
3     Comp1       comp desc     90      Average
4     Comp1       comp desc     125     
5     Comp1       comp desc     600        High
6     Comp1       comp desc     395        High

Explanation: csv_input["Price"] < 125 returns following pd.Series with boolean values:

0     True
1     True
2    False
3     True
4    False
5    False
6    False

In python, and kwd can't be overloaded, but the & or other (bitwise) logical operators operator can. So, (csv_input["Price"] > 50) & (csv_input["Price"] < 125) returns pd.Series again (more info about indexing):

0     True
1     True
2    False
3     True
4    False
5    False
6    False

This series will be used as boolean mask for the .loc[]. (Looks the same as the former series because (csv_input["Price"] > 50) all True)

If you wish to drop duplicated lines, try pd.DataFrame.drop_duplicates().

Upvotes: 1

Raunaq Jain
Raunaq Jain

Reputation: 917

You can remove the rows by doing,

csv_input = csv_input.loc[csv_input['Price'] > 50]

Upvotes: 0

Related Questions