Reputation: 9348
An Excel spreadsheet looked like below.
With Pandas, I want to remove the columns “Project C” and “Project E”, and all rows with value “XX” in Columns “Project A” and “Project D”.
import pandas as pd
import numpy as np
work_file = "C:\\test.xlsx"
df = pd.read_excel(work_file, sheetname = "Sheet1", index_col = 0)
column_list_to_remove = [“Project C", “Project E"]
results1 = df.drop(column_list_to_remove, axis=1)
writer = pd.ExcelWriter("C:\\test new.xlsx")
pd.formats.format.header_style = None
results1.to_excel(writer,'Sheet1')
writer.save()
Above work well in removing the columns.
I try added on this line to remove rows with value “XX” in “Project A”, and failed. Without hope to remove those in “Project D”.
results1 = results1[results1."Project A" != "XX"]
# SyntaxError: invalid syntax
How can I remove those rows? Thank you.
Upvotes: 1
Views: 1590
Reputation: 3462
If your column names did not have whitespace, you could have done something along the lines of:
results1 = results1.query("ColumnName != 'XX'")
or alternatively inplace:
results1.query("ColumnName != 'XX'", inplace=True)
Alas, they do. Now you can either rename those columns, as whitespace is evil incarnate, or do:
results1 = results1[results1['Project A'] != 'XX']
Basically: create a mask for the index and slice the dataframe.
Renaming your columns without the whitespace can be done in a oneliner as well:
results1.columns = [''.join(x.strip().split()) for x in results1.columns]
Then you can proceed with something like:
column_list = ['ProjectA', 'ProjectD']
for col in column_list:
results1.query(col+"!='XX'", inplace=True)
Upvotes: 1