Reputation: 401
Okay, so I have a DataFrame with a two-level index (a MultiIndex), and I am trying to filter its rows and keep only the index columns of the original DataFrame in the new, filtered DataFrame.
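I created the DataFrame from a CSV file (find the CSV file here) with:
import pandas as pd

census_df = pd.read_csv("census.csv", index_col=["STNAME", "CTYNAME"])
# sort_index returns a new DataFrame, so the result has to be assigned back
census_df = census_df.sort_index(ascending=True)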
Then, I applied some filtering to the DataFrame, which works perfectly fine, and I get the desired rows. The code I used is shown below:
def my_answer():
    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    return pd.DataFrame(new_df.iloc[:, -1])

my_answer()
Here is the problem:
The above code returns a DataFrame with one data column (the last one, selected by iloc[:, -1]) in addition to the two index columns. What I want is just the two index columns. So the final answer should be a DataFrame containing only "STNAME" and "CTYNAME", with 5 rows in it.
Upvotes: 5
Views: 26444
Reputation: 637
Using a list comprehension:
def my_answer():
    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    # each element of a MultiIndex is a (STNAME, CTYNAME) tuple
    return pd.DataFrame([new_df.index[x] for x in range(len(new_df))])

my_answer()
Output:
              0                  1
0          Iowa  Washington County
1     Minnesota  Washington County
2  Pennsylvania  Washington County
3  Rhode Island  Washington County
4     Wisconsin  Washington County
Upvotes: 1
Reputation: 862641
You can convert the index to a DataFrame:
def my_answer():
    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    # the index is a list of (STNAME, CTYNAME) tuples, one per row
    return pd.DataFrame(new_df.index.tolist(), columns=['STNAME', 'CTYNAME'])

print(my_answer())
         STNAME            CTYNAME
0          Iowa  Washington County
1     Minnesota  Washington County
2  Pennsylvania  Washington County
3  Rhode Island  Washington County
4     Wisconsin  Washington County
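Two other ways to get the same result, shown here as a minimal sketch rather than as a drop-in replacement. The small `new_df` below is a hypothetical stand-in for the filtered census data (the numbers are made up), and it assumes a reasonably recent pandas where `MultiIndex.to_frame` accepts `index=False`.

```python
import pandas as pd

# Hypothetical stand-in for the filtered census data (values are made up).
new_df = pd.DataFrame(
    {
        "STNAME": ["Iowa", "Minnesota", "Wisconsin"],
        "CTYNAME": ["Washington County"] * 3,
        "POPESTIMATE2015": [22000, 251000, 133000],
    }
).set_index(["STNAME", "CTYNAME"])

# Option 1: turn the MultiIndex itself into a DataFrame with a fresh integer index.
index_only = new_df.index.to_frame(index=False)

# Option 2: move the index levels back into columns, then keep only those columns.
index_only_alt = new_df.reset_index()[["STNAME", "CTYNAME"]]

print(index_only)
```

Both variants take the column names from the index level names, so they do not need to be typed out again.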
If you want the output as a MultiIndex, you need MultiIndex.remove_unused_levels, which is available in pandas 0.20.0+:
def my_answer():
    mask1 = census_df["REGION"].between(1, 2)
    mask2 = census_df.index.get_level_values("CTYNAME").str.startswith("Washington")
    mask3 = (census_df["POPESTIMATE2015"] > census_df["POPESTIMATE2014"])
    new_df = census_df[mask1 & mask2 & mask3]
    # drop level values that no longer appear in the filtered index
    return new_df.index.remove_unused_levels()

print(my_answer())
MultiIndex(levels=[['Iowa', 'Minnesota', 'Pennsylvania', 'Rhode Island', 'Wisconsin'],
                   ['Washington County']],
           labels=[[0, 1, 2, 3, 4], [0, 0, 0, 0, 0]],
           names=['STNAME', 'CTYNAME'])
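The reason for calling remove_unused_levels is that a filtered MultiIndex still carries the level values of the original index. The sketch below illustrates this on hypothetical toy data (three made-up rows, not the real census file).

```python
import pandas as pd

# Hypothetical toy data, standing in for the census DataFrame.
df = pd.DataFrame(
    {"POP": [10, 20, 30]},
    index=pd.MultiIndex.from_tuples(
        [
            ("Iowa", "Adair County"),
            ("Iowa", "Washington County"),
            ("Ohio", "Washington County"),
        ],
        names=["STNAME", "CTYNAME"],
    ),
)

# Same kind of filter as in the question: keep counties starting with "Washington".
mask = df.index.get_level_values("CTYNAME").str.startswith("Washington")
filtered = df[mask]

# The CTYNAME level still lists 'Adair County', although no row uses it any more.
print(list(filtered.index.levels[1]))

# After trimming, only the values that actually occur remain in the levels.
trimmed = filtered.index.remove_unused_levels()
print(list(trimmed.levels[1]))
```

The rows themselves are the same before and after trimming; only the stored level values change.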
Upvotes: 0