Reputation: 932
In my previous question I asked how to drop particular rows in Pandas.
With help, I was able to drop rows before 1980. The 'Season' column (which holds the years) is in this format:
2018-19
2017-18
This list would go till 1960.
In the earlier question (linked), @jezrael gave a solution that helped me drop rows before 1980.
I have a list (called `list`) that has 30 DataFrames. I want to iterate through the 30 DataFrames and drop all rows before 1980 for every df. For instance, one of the items in `list` is `BOS`. If `BOS['Season']` has:
2018-19
2017-18
1959-1960
I should get
2018-19
2017-18
Results like this should hold for every DataFrame in `list`.
This is what I tried, but I got errors or nothing would happen:
for df in list:
    df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]
What's wrong with my code? I am new to Python. I thought that by assigning `df` to the change, it would apply to every `df` in the `list`.
Thanks!
UPDATE:
I have a list named `league`. This list has 30 DataFrames. I looked at jezrael's and IMCoin's solutions. Both of them worked, but here is my requirement:
after dropping rows before 1980 for every DataFrame, I want to be able to use that DataFrame directly, and not through the list. Here is what I mean.
#Before Appending to the list
BOS = pd.read_csv(dir+"Boston_Sheet")
# I have 30 different cities, each having a CSV file and making each city have
# their own DataFrame. So Boston as `BOS`, Chicago as `CHI` and like that 30 cities.
All of those 30 city DataFrames have already been appended to the list `league`.
After filtering the city DataFrames by the conditions above, I want to be able to call `BOS` or `CHI` with the filtered data. This is just so that it will be easier for me to develop other functions down the line.
Upvotes: 2
Views: 701
Reputation: 863216
You need to create a new list of filtered DataFrames or reassign the old one.
Note: don't use `list` as a variable name, because it shadows the Python builtin `list`.
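A quick illustration of why shadowing the builtin `list` causes trouble (a hypothetical snippet, not from the question's code):

```python
# Rebinding the name `list` hides the builtin constructor.
list = ['a', 'b']
try:
    list('abc')            # this now calls the list OBJECT, not the builtin
except TypeError as e:
    print(e)               # 'list' object is not callable
del list                   # remove the shadowing name; the builtin is reachable again
```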
L = [df[df['Season'].str.split('-').str[0].astype(int) > 1980] for df in L]
Loop version:
output = []
for df in L:
    df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]
    output.append(df)
If you need to extract only the first 4-digit integer:
L = [df, df]
L = [df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
     for df in L]
print (L)
[ Season
0 2018-19
1 2017-18, Season
0 2018-19
1 2017-18]
EDIT:
If the data have the same structure, I suggest creating one big DataFrame with a new column to distinguish the cities:
import glob
import os
import pandas as pd

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp).assign(City=os.path.basename(fp).split('.')[0]) for fp in files]
df = pd.concat(dfs, ignore_index=True)
print (df)
Season City
0 2018-19 Boston_Sheet
1 This Boston_Sheet
2 list would go Boston_Sheet
3 till 1960 Boston_Sheet
4 2018-19 Chicago_Sheet
5 2017-18 Chicago_Sheet
6 This Chicago_Sheet
df1 = df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
print (df1)
Season City
0 2018-19 Boston_Sheet
4 2018-19 Chicago_Sheet
5 2017-18 Chicago_Sheet
df2 = df1[df1['City'] == 'Boston_Sheet']
print (df2)
Season City
0 2018-19 Boston_Sheet
df3 = df1[df1['City'] == 'Chicago_Sheet']
print (df3)
Season City
4 2018-19 Chicago_Sheet
5 2017-18 Chicago_Sheet
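As an alternative to writing one boolean filter per city, `groupby` can split the combined frame into per-city DataFrames in a single pass (a sketch using sample data shaped like `df1` above):

```python
import pandas as pd

# Sample data shaped like the filtered combined frame df1.
df1 = pd.DataFrame({'Season': ['2018-19', '2018-19', '2017-18'],
                    'City': ['Boston_Sheet', 'Chicago_Sheet', 'Chicago_Sheet']})

# groupby yields (city, sub-DataFrame) pairs; collect them into a dict.
by_city = {city: group for city, group in df1.groupby('City')}
print(by_city['Chicago_Sheet'])
```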
If you need each DataFrame separately, that is possible with a dictionary of DataFrames:
import glob
import os
import pandas as pd

files = glob.glob('files/*.csv')
dfs_dict = {os.path.basename(fp).split('.')[0]: pd.read_csv(fp) for fp in files}
print (dfs_dict)
print (dfs_dict['Boston_Sheet'])
Season
0 2018-19
1 This
2 list would go
3 till 1960
print (dfs_dict['Chicago_Sheet'])
Season
0 2018-19
1 2017-18
2 This
Then process them in a dictionary comprehension:
dfs_dict = {k: v[v['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
            for k, v in dfs_dict.items()}
print (dfs_dict)
{'Boston_Sheet': Season
0 2018-19, 'Chicago_Sheet': Season
0 2018-19
1 2017-18}
print (dfs_dict['Boston_Sheet'])
Season
0 2018-19
print (dfs_dict['Chicago_Sheet'])
Season
0 2018-19
1 2017-18
Upvotes: 1
Reputation: 3306
If you want to modify the list in-place:
for index in range(len(df_list)):
    df_list[index] = df_list[index].loc[df_list[index]['Season'].str.split('-').str[0].astype(int) > 1980]
When you loop through the list object itself, the loop variable is just a name bound to each element in turn; reassigning it rebinds that name, and the result is discarded on the next iteration.
If you loop over the indices instead and access your data through the index, you modify the list itself, not the temporary name created by `for some_copy_item in df_list`.
Minimal example:
arr = [1, 2, 3, 4, 5]
print(arr)  # [1, 2, 3, 4, 5]

for number in arr:
    number += 1
print(arr)  # [1, 2, 3, 4, 5]

for idx in range(len(arr)):
    arr[idx] += 1
print(arr)  # [2, 3, 4, 5, 6]
Upvotes: 1