abhivemp

Reputation: 932

Iterate through a list of dataframes to drop particular rows Pandas

In my previous question, I asked how to drop particular rows in Pandas.

With help, I was able to drop the rows before 1980. The 'Season' column (which holds the years) is in this format:

 2018-19
 2017-18
 This
 list would go
 till 1960

In the earlier question (linked above), @jezrael gave a solution that helped me drop the rows before 1980.

I have a list (called list) that holds 30 DataFrames. I want to iterate through the 30 DataFrames and drop all rows before 1980 in every one. For instance, one of the items in list is BOS. If BOS['Season'] has:

 2018-19
 2017-18
 1959-1960

I should get

2018-19
2017-18

and the result should look like this for every DataFrame in list.

This is what I tried, but I either got errors or nothing would happen:

for df in list:
   df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]

What's wrong with my code? I am new to Python. I thought that by reassigning df, the change would be applied to every df in the list.

Thanks!

UPDATE: I have a list named league that holds the 30 DataFrames. I looked at jezrael's and IMCoins's solutions, and both of them worked. But here is my requirement.

After dropping the rows before 1980 in every DataFrame, I want to be able to use each DataFrame directly, not through the list. Here is what I mean:


#Before Appending to the list
BOS = pd.read_csv(dir+"Boston_Sheet")
# I have 30 different cities, each having a CSV file and making each city have 
# their own DataFrame. So Boston as `BOS`, Chicago as `CHI` and like that 30 cities. 

All 30 city DataFrames have already been appended to the list league. After filtering each city DataFrame with the condition above, I want to be able to call BOS or CHI and get the filtered data. This is just so that it will be easier for me to develop other functions down the line.
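To illustrate with made-up sample data (the helper name drop_before_1980 is just for this sketch), this is roughly what I am after:

```python
import pandas as pd

# Made-up sample data standing in for the real CSV files
BOS = pd.DataFrame({'Season': ['2018-19', '2017-18', '1959-60']})
CHI = pd.DataFrame({'Season': ['2018-19', '1975-76']})

def drop_before_1980(df):
    # Keep rows whose starting year is greater than 1980
    return df[df['Season'].str.split('-').str[0].astype(int) > 1980]

# Rebind the city names so BOS/CHI refer to the filtered data
BOS = drop_before_1980(BOS)
CHI = drop_before_1980(CHI)
print(BOS['Season'].tolist())  # ['2018-19', '2017-18']
```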

Upvotes: 2

Views: 701

Answers (2)

jezrael

Reputation: 863216

You need to create a new list of filtered DataFrames, or reassign the old one:

Notice: don't use the variable name list, because it shadows the built-in list type.

L = [df[df['Season'].str.split('-').str[0].astype(int) > 1980] for df in L]

Loop version:

output = []
for df in L:
   df = df[df['Season'].str.split('-').str[0].astype(int) > 1980]
   output.append(df)

If you need to extract only the first 4-digit integer:

L = [df, df]
L = [df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980] 
          for df in L]

print (L)
[    Season
0  2018-19
1  2017-18,     Season
0  2018-19
1  2017-18]

EDIT:

If the data have the same structure, I suggest creating one big DataFrame with a new column to distinguish the cities:

import glob
import os

files = glob.glob('files/*.csv')
dfs = [pd.read_csv(fp).assign(City=os.path.basename(fp).split('.')[0]) for fp in files]
df = pd.concat(dfs, ignore_index=True)
print (df)
          Season           City
0        2018-19   Boston_Sheet
1           This   Boston_Sheet
2  list would go   Boston_Sheet
3      till 1960   Boston_Sheet
4        2018-19  Chicago_Sheet
5        2017-18  Chicago_Sheet
6           This  Chicago_Sheet

df1 = df[df['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980]
print (df1)
     Season           City
0   2018-19   Boston_Sheet
4   2018-19  Chicago_Sheet
5   2017-18  Chicago_Sheet

df2 = df1[df1['City'] == 'Boston_Sheet']
print (df2)
    Season          City
0  2018-19  Boston_Sheet

df3 = df1[df1['City'] == 'Chicago_Sheet']
print (df3)
     Season           City
4   2018-19  Chicago_Sheet
5   2017-18  Chicago_Sheet
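If you want all the per-city frames at once instead of selecting from df1 city by city, groupby can build a dictionary in one step (a sketch with the filtered data hard-coded):

```python
import pandas as pd

# Hard-coded stand-in for the filtered frame df1 above
df1 = pd.DataFrame({
    'Season': ['2018-19', '2018-19', '2017-18'],
    'City': ['Boston_Sheet', 'Chicago_Sheet', 'Chicago_Sheet'],
})

# One DataFrame per city, keyed by the City column
per_city = {city: g.drop(columns='City').reset_index(drop=True)
            for city, g in df1.groupby('City')}
print(per_city['Chicago_Sheet'])
```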

If you need each DataFrame separately, that is possible with a dictionary of DataFrames:

import glob
import os

files = glob.glob('files/*.csv')
dfs_dict = {os.path.basename(fp).split('.')[0]: pd.read_csv(fp) for fp in files}

print (dfs_dict)

print (dfs_dict['Boston_Sheet'])
          Season
0        2018-19
1           This
2  list would go
3      till 1960

print (dfs_dict['Chicago_Sheet'])
     Season
0   2018-19
1   2017-18
2      This

Then process it with a dictionary comprehension:

dfs_dict = {k: v[v['Season'].str.extract(r'(\d{4})', expand=False).astype(float) > 1980] 
                 for k, v in dfs_dict.items()}
print (dfs_dict)
{'Boston_Sheet':     Season
0  2018-19, 'Chicago_Sheet':      Season
0   2018-19
1   2017-18}

print (dfs_dict['Boston_Sheet'])
    Season
0  2018-19

print (dfs_dict['Chicago_Sheet'])
     Season
0   2018-19
1   2017-18

Upvotes: 1

IMCoins

Reputation: 3306

If you want to modify the list in place:

for index in range(len(df_list)):
    df_list[index] = df_list[index].loc[df_list[index]['Season'].str.split('-').str[0].astype(int) > 1980]

When you loop over the list itself with for df in df_list, the loop variable df is just a name bound to each element in turn; reassigning df rebinds that name and leaves the list unchanged.

When you loop over the indices with range(len(df_list)) and assign through the index, you modify the list itself rather than the throwaway name created by for some_copy_item in df_list.


Minimal example:

    arr = [1, 2, 3, 4, 5]
    print(arr) # [1, 2, 3, 4, 5]

    for number in arr:
        number += 1
    print(arr) # [1, 2, 3, 4, 5]

    for idx in range(len(arr)):
        arr[idx] += 1
    print(arr) # [2, 3, 4, 5, 6]
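The same in-place update is often written with enumerate, which gives you the index and the value together:

```python
arr = [1, 2, 3, 4, 5]
for idx, value in enumerate(arr):
    arr[idx] = value + 1   # assigning through the index mutates the list
print(arr)  # [2, 3, 4, 5, 6]
```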

Upvotes: 1
