Iwan Thomas

Reputation: 313

Looping through a list of pandas dataframes

Two quick pandas questions for you.

  1. I have a list of dataframes I would like to apply a filter to.

    countries = [us, uk, france]
    for df in countries:
        df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')] 
    

    When I run this, the df's don't change afterwards. Why is that? If I loop through the dataframes to create a new column, as below, this works fine, and changes each df in the list.

    for df in countries:
        df["Continent"] = "Europe"
    
  2. As a follow-up question, I noticed something strange when I created a list of dataframes for different countries. I defined the list, then applied transformations to each df in the list. After transforming these dfs, I looked at the list again. I was surprised to see that the list still pointed to the unchanged dataframes, and I had to redefine the list to see the updated results. Could anybody shed any light on why that is?

Upvotes: 12

Views: 20666

Answers (2)

Janet Lu

Reputation: 21

For why

for df in countries:
    df["Continent"] = "Europe"

modifies countries, while

for df in countries:
    df = df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')] 

does not, see "why should I make a copy of a data frame in pandas". Inside the loop, df is a reference to a DataFrame in countries, not a copy of it, so modifying the object through that reference also changes the DataFrame in the list. Adding a new column is such a modification. Taking a subset is not: the expression on the right builds a new DataFrame, and the assignment only makes the name df refer to that new object. The DataFrame in countries is left untouched.
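To make the difference concrete, here is a minimal sketch. The two small DataFrames and their dates are made up for illustration; only the column name "Send Date" and the loop bodies come from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the question's DataFrames.
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
france = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-20", "2016-12-05"])})
countries = [uk, france]

for df in countries:
    # Mutation: adds a column to the very object the list refers to.
    df["Continent"] = "Europe"

for df in countries:
    # Rebinding: builds a new filtered DataFrame and points the name `df` at it.
    # The list element is not touched, and the new object is discarded.
    df = df[(df["Send Date"] > "2016-11-01") & (df["Send Date"] < "2016-11-30")]

columns_after = [list(frame.columns) for frame in countries]
rows_after = [len(frame) for frame in countries]
```

After both loops, every DataFrame in the list has the new column but still has all of its rows.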

Upvotes: 1

miradulo

Reputation: 29690

Taking a look at this answer, you can see that for df in countries: is equivalent to something like

for idx in range(len(countries)):
    df = countries[idx]
    # do something with df

which won't modify anything in your list: assigning to df inside the loop only rebinds that local name. It is also generally bad practice to modify a list while iterating over it in a loop like this.
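For completeness, if an explicit loop is really wanted, writing the result back to the list slot does update the list. This is only a sketch with made-up data; replacing an element at an existing index keeps the list length fixed, so it avoids the usual problems of adding or removing items during iteration.

```python
import pandas as pd

# Hypothetical stand-ins for the question's DataFrames.
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
france = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-20", "2016-12-05"])})
countries = [uk, france]

for idx, df in enumerate(countries):
    in_november = (df["Send Date"] > "2016-11-01") & (df["Send Date"] < "2016-11-30")
    # Assigning to countries[idx] changes the list slot, not just the name `df`.
    countries[idx] = df[in_november]

rows_in_list = [len(frame) for frame in countries]
rows_in_uk = len(uk)  # the variable `uk` still refers to the unfiltered frame
```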

A better approach would be a list comprehension. You can try something like

 countries = [us, uk, france]
 countries = [df[(df["Send Date"] > '2016-11-01') & (df["Send Date"] < '2016-11-30')]
              for df in countries] 

Notice that with a list comprehension like this, we aren't actually modifying the original list - instead we are creating a new list, and assigning it to the variable which held our original list.
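This also bears on the second question. A small sketch with made-up data suggests what is going on: the comprehension produces new DataFrame objects in a new list, while names such as us keep referring to the original, unfiltered frames.

```python
import pandas as pd

# Hypothetical stand-ins for the question's DataFrames.
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10", "2016-11-20"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-05", "2016-12-05"])})

countries = [us, uk]
original_list = countries  # second name for the same list object

countries = [df[(df["Send Date"] > "2016-11-01") & (df["Send Date"] < "2016-11-30")]
             for df in countries]

is_new_list = countries is not original_list   # a new list was created
old_list_still_holds_us = original_list[0] is us
rows_filtered = [len(frame) for frame in countries]
rows_us = len(us)                              # `us` itself is unchanged
```

The same logic applies in reverse: if a transformation returns a new DataFrame and you assign it to us, a list built earlier still holds the old object until the list is rebuilt.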

Also, you might consider placing all of your data in a single DataFrame with an additional country column or something along those lines - Python-level loops are generally slower and a list of DataFrames is often much less convenient to work with than a single DataFrame, which can fully leverage the vectorized pandas methods.
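One way that could look, as a rough sketch: the column name "Country", the dictionary frames, and the sample data are assumptions for illustration, not something from the question.

```python
import pandas as pd

# Hypothetical stand-ins for the question's DataFrames.
us = pd.DataFrame({"Send Date": pd.to_datetime(["2016-10-15", "2016-11-10"])})
uk = pd.DataFrame({"Send Date": pd.to_datetime(["2016-11-20", "2016-12-05"])})
frames = {"us": us, "uk": uk}

# Tag each frame with its country, then stack everything into one DataFrame.
combined = pd.concat(
    [frame.assign(Country=name) for name, frame in frames.items()],
    ignore_index=True,
)

# One vectorized filter now covers every country at once.
in_november = (combined["Send Date"] > "2016-11-01") & (combined["Send Date"] < "2016-11-30")
november = combined[in_november]

countries_kept = november["Country"].tolist()
```

Per-country results are then available with november[november["Country"] == "uk"] or a groupby on "Country".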

Upvotes: 11
