Reputation: 11

How to iterate over very big dataframes in python?

My dataframe contains almost 800k rows, so iterating over it with standard methods is practically impossible. I searched a bit and found the iterrows() method, but I couldn't understand how to use it. Basically, this is my code. Can you help me update it to use iterrows()?

for i in range(len(x["Value"])):
    if x.loc[i ,"PP_Name"] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'] :
        x.loc[i,"Santral_Type"] = "HES"
    elif x.loc[i ,"PP_Name"] in ['BND','BND2','TFB','TFB3','TFB4','KNT']:
        x.loc[i,"Santral_Type"] = "TERMIK"
    elif x.loc[i ,"PP_Name"] in ['BRS','ÇKL','DPZ']:
        x.loc[i,"Santral_Type"] = "RES"
    else:
        x.loc[i, "Santral_Type"] = "SOLAR"

Upvotes: 0

Answers (5)

Umar.H

Reputation: 23099

I would strongly advise against using iterrows and for loops when vectorised solutions are available that take advantage of the pandas API.

This is your code adapted to use NumPy, which should run much faster than your current method.

import numpy as np
col = 'PP_Name'

conditions = [
    x[col].isin(['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']),
    x[col].isin(["BND", "BND2", "TFB", "TFB3", "TFB4", "KNT"]),
    x[col].isin(["BRS", "ÇKL", "DPZ"]),
]

outcomes = ["HES", "TERMIK", "RES"]

x["Santral_Type"] = np.select(conditions, outcomes, default='SOLAR')
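To see how np.select resolves the conditions, here is a minimal sketch on a toy frame. The frame `demo`, the shortened name lists and the name "XYZ" are made up for illustration. Conditions are checked in order, the first one that is true picks the outcome, and rows matching none fall back to `default`.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; "XYZ" is a made-up name that matches no list.
demo = pd.DataFrame({"PP_Name": ["ARK", "BND", "BRS", "XYZ"]})

conditions = [
    demo["PP_Name"].isin(["ARK", "DGD"]),   # shortened HES list
    demo["PP_Name"].isin(["BND", "BND2"]),  # shortened TERMIK list
    demo["PP_Name"].isin(["BRS", "DPZ"]),   # shortened RES list
]
outcomes = ["HES", "TERMIK", "RES"]

# One vectorised pass over the whole column, no Python-level loop.
demo["Santral_Type"] = np.select(conditions, outcomes, default="SOLAR")
print(demo)
```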

Upvotes: 1

giulio

Reputation: 157

The simplest method could be .values, for example:

def f(x0, ..., xn):
    return 'hello or some complicated operation'
df['newColumn'] = [f(r[0], r[1], ..., r[n]) for r in df.values]

The drawbacks of this method, as far as I know, are that you cannot refer to column values by name, only by position, and that the index of the df is not available. The advantage is that it is generally faster than the iterrows, itertuples and apply methods.
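A concrete version of the pattern above might look like this. It is only a sketch: the toy frame, the `Value` contents and the function `classify` are assumptions for illustration, not part of the question. Selecting the columns explicitly keeps the positions `r[0]`, `r[1]` predictable, and itertuples is shown as an alternative when names matter more than speed.

```python
import pandas as pd

# Toy frame for illustration; .values drops the column names, so order matters.
df = pd.DataFrame({"PP_Name": ["ARK", "BND", "XYZ"], "Value": [1.0, 2.0, 3.0]})

def classify(name, value):
    # Hypothetical per-row function; replace with the real logic.
    return "HES" if name == "ARK" else "OTHER"

# Pick the columns explicitly so r[0] is PP_Name and r[1] is Value.
df["newColumn"] = [classify(r[0], r[1]) for r in df[["PP_Name", "Value"]].values]

# itertuples keeps the column names as attributes (and can include the index).
same = [classify(t.PP_Name, t.Value) for t in df.itertuples(index=False)]
```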

hope it helps

Upvotes: 0

Quang Hoang

Reputation: 150785

How to iterate over very big dataframes -- In general, you don't. You should apply some sort of vectorized operation to the column as a whole. For example, your case can be handled with map and fillna:

map_dict = {
    'HES' : ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay'],
    'TERMIK' : ['BND','BND2','TFB','TFB3','TFB4','KNT'],
    'RES' : ['BRS','ÇKL','DPZ']
}

inv_map_dict = {x:k for k,v in map_dict.items() for x in v}

df['Santral_Type'] = df['PP_Name'].map(inv_map_dict).fillna('SOLAR')
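The two steps can be looked at separately. In this small sketch (toy frame, shortened lists and the made-up name "XYZ" are assumptions for illustration), map leaves NaN wherever the name is not a key of the inverted dictionary, and fillna then supplies the default:

```python
import pandas as pd

# Toy data; "XYZ" is a made-up name that appears in none of the lists.
df = pd.DataFrame({"PP_Name": ["ARK", "BND", "BRS", "XYZ"]})

# Shortened lists, same structure as map_dict above.
map_dict = {"HES": ["ARK", "DGD"], "TERMIK": ["BND"], "RES": ["BRS"]}
inv_map_dict = {name: kind for kind, names in map_dict.items() for name in names}

mapped = df["PP_Name"].map(inv_map_dict)      # names missing from the dict become NaN
df["Santral_Type"] = mapped.fillna("SOLAR")   # NaN rows get the default value
print(df)
```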

Upvotes: 3

mcsoini

Reputation: 6642

It is not advised to iterate through DataFrames for these things. Here is one possible way of doing it, applied to all rows of the DataFrame x at once:

# Default value
x["Santral_Type"] = "SOLAR"

x.loc[x.PP_Name.isin(['BRS','ÇKL','DPZ']), 'Santral_Type'] = "RES"
x.loc[x.PP_Name.isin(['BND','BND2','TFB','TFB3','TFB4','KNT']), 'Santral_Type'] = "TERMIK"
hes_list = ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']
x.loc[x.PP_Name.isin(hes_list), 'Santral_Type'] = "HES"

Note that 800k rows cannot be considered a large table when using standard pandas methods.
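If you want to try this at the size from the question, a rough sketch is to build a synthetic 800k-row frame and run the same assignments on it. The random data, the shortened lists and the name "XYZ" are assumptions for illustration; timing depends on the machine, so none is quoted here.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
names = ["ARK", "BND", "BRS", "XYZ"]  # "XYZ" is a made-up name for the default case
x = pd.DataFrame({"PP_Name": rng.choice(names, size=800_000)})

# Default value first, then overwrite the rows matching each (shortened) list.
x["Santral_Type"] = "SOLAR"
x.loc[x.PP_Name.isin(["BRS"]), "Santral_Type"] = "RES"
x.loc[x.PP_Name.isin(["BND"]), "Santral_Type"] = "TERMIK"
x.loc[x.PP_Name.isin(["ARK"]), "Santral_Type"] = "HES"

print(x["Santral_Type"].value_counts())
```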

Upvotes: 1

Nauman Naeem

Reputation: 408

According to the documentation, df.iterrows() yields (index, Series) tuples.
You can use it like this:

for idx, row in df.iterrows():
    if row['PP_Name'] in ['ARK','DGD','KND','SRG','HCO','MNG','KSK','KOP','KVB','Yamanli','ÇBS','Dogancay']:
        # assign to this row only; df['Santral_Type'] = "HES" would overwrite the whole column
        df.loc[idx, 'Santral_Type'] = "HES"
    # and so on

By the way, I must say, using iterrows is going to be very slow, and looking at your sample code it's clear you can use simple pandas selection techniques to do this without explicit loops.
Better to do it as @mcsoini suggested

Upvotes: 0
