Kristjan

Reputation: 429

Python processing CSV file really slow

So I am trying to open a CSV file, read its fields, fix some other fields based on them, and then save the data back to CSV. My problem is that the CSV file has 2 million rows. What would be the best way to speed this up?
The CSV file consists of

ID; DATE(d/m/y); SPECIAL_ID; DAY; MONTH; YEAR

I am counting how often a row with the same date appears in my records and then updating SPECIAL_ID based on that count.

Based on my previous research I decided to use pandas. I'll be processing even bigger sets of data in the future (1-2 GB) - this one is around 119 MB - so it's crucial that I find a good, fast solution.

My code goes as follows:

df = pd.read_csv(filename, delimiter=';')
df_fixed = pd.DataFrame(columns=stolpci)  # each processed row of df is appended to df_fixed
d = 32   # day (1-31)
m = 13   # month (1-12)
y = 100  # year, stored as a small offset (a raw year like 2016 would overflow this axis)
list_dates = np.zeros((d, m, y))  # 3-dimensional counter array, indexed [day, month, year]
for index, row in df.iterrows():
    # PROCESSING LOGIC GOES HERE
    # IT CONSISTS OF A FEW IF STATEMENTS
    list_dates[row.DAY, row.MONTH, row.YEAR] += 1
    row['special_id'] = list_dates[row.DAY, row.MONTH, row.YEAR]
    df_fixed = df_fixed.append(row.to_frame().T)

df_fixed.to_csv(filename_fixed, sep=';', encoding='utf-8')

I added a print for every thousand rows processed. At first, my script needs 3 seconds for 1000 rows, but the longer it runs the slower it gets: at row 43,000 it needs 29 seconds, and so on...
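That slowdown pattern is the classic sign of quadratic cost: `DataFrame.append` copies the entire frame on every call, so each append gets more expensive than the last. A minimal sketch of the usual fix - collect plain dicts and build the frame once at the end - using made-up sample rows:

```python
import pandas as pd

# Hypothetical sample rows standing in for the real CSV data.
rows = [
    {"day": 13, "month": 1, "year": 2016},
    {"day": 13, "month": 1, "year": 2016},
    {"day": 25, "month": 12, "year": 2013},
]

counts = {}  # (day, month, year) -> running count of rows seen with that date
fixed = []
for r in rows:
    key = (r["day"], r["month"], r["year"])
    counts[key] = counts.get(key, 0) + 1
    fixed.append({**r, "special_id": counts[key]})

# One DataFrame allocation instead of one per row.
df_fixed = pd.DataFrame(fixed)
print(df_fixed["special_id"].tolist())  # [1, 2, 1]
```

A plain dict keyed on the date tuple also avoids sizing a 3-D array up front.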

Thanks for all future help :)

EDIT: I am adding additional information about my CSV and expected output

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505__-;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505__-;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505__-;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505__-;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505__-;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505__-;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505__-;F;6;1001001;1001001_F_6;16;8;2011

I have to replace the - in the SPECIAL_ID field with a proper number. For example, for the row with ID = 2 the SPECIAL_ID will be 13012016505001 (__- got replaced by 001); if someone else in the CSV shares the same DAY, MONTH and YEAR, the __- will be replaced by 002, and so on... So the expected output for the rows above would be

ID;SPECIAL_ID;sex;age;zone;key;day;month;year
2;13012016505001;F;1;1001001;1001001_F_1;13;1;2016
3;25122013505001;F;4;1001001;1001001_F_4;25;12;2013
4;24022012505001;F;5;1001001;1001001_F_5;24;2;2012
5;09032012505001;F;5;1001001;1001001_F_5;9;3;2012
6;21082011505001;F;6;1001001;1001001_F_6;21;8;2011
7;16082011505001;F;6;1001001;1001001_F_6;16;8;2011
8;21102011505002;F;6;1001001;1001001_F_6;16;8;2011
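The replacement rule itself is just string surgery: drop the trailing three characters (`__-`) and append a zero-padded per-date counter. A sketch in plain Python (the function name is mine):

```python
def fix_special_id(special_id: str, n: int) -> str:
    """Replace the trailing '__-' placeholder with a 3-digit counter."""
    return special_id[:-3] + str(n).zfill(3)

print(fix_special_id("13012016505__-", 1))  # 13012016505001
print(fix_special_id("21102011505__-", 2))  # 21102011505002
```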

EDIT: I changed my code to something like this: I fill a list of dicts with the data, then convert that list to a dataframe and save it as CSV. This takes around 30 minutes to complete.

list_popravljeni = []
df = pd.read_csv(filename, delimiter=';')
df_dates = df.groupby(by=['dan_roj', 'mesec_roj', 'leto_roj']).size().reset_index()
for _, date_row in df_dates.iterrows():
    df_candidates = df.loc[(df['dan_roj'] == date_row['dan_roj']) &
                           (df['mesec_roj'] == date_row['mesec_roj']) &
                           (df['leto_roj'] == date_row['leto_roj'])]
    for n, (_, row) in enumerate(df_candidates.iterrows(), start=1):  # n = per-date counter
        vrstica = {}
        vrstica['ID'] = row['identifikator']
        vrstica['SPECIAL_ID'] = row['emso'][0:11] + str(n).zfill(3)
        vrstica['day'] = row['day']
        vrstica['MONTH'] = row['MONTH']
        vrstica['YEAR'] = row['YEAR']
        list_popravljeni.append(vrstica)
df_popravljeni = pd.DataFrame(list_popravljeni, columns=list_popravljeni[0].keys())

Upvotes: 2

Views: 620

Answers (1)

roganjosh

Reputation: 13185

I think this gives what you're looking for and avoids looping. It could potentially be more efficient (I wasn't able to find a way to avoid creating counts), but it should be much faster than your current approach.

df['counts'] = df.groupby(['year', 'month', 'day'])['SPECIAL_ID'].cumcount() + 1
df['counts'] = df['counts'].astype(str)
df['counts'] = df['counts'].str.zfill(3)
df['SPECIAL_ID'] = df['SPECIAL_ID'].str.slice(0, -3).str.cat(df['counts'])
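To illustrate, here are those four lines run end-to-end on a few of the sample rows from the question (the DataFrame construction is mine, with only the columns the snippet touches):

```python
import pandas as pd

# A subset of the question's sample rows.
df = pd.DataFrame({
    "SPECIAL_ID": ["13012016505__-", "16082011505__-", "21102011505__-"],
    "day":   [13, 16, 16],
    "month": [1, 8, 8],
    "year":  [2016, 2011, 2011],
})

# Per-date running count, zero-padded to 3 digits, spliced over the '__-'.
df["counts"] = df.groupby(["year", "month", "day"])["SPECIAL_ID"].cumcount() + 1
df["counts"] = df["counts"].astype(str).str.zfill(3)
df["SPECIAL_ID"] = df["SPECIAL_ID"].str.slice(0, -3).str.cat(df["counts"])

print(df["SPECIAL_ID"].tolist())
# ['13012016505001', '16082011505001', '21102011505002']
```

`cumcount` numbers rows within each (year, month, day) group starting from 0, so the two rows sharing 16/8/2011 get 001 and 002.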

I added a fake record at the end to confirm it does increment properly:

     SPECIAL_ID sex age     zone          key day month  year counts
0  13012016505001   F   1  1001001  1001001_F_1  13     1  2016    001
1  25122013505001   F   4  1001001  1001001_F_4  25    12  2013    001
2  24022012505001   F   5  1001001  1001001_F_5  24     2  2012    001
3  09032012505001   F   5  1001001  1001001_F_5   9     3  2012    001
4  21082011505001   F   6  1001001  1001001_F_6  21     8  2011    001
5  16082011505001   F   6  1001001  1001001_F_6  16     8  2011    001
6  21102011505002   F   6  1001001  1001001_F_6  16     8  2011    002
7  21102012505003   F   6  1001001  1001001_F_6  16     8  2011    003

If you want to get rid of counts, you just need:

df.drop('counts', inplace=True, axis=1)

Upvotes: 2
