David Hancock

Reputation: 1071

Memory error when running medium sized merge function ipython notebook jupyter

I'm trying to merge around 100 dataframes in a for loop and am getting a memory error. I'm using an IPython/Jupyter notebook.

Here is a sample of the data:

    timestamp   Namecoin_cap
0   2013-04-28  5969081
1   2013-04-29  7006114
2   2013-04-30  7049003

Each frame is around 1,000 rows long.

Here's the error in detail; I've also included my merge function.

My system is currently using 64% of its memory.

I have searched for similar issues, but most seem to involve very large arrays (>1 GB); my data is relatively small in comparison.

EDIT: Something is suspicious. I wrote a beta program earlier to test with 4 dataframes; I just exported its result through pickle and it is 500 KB. When I try to export the 100-frame version I get a memory error, but it still writes a file that is 2 GB. So I suspect that somewhere down the line my code has created some kind of loop that produces a very large file. NB: the 100 frames are stored in a dictionary.

EDIT2: I have exported the script to a .py file:

http://pastebin.com/GqaHr7xc

This is a .xlsx that contains the asset names the script needs.

The script fetches data for various assets, cleans it up, and saves each asset as a dataframe in a dictionary.

I'd really appreciate it if someone could have a look and see if there's anything immediately wrong. Otherwise, please advise on what tests I can run.

EDIT3: I'm finding it really hard to understand why this is happening. The code worked fine in the beta; all I have done now is add more assets.

EDIT4: I ran a size check on the object (dict of dfs) and it is 1,066,793 bytes.
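
The thread does not say how this figure was obtained. As a hedged sketch of one way to run such a check: sys.getsizeof on a dict counts only the container, not the dataframes inside it, so summing DataFrame.memory_usage(deep=True) per frame gives a fuller picture. The helper name total_frame_bytes and the toy data are assumptions for illustration, not part of the original script.

```python
import sys

import pandas as pd


def total_frame_bytes(frames):
    """Sum the deep memory usage (in bytes) of every DataFrame in a dict."""
    return sum(int(df.memory_usage(deep=True).sum()) for df in frames.values())


# Toy stand-in for the real dict of dataframes (values taken from the sample above)
frames = {
    "Namecoin": pd.DataFrame(
        {
            "timestamp": ["2013-04-28", "2013-04-29", "2013-04-30"],
            "Namecoin_cap": [5969081, 7006114, 7049003],
        }
    )
}

container_bytes = sys.getsizeof(frames)   # size of the dict object only
content_bytes = total_frame_bytes(frames)  # size of the data it refers to
row_counts = {name: len(df) for name, df in frames.items()}

print(container_bytes, content_bytes, row_counts)
```

Running the same row count on the merged frame after each merge step would show whether its length stays near 1,000 rows or starts to grow.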

EDIT5: The problem is in the merge function for coin 37

for coin in coins[:37]:
    data2['merged'] = pd.merge(left=data2['merged'],right=data2[coin], left_on='timestamp', right_on='timestamp', how='left')

This is when the error occurs. 'for coin in coins[:36]:' doesn't produce an error, however 'for coin in coins[:37]:' does. Any ideas?

EDIT6: The 36th element is 'Syscoin'. I did coins.remove('Syscoin'), but the memory problem still occurs. So it seems to be a problem with the 36th element in coins, no matter which coin it is.

EDIT7: goCards' suggestions seemed to work; however, the next part of the code:

merged = data2['merged']
merged['Total_MC'] = merged.drop('timestamp',axis=1).sum(axis=1)

produces a memory error. I'm stumped.

Upvotes: 1

Views: 11541

Answers (2)

VishnuVardhanA

Reputation: 607

The same issue happened to me: the notebook raised "MemoryError:" while executing pandas code. I had also printed quite a lot of observations to the screen before the issue happened.

Reinstalling Anaconda didn't help. I later realized that I was working with IPython notebook instead of Jupyter notebook. I switched to Jupyter notebook and everything worked fine!

Upvotes: 0

goCards

Reputation: 1436

In regard to storage, I would recommend using a simple CSV over pickle. CSV is a more generic format. It is human readable, and you can check your data quality more easily, especially as your data grows.

file_template_string='%s.csv'
for eachKey in dfDict:
    filename = file_template_string%(eachKey)
    dfDict[eachKey].to_csv(filename)

If you need to date the files you can also put a timestamp in the filename.

import time
from datetime import datetime
cur = time.time()
cur = datetime.fromtimestamp(cur)
file_template_string = "%s_{0}.csv".format(cur.strftime("%m_%d_%Y_%H_%M_%S"))

There are some obvious errors in your code.

for coin in coins:  # lines 61 and 89 currently read
for coin in data:   # should be

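df = data2['Namecoin']  # line 87: start from a single frame
keys = list(data2.keys())  # list() so that .remove() also works on Python 3
keys.remove('Namecoin')  # don't merge the starting frame with itself
for coin in keys:
    # left-join each remaining frame onto the running result
    df = pd.merge(left=df, right=data2[coin], on='timestamp', how='left')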
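
One more thing that might be worth checking (this is a guess; nothing in the thread confirms it): if any frame has repeated timestamp values, each merge multiplies the matching rows, and over dozens of merges the result can grow very quickly. Below is a minimal sketch, using made-up toy frames and hypothetical helper names (duplicate_key_counts, merge_on_key), that counts repeated keys and drops them before merging. Keeping only the first row per timestamp is an assumption; you may need a different rule for your data.

```python
import pandas as pd


def duplicate_key_counts(frames, key="timestamp"):
    """Return {name: number of repeated key values} for frames that have any."""
    counts = {}
    for name, df in frames.items():
        n_dupes = int(df[key].duplicated().sum())
        if n_dupes:
            counts[name] = n_dupes
    return counts


def merge_on_key(frames, key="timestamp"):
    """Left-merge all frames on `key`, keeping the first row per key value."""
    names = list(frames)
    merged = frames[names[0]].drop_duplicates(subset=key)
    for name in names[1:]:
        right = frames[name].drop_duplicates(subset=key)
        merged = merged.merge(right, on=key, how="left")
    return merged


# Toy data, invented for illustration: frame "B" repeats one timestamp
frames = {
    "A": pd.DataFrame({"timestamp": ["2013-04-28", "2013-04-29"],
                       "A_cap": [1, 2]}),
    "B": pd.DataFrame({"timestamp": ["2013-04-28", "2013-04-28", "2013-04-29"],
                       "B_cap": [10, 10, 20]}),
}

dupes = duplicate_key_counts(frames)
merged = merge_on_key(frames)
print(dupes)
print(merged.shape)
```

If duplicate_key_counts comes back empty for your real frames, then repeated timestamps are not the cause and the problem lies elsewhere.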

Upvotes: 2
