YasserKhalil

Reputation: 9538

Sum series in pandas dataframe in two ways

I am practicing pandas and learning how to sum a Series in a DataFrame. I found two ways to do it: one collects the partial sums in a list, the other accumulates them in a plain variable. The code looks like this:

import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)

result = []
for chunk in df:
    result.append(sum(chunk['duration']))
print(sum(result))

The code works and the output is 118439.

When I use a variable instead of a list, like this:

import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)

total = 0
for chunk in df:
    total += sum(chunk['duration'])
print(total)

The output is the same, 118439.

** The problem appears when I try both approaches in one script, like this:

import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)

result = []
for chunk in df:
    result.append(sum(chunk['duration']))
print(sum(result))

total = 0
for chunk in df:
    total += sum(chunk['duration'])
print(total)

I get the correct result for the first approach but 0 for the total variable. Any ideas why I get 0 when combining the two approaches?

** Remove the space in the URL before running the code.

Upvotes: 0

Views: 91

Answers (2)

YasserKhalil

Reputation: 9538

In general, this works when both results are computed inside the same loop:

import pandas as pd

df = pd.read_csv('http://bit .ly/imdbratings', chunksize=250)

result = []
total = 0

for chunk in df:
    print(sum(chunk['duration']), len(chunk['duration']))
    result.append(sum(chunk['duration']))
    total += sum(chunk['duration'])

print('-'*10)
print(sum(result))
print(total)

Upvotes: 0

mosc9575

Reputation: 6337

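When chunksize is set, pd.read_csv returns an iterator over the chunks (a TextFileReader) rather than a DataFrame, and the first loop consumes it completely. This can be surprising, but an exhausted iterator yields nothing when you loop over it a second time.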

If you print a number for each iteration, you can see that you never enter the body of the second loop, which is why your total variable stays at zero.

Try to run:

import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)
print('Test Chunk 1')
for i, chunk in enumerate(df):
    print(i)
    #print(chunk)
print('Test Chunk 2')
for i, chunk in enumerate(df):
    print(i)
    #print(chunk)

>>> Test Chunk 1
0
1
2
3
Test Chunk 2
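
The output above can be reproduced offline. The following is a minimal sketch, assuming a small in-memory CSV as a stand-in for the IMDb file (the CSV_TEXT data and the chunk size of 2 are made up for illustration). It shows the reader being consumed by the first pass, and one possible fix: create a fresh reader before looping a second time.

```python
import io
import pandas as pd

# Hypothetical stand-in data so the example runs without network access.
CSV_TEXT = "title,duration\nA,100\nB,120\nC,90\nD,110\nE,95\n"

# With chunksize set, read_csv returns an iterator of DataFrames.
reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)

# The first pass consumes every chunk.
first_pass = [int(chunk['duration'].sum()) for chunk in reader]

# The second pass over the same reader finds nothing left to iterate.
second_pass = [int(chunk['duration'].sum()) for chunk in reader]

print(first_pass)
print(second_pass)

# Creating a new reader makes a second loop work again.
reader = pd.read_csv(io.StringIO(CSV_TEXT), chunksize=2)
total = sum(int(chunk['duration'].sum()) for chunk in reader)
print(total)
```

Applied to the code in the question, this would mean calling pd.read_csv(url, chunksize=250) again before the second for loop.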

EDIT

I found a way to read in the data and store it as a single pandas DataFrame, thanks to this post.

One line is added to concatenate all the chunks. Then you don't have to loop over the chunks anymore.

import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)
df = pd.concat(df, ignore_index=True) # added line here

total = df['duration'].sum()
print(total)
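
Note that pd.concat loads every chunk into memory at once. If the file is too large for that, the chunked loop can be kept instead. Below is a rough sketch, again using a made-up in-memory CSV; the helper name total_duration is hypothetical. It reads only the duration column via usecols and uses the Series.sum method rather than the built-in sum.

```python
import io
import pandas as pd

# Hypothetical stand-in data so the example runs without network access.
CSV_TEXT = "title,duration\nA,100\nB,120\nC,90\nD,110\nE,95\n"

def total_duration(source, chunksize=2):
    """Sum the 'duration' column chunk by chunk (illustrative helper)."""
    total = 0
    # usecols limits parsing to the one column that is needed.
    for chunk in pd.read_csv(source, usecols=['duration'], chunksize=chunksize):
        total += int(chunk['duration'].sum())
    return total

chunked_total = total_duration(io.StringIO(CSV_TEXT))

# For comparison: read everything at once and sum the column.
full_total = int(pd.read_csv(io.StringIO(CSV_TEXT))['duration'].sum())

print(chunked_total, full_total)
```

Because the reader is created inside the function, each call starts from a fresh iterator, so calling it twice gives the same result both times.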

Upvotes: 1
