Reputation: 9538
I am learning pandas and how to sum a Series (column) in a DataFrame. I found two ways to do it: accumulating into a list, or into a plain variable. The code looks like this:
import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url , chunksize=250)
result = []
for chunk in df:
    result.append(sum(chunk['duration']))
print(sum(result))
The code works well and the output is 118439.
When I use a variable instead of a list, like this:
import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url , chunksize=250)
total = 0
for chunk in df:
    total += sum(chunk['duration'])
print(total)
The output is the same: 118439.
** The problem appears when I try both approaches in one script, like this:
import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)
result = []
for chunk in df:
    result.append(sum(chunk['duration']))
print(sum(result))
total = 0
for chunk in df:
    total += sum(chunk['duration'])
print(total)
I get the expected result for the first approach, but 0 for the total variable. Any idea why I get 0 when combining the two approaches?
** Remove the space in the url.
Upvotes: 0
Views: 91
Reputation: 9538
In general, this works fine when both sums are computed in the same loop:
import pandas as pd
df = pd.read_csv('http://bit .ly/imdbratings', chunksize=250)
result = []
total = 0
for chunk in df:
    print(sum(chunk['duration']), len(chunk['duration']))
    result.append(sum(chunk['duration']))
    total += sum(chunk['duration'])
print('-'*10)
print(sum(result))
print(total)
Upvotes: 0
Reputation: 6337
The reader is consumed while you iterate over it. With chunksize set, pd.read_csv returns an iterator rather than a DataFrame, and that iterator can be traversed only once. I found this behavior unexpected.
If you print a counter in each iteration, you can see that you never enter the second code block, and that is why your total variable stays at zero.
Try to run:
import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)
print('Test Chunk 1')
for i, chunk in enumerate(df):
    print(i)
    #print(chunk)
print('Test Chunk 2')
for i, chunk in enumerate(df):
    print(i)
    #print(chunk)
>>> Test Chunk 1
0
1
2
3
Test Chunk 2
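The same effect can be reproduced without pandas or a network connection. The following is only a sketch: read_in_chunks is a hypothetical stand-in for the chunked reader, and the durations are made-up sample values, not the IMDb data. It shows that a second pass over an exhausted iterator yields nothing, and that creating a fresh iterator makes a second pass possible again.

```python
def read_in_chunks(values, chunksize):
    """Hypothetical stand-in for a chunked reader: yields lists of up to `chunksize` items."""
    for start in range(0, len(values), chunksize):
        yield values[start:start + chunksize]


durations = [142, 175, 200, 152, 154, 96, 161]  # made-up sample data

reader = read_in_chunks(durations, chunksize=3)

first_pass = [sum(chunk) for chunk in reader]   # consumes the iterator
second_pass = [sum(chunk) for chunk in reader]  # nothing is left to iterate over

print(first_pass)
print(second_pass)

# Creating a fresh reader allows a second full pass.
reader = read_in_chunks(durations, chunksize=3)
total = 0
for chunk in reader:
    total += sum(chunk)
print(total)
```

Applied to the question, this would mean calling pd.read_csv(url, chunksize=250) again before the second loop, or computing both results inside a single loop.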
EDIT
I found a way to read the data and store it as a single pandas DataFrame, thanks to this post.
One line is added to concatenate all the chunks. Then you don't have to loop over the chunks anymore.
import pandas as pd
url = 'http://bit .ly/imdbratings'
df = pd.read_csv(url, chunksize=250)
df = pd.concat(df, ignore_index=True) # added line here
total = df['duration'].sum()
print(total)
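For comparison, here is a small sketch that runs offline and puts the concatenation approach next to a chunk-by-chunk sum. The CSV text is made-up sample data (only the column name duration comes from the question), and open_reader is a hypothetical helper that builds a fresh reader on each call, so neither approach sees an exhausted iterator.

```python
import io

import pandas as pd

# Made-up sample data standing in for the IMDb file.
csv_text = "title,duration\nA,142\nB,175\nC,200\nD,152\nE,154\n"


def open_reader():
    """Hypothetical helper: returns a fresh chunked reader on every call."""
    return pd.read_csv(io.StringIO(csv_text), chunksize=2)


# Option 1: sum chunk by chunk in a single pass, keeping memory use low.
chunked_total = sum(chunk['duration'].sum() for chunk in open_reader())

# Option 2: concatenate all chunks into one DataFrame, as in the EDIT above.
full_df = pd.concat(open_reader(), ignore_index=True)
concat_total = full_df['duration'].sum()

print(chunked_total, concat_total)
```

Concatenating is convenient when the whole file fits in memory. If it does not, the single-pass sum is probably the safer choice, since only one chunk is held at a time.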
Upvotes: 1