Reputation: 25
I read data from MariaDB using pandas read_sql. What I am trying to do is give each data frame an index that continues across chunks: the index of the first chunk should run 1 to 2000, and the second chunk's 2001 to 4000. For that I pass a list of indexes while creating the df, but after the first iteration the values come out as NaN.
query = 'select * from big_2L_csv'
chunk_size = 2000
start_point = 1
end_point = chunk_size + 1

for chunk in pd.read_sql(query, conn, chunksize=chunk_size):
    indexes = list(range(start_point, end_point))
    file_data_df = pd.DataFrame(chunk, index=indexes)
    print(start_point, end_point)
    start_point += chunk_size
    end_point += chunk_size
    print(file_data_df.head(5))
    print(file_data_df.tail(5))
Output:
PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1 1000.0 0.0 2000.0 1.0
2 1000.0 1000.0 5000.0 0.0
3 1100.0 1069.0 1000.0 0.0
4 9000.0 689.0 679.0 0.0
5 1000.0 1000.0 800.0 0.0
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
1996 2000.0 1000.0 2000.0 1000.0 0.0
1997 504.0 2770.0 613.0 14.0 0.0
1998 4000.0 4500.0 4000.0 4200.0 0.0
1999 0.0 8000.0 0.0 0.0 0.0
2000 NaN NaN NaN NaN NaN
PAY_AMT3 PAY_AMT4 PAY_AMT5 PAY_AMT6 default payment next month
2001 NaN NaN NaN NaN NaN
2002 NaN NaN NaN NaN NaN
2003 NaN NaN NaN NaN NaN
2004 NaN NaN NaN NaN NaN
2005 NaN NaN NaN NaN NaN
There is still more data in the table, but the output shows NaN.
Upvotes: 0
Views: 812
Reputation: 107737
According to the docs of the pandas.DataFrame constructor, when the data object already carries an index, the constructor keeps that index and treats the index argument as a selection of labels from it. In other words, index specifies which existing labels to pull from the object, not which new labels to assign.

Because each chunk is already a DataFrame, and because you did not specify an index_col in pandas.read_sql, each chunk keeps the default RangeIndex running from 0 to len(chunk) - 1 (i.e., chunk size minus one). So by passing an index argument with labels beyond 0-1999, you get back all-NaN rows, since those labels do not exist in the chunk.
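A minimal sketch of that selection behavior, using a tiny made-up frame rather than your table:

import pandas as pd

chunk = pd.DataFrame({'PAY_AMT4': [1000.0, 2000.0, 500.0]})   # default RangeIndex 0, 1, 2
# Labels 3-5 do not exist in chunk, so the constructor returns all-NaN rows
print(pd.DataFrame(chunk, index=[3, 4, 5]))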
To fix, avoid the DataFrame constructor and simply re-assign the index. Below, enumerate keeps the start and end counters:
query = 'select * from big_2L_csv'
chunk_size = 2000
dfs = []

for i, chunk in enumerate(pd.read_sql(query, conn, chunksize=chunk_size)):
    start = chunk_size * i
    end = chunk_size * i + len(chunk)
    print(start, end)

    chunk.index = pd.RangeIndex(start, end)
    dfs.append(chunk)    # APPEND CHUNKS TO LIST OF DATA FRAMES

    print(chunk.head(5))
    print(chunk.tail(5))
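If you want the index to start at 1 as in your question, shift both bounds by one:

chunk.index = pd.RangeIndex(start + 1, end + 1)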
By the way, iteratively re-assigning indexes is not needed if you build the master frame with pandas.concat and ignore_index=True, which creates a fresh RangeIndex across the length of all the data frames:
master_df = pd.concat(dfs, ignore_index=True)
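For example, with two toy frames (made-up data), the combined frame is renumbered from zero regardless of each part's original index:

parts = [pd.DataFrame({'a': [1, 2]}), pd.DataFrame({'a': [3, 4]})]
print(pd.concat(parts, ignore_index=True).index)   # RangeIndex(start=0, stop=4, step=1)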
Upvotes: 0