Reputation: 361
I'm extracting a large number of entries from a database (> 15 million entries) with the goal of exporting them to a CSV file. What I get at the end of my query is a one-column dataframe with over 15 million rows. I'm looking for a function to split the entries into multiple columns, one column per million entries.
So for a column of 5 million entries, I would like to get 5 columns of 1 million entries each.
Thanks in advance!
Upvotes: 1
Views: 2066
Reputation: 18561
I agree with @EdChum that this would be simplest given a Series object named s:
# order='F' fills the array column by column, so the first
# million values land in column 0 (the default C order would interleave them)
d = pd.DataFrame(s.values.reshape((1000000, -1), order='F'))
which reshapes your Series into a DataFrame of shape (1,000,000, len(s) / 1,000,000).
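Here's a small, hypothetical illustration of that reshape, using a chunk size of 10 instead of 1,000,000 and a Series whose length divides evenly:

```python
import numpy as np
import pandas as pd

# Assumed toy sizes: a Series of length 50 split into columns of length 10
s = pd.Series(np.arange(50))
N = 10
# order='F' fills column by column, so column 0 gets values 0-9,
# column 1 gets 10-19, and so on
d = pd.DataFrame(s.values.reshape((N, -1), order='F'))
```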
However, the above only works if your Series' length is an exact multiple of 1,000,000. Alternatively, you could do something like:
# note the integer division // (under Python 3, plain / would produce float keys)
s.index = pd.MultiIndex.from_tuples([(x // 1000000, x % 1000000) for x in s.index])
# or an alternative that does the same thing:
# s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x // 1000000, x % 1000000)))
s.unstack(0)
which will give you several columns of the same length, with the last column padded with NaNs.
Here's an example with a Series of length 55 that I want split into columns of length 10. Note the last column has its last 5 values set to NaN:
In [42]: s = pd.Series(np.arange(55))
In [43]: s
Out[43]:
0 0
1 1
2 2
...
53 53
54 54
dtype: int64
In [44]: s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x // 10, x % 10)))
In [45]: s.unstack(0)
Out[45]:
0 1 2 3 4 5
0 0 10 20 30 40 50
1 1 11 21 31 41 51
2 2 12 22 32 42 52
3 3 13 23 33 43 53
4 4 14 24 34 44 54
5 5 15 25 35 45 NaN
6 6 16 26 36 46 NaN
7 7 17 27 37 47 NaN
8 8 18 28 38 48 NaN
9 9 19 29 39 49 NaN
Note two things:
- using s.index.map(lambda ...) should be faster than the list comprehension for very large arrays.
- if using Python 3, make sure to use integer division in the lambda function: lambda x: (x // N, x % N).
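Putting the padding trick together, here's a minimal helper sketch (the name split_into_columns and the small sizes are my own, not part of the original approach), ending with the to_csv export the question is after:

```python
import numpy as np
import pandas as pd

def split_into_columns(s, n):
    # Build a (column, row) MultiIndex, then pivot the column level out;
    # the last column is padded with NaN when len(s) isn't a multiple of n
    idx = pd.MultiIndex.from_tuples([(i // n, i % n) for i in range(len(s))])
    return pd.Series(s.values, index=idx).unstack(0)

wide = split_into_columns(pd.Series(np.arange(23)), 10)
# wide.to_csv("out.csv")  # would export the 10-row, 3-column result
```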
Upvotes: 4
Reputation: 3677
And the ugliest line of code award goes to....
x = 1000000
pd.concat([pd.DataFrame(np.array(df[df.columns[0]].tolist())[:len(df) - len(df) % x].reshape(len(df) // x, x).T), pd.DataFrame(df[df.columns[0]].tolist()[len(df) - len(df) % x:])], axis=1)
You should be set for any value of x. No doubt this can be made 100% prettier; I was just messing around in IPython ;)
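For what it's worth, a tidier equivalent sketch (the small sizes here are assumed for illustration: a chunk length of 10 and a 23-row frame, where the question would use 1,000,000):

```python
import numpy as np
import pandas as pd

x = 10  # chunk size; the question would use 1_000_000
df = pd.DataFrame({"v": np.arange(23)})

col = df[df.columns[0]].to_numpy()
full, rest = divmod(len(col), x)
# Full chunks: reshape to (full, x), then transpose so each chunk is a column
parts = [pd.DataFrame(col[: full * x].reshape(full, x).T)]
if rest:  # any leftover values become one final, shorter column
    parts.append(pd.DataFrame({full: col[full * x:]}))
out = pd.concat(parts, axis=1)
```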
Upvotes: 0