John_Mtl

Reputation: 361

Splitting pandas series into multiple columns

I'm extracting a large number of entries from a database (> 15 million entries) with the goal of exporting them to a CSV file. What I get at the end of my query is a one-column dataframe with over 15 million rows. I'm looking for a way to split the entries into multiple columns, one column per million entries.

So for a column of 5 million entries, I would like to have 5 columns of 1 million entries each.

Thanks in advance!

Upvotes: 1

Views: 2066

Answers (2)

wflynny

Reputation: 18561

I agree with @EdChum that a straight reshape would be simplest. Given a Series object named s:

d = pd.DataFrame(s.values.reshape(1000000, -1, order='F'))  # order='F' so each column holds consecutive entries

which would reshape your Series into a DataFrame of shape (1,000,000, len(s) / 1,000,000), where each column holds 1,000,000 consecutive entries.
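
As a minimal sketch of that reshape at toy scale (the chunk size of 5 and the length-20 Series are illustrative, not from the question):

    import numpy as np
    import pandas as pd

    s = pd.Series(np.arange(20))   # toy Series; length is an exact multiple of the chunk size
    chunk = 5                      # stand-in for 1,000,000
    d = pd.DataFrame(s.values.reshape(chunk, -1, order='F'))
    print(d.shape)        # (5, 4)
    print(d[0].tolist())  # [0, 1, 2, 3, 4] -- the first chunk lands in the first column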


However, the above only works if you have a series whose length is an exact multiple of 1,000,000. Alternatively, you could do something like:

    # integer division (//) makes the first tuple element the column number
    s.index = pd.MultiIndex.from_tuples([(x // 1000000, x % 1000000) for x in s.index])
    # or the equivalent with index.map, which does the same thing:
    # s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x // 1000000, x % 1000000)))
    s.unstack(0)

which will give you several columns of the same length with the last column padded with NaNs.

Here's an example with a Series of length 55 which I want split into columns of length 10. Note the last column has the last 5 values set to NaN:

In [42]: s = pd.Series(np.arange(55))

In [43]: s
Out[43]: 
0      0
1      1
2      2
...
53    53
54    54
dtype: int64

In [44]: s.index = pd.MultiIndex.from_tuples(s.index.map(lambda x: (x // 10, x % 10)))

In [45]: s.unstack(0)
Out[45]: 
   0   1   2   3   4   5
0  0  10  20  30  40  50
1  1  11  21  31  41  51
2  2  12  22  32  42  52
3  3  13  23  33  43  53
4  4  14  24  34  44  54
5  5  15  25  35  45 NaN
6  6  16  26  36  46 NaN
7  7  17  27  37  47 NaN
8  8  18  28  38  48 NaN
9  9  19  29  39  49 NaN

Note two things:

  1. using s.index.map(lambda ...) should be faster than the list comprehension for very large arrays.

  2. the snippets above use integer division (lambda x: (x // N, x % N)), which behaves the same under Python 2 and Python 3; with plain /, Python 3 would produce float labels.
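
Putting the pieces together, here is a self-contained sketch (the helper name split_series is mine, not from the answer; divmod(i, n) computes i // n and i % n in one call):

    import numpy as np
    import pandas as pd

    def split_series(s, n):
        # (column number, row within column) for each position
        idx = pd.MultiIndex.from_tuples([divmod(i, n) for i in range(len(s))])
        return pd.Series(s.values, index=idx).unstack(0)

    df = split_series(pd.Series(np.arange(55)), 10)
    print(df.shape)   # (10, 6) -- six columns, the last padded with NaN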

Upvotes: 4

PdevG

Reputation: 3677

And the ugliest line of code award goes to....

    x = 1000000
    pd.concat([pd.DataFrame(np.array(df[df.columns[0]].tolist())[:len(df) - len(df) % x].reshape(-1, x).T),
               pd.DataFrame(df[df.columns[0]].tolist()[len(df) - len(df) % x:])], axis=1)

You should be set for any value of x. No doubt this could be 100% prettier; I was just messing around in IPython ;)
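
For reference, a less compressed sketch of the same idea on toy data (the column name "entry" and the toy sizes are illustrative):

    import numpy as np
    import pandas as pd

    df = pd.DataFrame({"entry": np.arange(23)})   # toy stand-in for the 15M-row frame
    x = 5                                         # toy chunk size instead of 1000000

    vals = df[df.columns[0]].values
    cut = len(vals) - len(vals) % x                    # how many values fill whole columns
    full = pd.DataFrame(vals[:cut].reshape(-1, x).T)   # columns of x consecutive entries
    rest = pd.DataFrame(vals[cut:])                    # the leftover entries
    result = pd.concat([full, rest], axis=1, ignore_index=True)
    print(result.shape)   # (5, 5): four full columns plus a NaN-padded remainder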

Upvotes: 0
