Ivan

Reputation: 7746

Building a numpy array (matrix) from several dataframes

I have several dataframes with the same structure but different data.

DataFrame 1

                          bid
                        close
time                         
2016-05-24 00:00:00       NaN
2016-05-24 00:05:00  0.000611
2016-05-24 00:10:00 -0.000244
2016-05-24 00:15:00 -0.000122

DataFrame 2

                          bid
                        close
time                         
2016-05-24 00:00:00       NaN
2016-05-24 00:05:00  0.000811
2016-05-24 00:10:00 -0.000744
2016-05-24 00:15:00 -0.000322

I need to build a list of these dataframes and pass it to a function that converts the list to a numpy array. Each row of the resulting matrix should hold the values of one dataframe's ('bid', 'close') column. I don't need the 'time' index.

data = np.array([dataFrames])

returns this (illustrative values, not the actual data):

[[-0.00114415  0.02502565  0.00507831 ...,  0.00653057  0.02183072
  -0.00194293]   <- DataFrame 1 (ignore that the values don't match the data above)
 [-0.01527224  0.02899528 -0.00327654 ...,  0.0322364   0.01821731
  -0.00766773]   <- DataFrame 2 (ignore that the values don't match the data above)
 ....]]

Upvotes: 2

Views: 118

Answers (2)

piRSquared

Reputation: 294218

Setup

import pandas as pd
import numpy as np

df1 = pd.DataFrame([1, 2, 3, 4],
                   index=pd.date_range('2016-04-01', periods=4),
                   columns=pd.MultiIndex.from_tuples([('bid', 'close')]))
df2 = pd.DataFrame([5, 6, 7, 8],
                   index=pd.date_range('2016-03-01', periods=4),
                   columns=pd.MultiIndex.from_tuples([('bid', 'close')]))
print(df1)

             bid
           close
2016-04-01     1
2016-04-02     2
2016-04-03     3
2016-04-04     4

print(df2)

             bid
           close
2016-03-01     5
2016-03-02     6
2016-03-03     7
2016-03-04     8

Solution

df = np.concatenate([d.T.values for d in [df1, df2]])

print(df)

[[1 2 3 4]
 [5 6 7 8]]

Note

The indices do not need to line up. This takes the transposed raw values (a 1 x n np.array) from each dataframe and lets np.concatenate stack them as rows, so the dataframes only need to have the same number of rows.
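The same idea can be wrapped in a function that takes a list of dataframes, as the question asks. This is only a sketch: the name frames_to_matrix and its column argument are made up for illustration, and it assumes every dataframe has a ('bid', 'close') column and the same number of rows.

import numpy as np
import pandas as pd

def frames_to_matrix(frames, column=("bid", "close")):
    # Hypothetical helper: one row per dataframe, index (time) discarded.
    # Selecting a full MultiIndex key returns a 1-D Series of values.
    return np.vstack([frame[column].to_numpy() for frame in frames])

cols = pd.MultiIndex.from_tuples([("bid", "close")])
df1 = pd.DataFrame([1, 2, 3, 4],
                   index=pd.date_range("2016-04-01", periods=4),
                   columns=cols)
df2 = pd.DataFrame([5, 6, 7, 8],
                   index=pd.date_range("2016-03-01", periods=4),
                   columns=cols)

matrix = frames_to_matrix([df1, df2])
print(matrix.shape)

If the dataframes differ in length, np.vstack raises a ValueError, so they would need to be aligned or trimmed first.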

Upvotes: 1

hilberts_drinking_problem

Reputation: 11602

Try

master_matrix = pd.concat(list_of_dfs, axis=1)
master_matrix = master_matrix.values.reshape(master_matrix.shape, order='F')

if each row in the final matrix corresponds to the same date

master_matrix = pd.concat(list_of_dfs, axis=1).values

otherwise.
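To make the alignment behaviour concrete, here is a small sketch of what pd.concat(..., axis=1) does when the timestamps only partly overlap. The frames a and b and their values are invented for illustration; the transpose at the end is my assumption about the layout wanted in the question (one row per dataframe).

import numpy as np
import pandas as pd

# Two hypothetical frames whose 5-minute timestamps overlap only partly.
a = pd.DataFrame({"bid_close": [0.1, 0.2, 0.3]},
                 index=pd.date_range("2016-05-24 00:00", periods=3, freq="5min"))
b = pd.DataFrame({"bid_close": [0.4, 0.5, 0.6]},
                 index=pd.date_range("2016-05-24 00:05", periods=3, freq="5min"))

# axis=1 places the frames side by side, aligned on the union of timestamps;
# a timestamp missing from one frame becomes NaN in that frame's column.
aligned = pd.concat([a, b], axis=1)

# One column per dataframe, so transpose to get one row per dataframe.
matrix = aligned.to_numpy().T
print(matrix.shape)

If the time alignment does not matter, stacking the raw values directly avoids the NaN padding.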

Edit to address the newly added example: in this case, you can use np.vstack on the last column of each dataframe.

import pandas as pd
import numpy as np
from io import StringIO

df1 = pd.read_csv(StringIO(
'''
time                bid_close
2016-05-24 00:00:00       NaN
2016-05-24 00:05:00  0.000611
2016-05-24 00:10:00 -0.000244
2016-05-24 00:15:00 -0.000122
'''), sep=r' +', engine='python')

df2 = pd.read_csv(StringIO(
'''
time                bid_close
2016-05-24 00:00:00       NaN
2016-05-24 00:05:00  0.000811
2016-05-24 00:10:00 -0.000744
2016-05-24 00:15:00 -0.000322
'''), sep=r' +', engine='python')

dfs = [df1, df2]

# np.vstack needs a sequence (list or tuple); passing a bare generator fails on current NumPy
out = np.vstack([df.iloc[:, -1].values for df in dfs])

Result:

In [10]: out
Out[10]:
array([[      nan,  0.000611, -0.000244, -0.000122],
       [      nan,  0.000811, -0.000744, -0.000322]])

Upvotes: 1
