Reputation: 2855
I need to obtain the type for each column to properly preprocess it.
Currently I do this via the following method:
import pandas as pd
# input is of type List[List[Any]]
# but each column holds exactly one type (int, float, str, bool)
df = pd.DataFrame(input, columns=key_labels)
column_types = dict(df.dtypes)
matrix = df.values
Since I only use pandas for obtaining the dtypes (per column) and use numpy for everything else, I want to drop pandas from my project.
In summary: Is there a way to obtain (specific) dtypes per column from numpy?
Or: Is there a fast way to recompute the dtype of an ndarray (after slicing the matrix)?
Upvotes: 6
Views: 7014
Reputation: 1352
In order to obtain each column type and use it in your program, you can use Numpy Structured Arrays.
Structured Arrays are a composition of simpler data types organized as a sequence of named fields.
They have a dtype property which you can use to answer your question.
Note that Numpy also has a “Record Array” or “recarray” data type, that is quite similar to Structured Arrays. But according to this post, Record Arrays are much slower than Structured Arrays and are probably kept for convenience and backward compatibility.
import numpy as np
# Initialize structured array.
df = np.array([(10, 3.14, 'Hello', True),
               (20, 2.71, 'World', False)],
              dtype=[("ci", "i4"),
                     ("cf", "f4"),
                     ("cs", "U16"),
                     ("cb", "?")])
# Basic usage.
print(df)
print(np.size(df))
print(df.shape)
print(df["cs"])
print(df["cs"][0])
print(type(df))
print(df.dtype)
print(df.dtype.names)
# Check exact data type.
print(df.dtype["ci"] == "i4")
print(df.dtype["cf"] == "f4")
print(df.dtype["cs"] == "U16")
print(df.dtype["cb"] == "?")
# Check general data type kind.
print(df.dtype["ci"].kind == "i")
print(df.dtype["cf"].kind == "f")
print(df.dtype["cs"].kind == "U")
print(df.dtype["cb"].kind == "b")
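If, like the question, you want a dict mapping column names to dtypes (the numpy counterpart of `dict(df.dtypes)`), the structured dtype gives you that directly. A minimal sketch reusing the array built above:

```python
import numpy as np

# Same structured array as above.
arr = np.array([(10, 3.14, 'Hello', True),
                (20, 2.71, 'World', False)],
               dtype=[("ci", "i4"), ("cf", "f4"), ("cs", "U16"), ("cb", "?")])

# arr.dtype.names lists the field names; arr.dtype[name] is that field's dtype.
column_types = {name: arr.dtype[name] for name in arr.dtype.names}
print(column_types)
# {'ci': dtype('int32'), 'cf': dtype('float32'), ...}
```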
Upvotes: 2
Reputation: 231665
It would help if you gave a concrete example, but I'll demonstrate with @jpp's list:
In [509]: L = [[0.5, True, 'hello'], [1.25, False, 'test']]
In [510]: df = pd.DataFrame(L)
In [511]: df
Out[511]:
      0      1      2
0  0.50   True  hello
1  1.25  False   test
In [512]: df.dtypes
Out[512]:
0    float64
1       bool
2     object
dtype: object
pandas doesn't like to use string dtypes, so the last column is object.
In [513]: arr = df.values
In [514]: arr
Out[514]:
array([[0.5, True, 'hello'],
[1.25, False, 'test']], dtype=object)
So because of the mix in column dtypes, pandas is making the whole thing object. I don't know pandas well enough to know if you can control the dtype better.
To make a numpy structured array from L, the obvious thing to do is:
In [515]: np.array([tuple(row) for row in L], dtype='f,bool,U10')
Out[515]:
array([(0.5 , True, 'hello'), (1.25, False, 'test')],
dtype=[('f0', '<f4'), ('f1', '?'), ('f2', '<U10')])
That answers the question of how to specify a different dtype per 'column'. But keep in mind that this array is 1d, and has fields, not columns.
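To see the fields-versus-columns distinction concretely, here is a small sketch using the same array; note that 2d-style column indexing fails on a 1d structured array:

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]
arr = np.array([tuple(row) for row in L], dtype='f,bool,U10')

print(arr.shape)   # (2,) -- one dimension, two records
print(arr['f0'])   # the float field across all records
# arr[:, 0] would raise IndexError: a 1d array has no second axis
```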
But whether it's possible to deduce or set the dtype automatically, that's trickier. It might be possible to build a recarray from the columns, or use one of the functions in np.lib.recfunctions.
If I use a list 'transpose' I can format each column as a separate numpy array.
In [537]: [np.array(col) for col in zip(*L)]
Out[537]:
[array([0.5 , 1.25]),
array([ True, False]),
array(['hello', 'test'], dtype='<U5')]
Then join them into one array with rec.fromarrays:
In [538]: np.rec.fromarrays([np.array(col) for col in zip(*L)])
Out[538]:
rec.array([(0.5 , True, 'hello'), (1.25, False, 'test')],
dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])
Or I could use genfromtxt to deduce fields from a csv format.
In [526]: np.savetxt('test.txt', np.array(L,object),delimiter=',',fmt='%s')
In [527]: cat test.txt
0.5,True,hello
1.25,False,test
In [529]: data = np.genfromtxt('test.txt',dtype=None,delimiter=',',encoding=None)
In [530]: data
Out[530]:
array([(0.5 , True, 'hello'), (1.25, False, 'test')],
dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])
Upvotes: 3
Reputation: 22033
In numpy, a (regular) array has the same dtype for all its entries. So no, it's not possible to have a dedicated/fast float dtype in one column and a different dtype in another.
That's the point of pandas: to let you jump from one column with one type to another.
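A quick sketch of what this means in practice, using the list from the other answers: handing mixed-type rows to a plain ndarray (as pandas' .values does) collapses everything to the generic object dtype.

```python
import numpy as np

# Mixed-type rows: one float, one bool, one str per row.
L = [[0.5, True, 'hello'], [1.25, False, 'test']]
arr = np.array(L, dtype=object)

# One dtype for the whole array -- no per-column float/bool/str dtypes.
print(arr.dtype)  # object
print(type(arr[0, 0]), type(arr[0, 1]), type(arr[0, 2]))
```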
Upvotes: 2
Reputation: 164823
Is there a way to obtain (specific) dtypes per column from numpy?
No, there isn't. Since your dataframe has mixed types, your NumPy dtype will be object. Such an array is not stored in a contiguous memory block with each column having a fixed dtype. Instead, each value in the 2d array is a pointer.
Your question is no different from asking whether you can get the type of each "column" in this list of lists:
L = [[0.5, True, 'hello'], [1.25, False, 'test']]
Since the data in a collection of pointers has no columnar structure, there's no concept of "column dtype". You can test the type of each value for specific indices in each sublist. But this defeats the point of Pandas / NumPy.
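If you really do need per-column Python types from the raw list of lists, one sketch is to inspect a single row, relying on the question's guarantee of exactly one type per column. The field names f0, f1, f2 and the U16 string width below are arbitrary choices for illustration:

```python
import numpy as np

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# One type per column is assumed, so the first row is representative.
column_types = [type(v) for v in L[0]]
print(column_types)  # [<class 'float'>, <class 'bool'>, <class 'str'>]

# Those types can seed a structured dtype; str needs an explicit width
# because np.dtype(str) alone gives a zero-length string type.
dt = np.dtype([(f'f{i}', np.dtype(t) if t is not str else 'U16')
               for i, t in enumerate(column_types)])
print(dt)
```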
Upvotes: 1