Daan Luttik

Reputation: 2855

Numpy obtain dtype per column

I need to obtain the type for each column to properly preprocess it.

Currently I do this via the following method:

import pandas as pd

# input is of type List[List[any]]
# but has one type (int, float, str, bool) per column

df = pd.DataFrame(input, columns=key_labels)
column_types = dict(df.dtypes)
matrix = df.values

Since I only use pandas for obtaining the dtypes (per column) and use numpy for everything else, I want to cut pandas from my project.

In summary: is there a way to obtain (specific) dtypes per column from numpy?

Or: is there a fast way to recompute the dtype of an ndarray (after slicing the matrix)?
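For reference, a minimal pandas-free sketch of the kind of thing I'm after: converting each column separately and letting numpy infer its dtype (variable names here are illustrative):

```python
import numpy as np

# Example input: List[List[any]], one Python type per column.
rows = [[1, 0.5, "a", True],
        [2, 1.25, "b", False]]

# Transpose to columns and let np.array infer a dtype for each one.
columns = [np.array(col) for col in zip(*rows)]
column_types = {i: col.dtype for i, col in enumerate(columns)}
print(column_types)
```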

Upvotes: 6

Views: 7014

Answers (4)

nico

Reputation: 1352

In order to obtain each column type and use it in your program, you can use Numpy Structured Arrays.

Structured Arrays are a composition of simpler data types organized as a sequence of named fields.

They have a property called dtype which you can use to answer your question.

Note that Numpy also has a “Record Array” or “recarray” data type, which is quite similar to Structured Arrays. But according to this post, Record Arrays are much slower than Structured Arrays and are probably kept mainly for convenience and backward compatibility.

import numpy as np

# Initialize structured array.
df = np.array([(10, 3.14, 'Hello', True),
               (20, 2.71, 'World', False)],
              dtype=[("ci", "i4"),
                     ("cf", "f4"),
                     ("cs", "U16"),
                     ("cb", "?")])

# Basic usage.
print(df)
print(np.size(df))
print(df.shape)
print(df["cs"])
print(df["cs"][0])
print(type(df))
print(df.dtype)
print(df.dtype.names)

# Check exact data type.
print(df.dtype["ci"] == "i4")
print(df.dtype["cf"] == "f4")
print(df.dtype["cs"] == "U16")
print(df.dtype["cb"] == "?")

# Check general data type kind.
print(df.dtype["ci"].kind == "i")
print(df.dtype["cf"].kind == "f")
print(df.dtype["cs"].kind == "U")
print(df.dtype["cb"].kind == "b")
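For comparison, a recarray built from the same data exposes the fields as attributes instead of string keys (a small sketch, reusing the field names above):

```python
import numpy as np

# Same data, but as a record array: fields become attributes.
ra = np.rec.array([(10, 3.14, 'Hello', True),
                   (20, 2.71, 'World', False)],
                  dtype=[("ci", "i4"), ("cf", "f4"),
                         ("cs", "U16"), ("cb", "?")])

print(ra.cs)           # attribute access instead of ra["cs"]
print(ra.dtype["ci"])  # per-field dtype works the same way
```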

Upvotes: 2

hpaulj

Reputation: 231665

It would help if you gave a concrete example, but I'll demonstrate with @jpp's list:

In [509]: L = [[0.5, True, 'hello'], [1.25, False, 'test']]
In [510]: df = pd.DataFrame(L)
In [511]: df
Out[511]: 
      0      1      2
0  0.50   True  hello
1  1.25  False   test
In [512]: df.dtypes
Out[512]: 
0    float64
1       bool
2     object
dtype: object

pandas doesn't like to use string dtypes, so the last column is object.

In [513]: arr = df.values
In [514]: arr
Out[514]: 
array([[0.5, True, 'hello'],
       [1.25, False, 'test']], dtype=object)

So because of the mix in column dtypes, pandas is making the whole thing object. I don't know pandas well enough to know if you can control the dtype better.

To make a numpy structured array from L, the obvious thing to do is:

In [515]: np.array([tuple(row) for row in L], dtype='f,bool,U10')
Out[515]: 
array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f4'), ('f1', '?'), ('f2', '<U10')])

That answers the question of how to specify a different dtype per 'column'. But keep in mind that this array is 1d, and has fields not columns.

But whether it's possible to deduce or set the dtype automatically, that's trickier. It might be possible to build a recarray from the columns, or use one of the functions in np.lib.recfunctions.

If I use a list 'transpose' I can format each column as a separate numpy array.

In [537]: [np.array(col) for col in zip(*L)]
Out[537]: 
[array([0.5 , 1.25]),
 array([ True, False]),
 array(['hello', 'test'], dtype='<U5')]

Then join them into one array with rec.fromarrays:

In [538]: np.rec.fromarrays([np.array(col) for col in zip(*L)])
Out[538]: 
rec.array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
          dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])

Or I could use genfromtxt to deduce fields from a csv format.

In [526]: np.savetxt('test.txt', np.array(L,object),delimiter=',',fmt='%s')
In [527]: cat test.txt
0.5,True,hello
1.25,False,test

In [529]: data = np.genfromtxt('test.txt',dtype=None,delimiter=',',encoding=None)
In [530]: data
Out[530]: 
array([(0.5 ,  True, 'hello'), (1.25, False, 'test')],
      dtype=[('f0', '<f8'), ('f1', '?'), ('f2', '<U5')])

Upvotes: 3

Matthieu Brucher

Reputation: 22033

In numpy, an array has a single dtype shared by all its entries. So no, it's not possible to have a dedicated/fast float type in one column and a different type in another column.

That's the whole point of pandas: it allows each column to carry its own type.
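To illustrate the single-dtype point: when you mix types in one plain ndarray, numpy promotes everything to a common dtype (a small sketch):

```python
import numpy as np

# A plain ndarray has exactly one dtype; mixed inputs get upcast.
a = np.array([[1, 2.5], [3, 4.5]])
print(a.dtype)   # float64: the int values were promoted

b = np.array([[1, 'x'], [2, 'y']])
print(b.dtype)   # a Unicode string dtype: everything became text
```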

Upvotes: 2

jpp

Reputation: 164823

Is there a way to obtain (specific) dtypes per column from numpy

No, there isn't. Since your dataframe has mixed types, your NumPy dtype will be object. Such an array is not stored in a contiguous memory block with each column having a fixed dtype. Instead, each value in the 2d array is a pointer to a Python object.

Your question is no different from asking whether you can get the type of each "column" in this list of lists:

L = [[0.5, True, 'hello'], [1.25, False, 'test']]

Since the data in a collection of pointers has no columnar structure, there's no concept of "column dtype". You can test the type of each value for specific indices in each sublist. But this defeats the point of Pandas / NumPy.
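For completeness, checking the Python type per "column" of such a list of lists, which is all an object array really gives you, would look like this sketch:

```python
L = [[0.5, True, 'hello'], [1.25, False, 'test']]

# Each "column" is just the i-th element of every sublist;
# inspect the Python type of its first value.
col_types = [type(col[0]) for col in zip(*L)]
print(col_types)  # [float, bool, str]
```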

Upvotes: 1
