JayD Journal

Reputation: 101

Type Hint For NamedTuple Returned By Pandas DataFrame itertuples()

itertuples() is a nice way to iterate over a pandas DataFrame, and it yields one namedtuple per row.

import pandas as pd
import numpy as np

df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},index=['dog', 'hawk'])
for row in df.itertuples():
    print(type(row))
    print(row)

This prints:

<class 'pandas.core.frame.Pandas'>
Pandas(Index='dog', num_legs=4, num_wings=0)
<class 'pandas.core.frame.Pandas'>
Pandas(Index='hawk', num_legs=2, num_wings=2)

What is the correct way, if any, to add type hints to the returned namedtuples?

Upvotes: 10

Views: 2158

Answers (4)

Bravhek

Reputation: 331

One possible solution, if the column names and data types are fixed, is to declare the structure of a DataFrame row explicitly as a NamedTuple:

from typing import NamedTuple
import pandas as pd


class Row(NamedTuple):
    Index: str  # itertuples() yields the index first, then the columns
    num_legs: int
    num_wings: int


data = {"num_legs": [4, 2], "num_wings": [0, 2]}
df = pd.DataFrame(data, index=["dog", "hawk"])
row: Row
for row in df.itertuples(name="Row"):
    print(row.num_legs)
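
A static type checker may reject the bare `row: Row` declaration if it already knows a different return type for itertuples(), for example from installed stubs. One way around that, sketched here as an assumption rather than an official pandas recipe, is to wrap the call in typing.cast. The cast does nothing at runtime and validates nothing; it only tells the checker which row shape to assume.

```python
from typing import Iterable, NamedTuple, cast

import pandas as pd


class Row(NamedTuple):
    # Field order mirrors what itertuples() yields: the index first, then columns.
    Index: str
    num_legs: int
    num_wings: int


df = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]}, index=["dog", "hawk"])

# cast() is a no-op at runtime; the objects are still pandas' own namedtuples.
legs = []
for row in cast(Iterable[Row], df.itertuples(name="Row")):
    legs.append(row.num_legs)  # the checker now resolves this attribute as int

print(legs)
```

If the DataFrame's columns ever drift away from the Row definition, nothing here will notice, so this only suits frames whose schema is fixed.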

Upvotes: 2

Igor

Reputation: 1737

Here's a slightly modified version of Bravhek's answer that adds runtime type checking:

from typing import NamedTuple, get_type_hints

import pandas as pd

Row = NamedTuple(
    "Row",  # the typename should match the variable it is assigned to
    [("Index", str), ("num_legs", int), ("num_wings", int)],
)

df = pd.DataFrame(
    {"num_legs": [4, 2, 'a'], "num_wings": [0, 2, 3]}, index=["dog", "hawk", "bad_record"]
)

# Just a protocol type hint:
row: Row
for row in df.itertuples():
    print(row.num_legs)


# Actual type checking:
if set(Row._fields) != set(df.columns.tolist()) | {'Index'}:
    print('columns mismatch')
for row in df.itertuples():
    for fn in Row._fields:
        if not isinstance(getattr(row,fn), get_type_hints(Row)[fn]):
            print('type mismatch in column "{}", row "{}"'.format(fn, row))
    print(row.num_legs)

It prints the following:

4
2
a
4
2
type mismatch in column "num_legs", row "Pandas(Index='bad_record', num_legs='a', num_wings=3)"
a

The protocol type hint part could be useful to silence IDE warnings (e.g. PyCharm "unresolved attribute reference"), but it would not validate anything.
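
The two parts above can also be combined so that the loop variable really is a Row. The sketch below is one possible arrangement, not part of pandas: iter_rows is a hypothetical helper that asks itertuples() for plain tuples (name=None), rebuilds each one with Row._make, and raises TypeError on the first field whose value does not match its annotation.

```python
from typing import Iterator, NamedTuple, get_type_hints

import pandas as pd


class Row(NamedTuple):
    Index: str
    num_legs: int
    num_wings: int


def iter_rows(df: pd.DataFrame) -> Iterator[Row]:
    """Yield validated Row instances (hypothetical helper, not a pandas API)."""
    hints = get_type_hints(Row)
    # name=None makes itertuples() yield plain tuples: (index, col1, col2, ...)
    for values in df.itertuples(name=None):
        row = Row._make(values)  # fails if the number of fields differs
        for field, expected in hints.items():
            value = getattr(row, field)
            if not isinstance(value, expected):
                raise TypeError(f"{field}={value!r} is not {expected.__name__}")
        yield row


good = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]}, index=["dog", "hawk"])
rows = list(iter_rows(good))

bad = pd.DataFrame(
    {"num_legs": [4, "a"], "num_wings": [0, 3]}, index=["dog", "bad_record"]
)
try:
    list(iter_rows(bad))
    error = None
except TypeError as exc:
    error = str(exc)

print(rows)
print(error)
```

The isinstance check only works for simple annotations such as int or str; generic annotations like list[int] would need a different comparison.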

Upvotes: 0

ProsperousHeart

Reputation: 321

import pandas as pd
import numpy as np

df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},index=['dog', 'hawk'])
for row in df.itertuples():
    print(type(row))
    print(row)

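You'll notice that the reported type is pandas.core.frame.Pandas, but using that name as an annotation raises an error, because the class is created on the fly by collections.namedtuple and is not an attribute of pandas.core.frame. You'll need to annotate with pd.core.frame.pandas instead: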

import pandas as pd
import numpy as np

def test_chk(row2chk: pd.core.frame.pandas):
    print(row2chk)
    print(row2chk.num_legs)  # prints the value in the num_legs column

df = pd.DataFrame({'num_legs': [4, 2], 'num_wings': [0, 2]},index=['dog', 'hawk'])
for row in df.itertuples():
    print(type(row))
    test_chk(row)
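
Whichever annotation is used, it may help to look at what the row object actually is. The short sketch below inspects the class that itertuples() produces; describe is a hypothetical function showing that plain tuple is an annotation that is always valid, at the cost of hiding the named fields from a static checker.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]}, index=["dog", "hawk"])
row = next(df.itertuples())

row_type = type(row)
print(row_type.__name__)       # class name, controlled by the `name` argument
print(row._fields)             # field names: the index first, then the columns
print(isinstance(row, tuple))  # namedtuple classes subclass tuple


def describe(row: tuple) -> str:
    # Positional access works on any tuple; attribute access such as
    # row.num_legs still works at runtime but is invisible to a type checker.
    return f"{row[0]}: {row[1]} legs"


print(describe(row))
```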

Upvotes: -1

user12734467

Reputation:

I don't think it's possible, because your DataFrame can hold any arbitrary data types, and so the tuples can contain any data type present in the DataFrame. In the same way that you can't use Python type hints to specify the column types of a DataFrame, you can't explicitly type those named tuples.

If you need the type information of the columns before going into your for loop, you can certainly use df.dtypes, which gives you a Series with the column types.
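
As a rough illustration of the df.dtypes suggestion, the sketch below reads the column types before looping and uses them as a simple guard. The all-integer requirement is only an example condition chosen for this frame, and the mixed frame is included to show how a column holding both numbers and strings is reported.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [4, 2], "num_wings": [0, 2]}, index=["dog", "hawk"])

# df.dtypes is a Series mapping each column name to its dtype.
column_types = {name: str(dtype) for name, dtype in df.dtypes.items()}
print(column_types)

# Example guard: only iterate if every column holds integers.
all_integer = all(pd.api.types.is_integer_dtype(dtype) for dtype in df.dtypes)
if all_integer:
    total_legs = sum(row.num_legs for row in df.itertuples())
    print(total_legs)

# A column mixing ints and strings is not integer-typed, so its rows
# can carry values of different Python types.
mixed = pd.DataFrame({"num_legs": [4, "a"]})
mixed_is_integer = pd.api.types.is_integer_dtype(mixed.dtypes["num_legs"])
print(mixed_is_integer)
```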

Upvotes: 2
