gernophil

Reputation: 495

Does polars not distinguish between tuple and list?

In pandas, I sometimes cast nested lists to tuples, e.g. to be able to drop duplicates (being aware that the order of the elements then matters). Polars does not seem to distinguish between lists and tuples, and I can't find any more information on this. Could someone elaborate on this a little?

import polars as pl

dftuple_pl = pl.DataFrame({"col1": [("a", "a"), ("a", "a")],
                           "col2": [("b", "b"), ("b", "b")]})

dflist_pl = pl.DataFrame({"col1": [["a", "a"], ["a", "a"]],
                          "col2": [["b", "b"], ["b", "b"]]})

print(dftuple_pl.equals(dflist_pl))
# True

print(dftuple_pl.unique())
# shape: (1, 2)
# ┌────────────┬────────────┐
# │ col1       ┆ col2       │
# │ ---        ┆ ---        │
# │ list[str]  ┆ list[str]  │
# ╞════════════╪════════════╡
# │ ["a", "a"] ┆ ["b", "b"] │
# └────────────┴────────────┘

print(dflist_pl.unique())
# shape: (1, 2)
# ┌────────────┬────────────┐
# │ col1       ┆ col2       │
# │ ---        ┆ ---        │
# │ list[str]  ┆ list[str]  │
# ╞════════════╪════════════╡
# │ ["a", "a"] ┆ ["b", "b"] │
# └────────────┴────────────┘

import pandas as pd

dftuple_pd = pd.DataFrame({"col1": [("a", "a"), ("a", "a")],
                           "col2": [("b", "b"), ("b", "b")]})

dflist_pd = pd.DataFrame({"col1": [["a", "a"], ["a", "a"]],
                          "col2": [["b", "b"], ["b", "b"]]})

print(dftuple_pd.equals(dflist_pd))
# False

print(dftuple_pd.drop_duplicates())
#      col1    col2
# 0  (a, a)  (b, b)

print(dflist_pd.drop_duplicates())
# TypeError: unhashable type: 'list'

So, is it useless (or maybe even impossible) to cast columns to tuples in Polars (e.g. using .map_elements(tuple))?

Upvotes: 2

Views: 121

Answers (1)

etrotta

Reputation: 268

Polars does not use Python types; it has its own data types (see the User Guide and Reference).

This also applies to pandas. Both Polars and pandas have an Object data type that they use for generic values they do not explicitly support. In the case of pandas, this includes strings, tuples, and lists.

You could explicitly tell Polars to use the pl.Object data type when creating the DataFrame, but the native Polars data types are more efficient than generic Python objects overall. Just like you would use pl.String for strings, you should use pl.List for lists and tuples.

(Alternatively, you can use pl.Array if your nested data has a fixed length, or pl.Struct if you want nested columns.)

import polars as pl
# Seriously, don't do this. Avoid pl.Object() as much as possible.
df = pl.DataFrame([
    pl.Series("tuples", [("a", "a"), ("a", "a")], dtype=pl.Object()),
    pl.Series("lists", [["b", "b"], ["b", "b"]], dtype=pl.Object()),
])

Upvotes: 2
