Reputation: 495
In Pandas I sometimes cast nested lists to tuples, e.g. to be able to drop duplicates (being aware that the order of the elements matters). In Polars there does not seem to be a difference between lists and tuples. I can't find any more information on this. Could someone elaborate on this a little?
import polars as pl
dftuple_pl = pl.DataFrame({"col1": [("a", "a"), ("a", "a")],
                           "col2": [("b", "b"), ("b", "b")]})
dflist_pl = pl.DataFrame({"col1": [["a", "a"], ["a", "a"]],
                          "col2": [["b", "b"], ["b", "b"]]})
print(dftuple_pl.equals(dflist_pl))
# True
print(dftuple_pl.unique())
# shape: (1, 2)
# ┌────────────┬────────────┐
# │ col1 ┆ col2 │
# │ --- ┆ --- │
# │ list[str] ┆ list[str] │
# ╞════════════╪════════════╡
# │ ["a", "a"] ┆ ["b", "b"] │
# └────────────┴────────────┘
print(dflist_pl.unique())
# shape: (1, 2)
# ┌────────────┬────────────┐
# │ col1 ┆ col2 │
# │ --- ┆ --- │
# │ list[str] ┆ list[str] │
# ╞════════════╪════════════╡
# │ ["a", "a"] ┆ ["b", "b"] │
# └────────────┴────────────┘
import pandas as pd
dftuple_pd = pd.DataFrame({"col1": [("a", "a"), ("a", "a")],
                           "col2": [("b", "b"), ("b", "b")]})
dflist_pd = pd.DataFrame({"col1": [["a", "a"], ["a", "a"]],
                          "col2": [["b", "b"], ["b", "b"]]})
print(dftuple_pd.equals(dflist_pd))
# False
print(dftuple_pd.drop_duplicates())
# col1 col2
# 0 (a, a) (b, b)
print(dflist_pd.drop_duplicates())
# TypeError: unhashable type: 'list'
So, is it useless (or maybe even impossible) to cast columns to tuples in Polars, e.g. using .map_elements(tuple)?
Upvotes: 2
Views: 121
Reputation: 268
Polars does not use Python types; it has its own data types (see the User Guide and Reference).
This also applies to Pandas, and both Polars and Pandas have an Object data type that they use for values they have no explicit support for. In the case of pandas, this includes strings, tuples, and lists.
You could explicitly tell it to use the pl.Object data type when creating the DataFrame, but using the Polars data types is more efficient overall than using generic Python objects. Just like you would use pl.String for strings, you should use pl.List for lists and tuples.
(Alternatively, you can use pl.Array if your nested data has a fixed length, or pl.Struct if you want nested columns.)
import polars as pl
# Seriously, don't do this. Avoid pl.Object() as much as possible.
df = pl.DataFrame([
    pl.Series("tuples", [("a", "a"), ("a", "a")], dtype=pl.Object()),
    pl.Series("lists", [["b", "b"], ["b", "b"]], dtype=pl.Object()),
])
Upvotes: 2