Joe

Reputation: 468

Why does vaex change column names that contain a period?

While using vaex I ran into an unexpected error: NameError: name 'column_2_0' is not defined.

After some investigation I found that the problematic column in my data source (an HDF5 file) is actually called column_2.0, and that vaex renames it to column_2_0. When I then perform operations using the column names, I run into the error. Here is a simple example that reproduces it:

import pandas as pd
import vaex
cols = ['abc_1', 'abc1', 'abc.1']
vals = list(range(0,len(cols)))
df = pd.DataFrame([vals], columns=cols)
dfv = vaex.from_pandas(df)

for col in dfv.column_names:
    dfv = dfv[dfv[col].notna()]

dfv.count()
...
NameError: name 'abc_1_1' is not defined

In this case it appears that vaex tries to rename abc.1 to abc_1, which is already taken, so it ends up using abc_1_1 instead.

I know that I can rename the column like dfv.rename('abc_1_1', 'abc_dot_1'), but (a) I'd need to introduce special logic for naming conflicts like in this example where the column name that vaex comes up with is already taken and (b) I'd rather not have to do this manually each time I have a column that contains a period.

I could also require that column names in my source data never contain a period, but that seems like a stretch, given that pandas and other common data sources don't have this restriction.

What are some ideas to deal with this problem other than the two I mentioned above?

Upvotes: 0

Views: 859

Answers (1)

Joco

Reputation: 813

In Vaex the columns are in fact "Expressions". Expressions allow you to build a sort of computational graph behind the scenes as you do your regular dataframe operations. However, that requires the column names to be as "clean" as possible.

So column names like '2' or '2.5' are not allowed, since the expression system can interpret them as numbers rather than column names. Likewise, the expression system can interpret a column name like 'first-name' as df['first'] - df['name'].
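To make the ambiguity concrete, here is a minimal sketch that uses Python's own ast module to show how such names read when treated as expression text. It assumes the expression strings follow Python-like syntax; it is not vaex's actual parser, and the helper how_python_reads is a hypothetical name used only for illustration.

```python
import ast


def how_python_reads(name):
    """Describe how a column name parses if treated as a Python expression.

    Hypothetical helper for illustration; this is not part of vaex.
    """
    try:
        node = ast.parse(name, mode="eval").body
    except SyntaxError:
        return "syntax error"
    if isinstance(node, ast.Name):
        return "name"              # usable as a plain column reference
    if isinstance(node, ast.Constant):
        return "constant"          # read as a literal such as a number
    if isinstance(node, ast.BinOp):
        return "binary operation"  # e.g. a subtraction of two names
    return type(node).__name__


for candidate in ["abc_1", "2.5", "first-name", "abc.1"]:
    print(repr(candidate), "->", how_python_reads(candidate))
```

Only the first name survives as a plain name. 'abc.1' does not parse at all, because the tokenizer reads '.1' as a separate number.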

To avoid this, vaex automatically renames such columns so that they can be used in the expression system. Doing this correctly in every code path is actually quite complicated, and in your example above you've found a case that has not been covered yet (isna / notna).

By the way, you can always access the original names via df.get_column_names(alias=True).
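If you would rather control the names yourself before handing the data to vaex, one option is to sanitize them up front and keep the mapping. The sketch below is only an assumption about how that could look: make_safe_names is a hypothetical helper, it is not the renaming scheme vaex uses internally, and it assumes the original names are unique.

```python
import re


def make_safe_names(names):
    """Map each original column name to a unique, identifier-safe name.

    Hypothetical helper; assumes the input names are unique.
    """
    # Reserve names that are already valid so they are never overwritten.
    taken = {n for n in names if n.isidentifier()}
    mapping = {}
    for name in names:
        if name.isidentifier():
            mapping[name] = name
            continue
        # Replace every non-word character (such as '.' or '-') with '_'.
        base = re.sub(r"\W", "_", name)
        # Identifiers cannot be empty or start with a digit.
        if not base or base[0].isdigit():
            base = "_" + base
        # Append a numeric suffix until the name is free.
        candidate = base
        suffix = 1
        while candidate in taken:
            candidate = f"{base}_{suffix}"
            suffix += 1
        taken.add(candidate)
        mapping[name] = candidate
    return mapping


mapping = make_safe_names(["abc_1", "abc1", "abc.1"])
print(mapping)
```

With pandas, the mapping can be applied via df.rename(columns=mapping) before calling vaex.from_pandas(df), and inverting the dictionary recovers the original names later.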

Upvotes: 1
