Reputation: 13
I'm getting a TypeError: expected string or bytes-like object when processing this dataset with the Vaex Python library. I've written the following code:
import pyarrow as pa
import vaex
import re
# Reading Data
anime = vaex.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/anime.csv')
user = vaex.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/rating.csv')
# Removing Non-Alphanumeric characters
@vaex.register_function()
def replacer(x):
    res = [re.sub('[^A-Za-z]', ' ', value) for value in x.tolist()]
    res = [re.sub(' +', ' ', value.lower()) for value in res]  # Remove redundant whitespace
    return pa.array(res, pa.string())
anime['name_clean'] = anime.func.replacer(anime['name'])
anime = anime[anime['name_clean']!=' '] # Filter empty text
anime['name_clean']
# Merging anime and users
data = user[['user_id', 'anime_id']].join(
    anime[['anime_id', 'name_clean']], on='anime_id')['user_id', 'name_clean']
data['user_id'] = data['user_id'].astype('str')
The problem occurs when I do
data['name_clean'].tolist()
When I process the same dataset using pandas everything works fine.
import pandas as pd
# Reading Data
anime = pd.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/anime.csv')
user = pd.read_csv('/content/drive/MyDrive/Temp Datasets/Anime Recommendation Dataset/rating.csv')
# Removing Non-Alphanumeric characters
def replacer(x):
    res = re.sub('[^A-Za-z]', ' ', x)
    res = re.sub(' +', ' ', res.lower())  # Remove redundant whitespace
    return res
anime['name'] = anime['name'].apply(replacer)
anime = anime[anime['name']!=' '] # Filter empty text
# Merging anime and users
data = user[['user_id', 'anime_id']].merge(
    anime[['anime_id', 'name']], on='anime_id')
data['user_id'] = data['user_id'].astype('str')
P.S. I think the problem is with using re.sub() with Vaex, because when I print data['name_clean'] the type is shown as "string". I can't find any solution, or any other way (including the apply method), to remove non-alphanumeric characters in a Vaex dataframe without causing this problem.
Upvotes: 0
Views: 154
Reputation: 813
I think what is going wrong in your example is that you assume that x in the registered function replacer is a single sample. In fact, vaex processes things in chunks, so x is most likely an array, and its size might vary.
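As a rough illustration of the chunk-wise behaviour described above, the sketch below cleans a whole list of values at once and passes missing entries through untouched. The helper name clean_chunk is hypothetical, and the idea that None values reach re.sub in the question's data is an assumption, not something confirmed in this thread. The regular expressions are the ones from the question.

```python
import re


def clean_chunk(values):
    """Clean one chunk (a list of strings); None entries are kept as None."""
    cleaned = []
    for value in values:
        if value is None:
            # Assumption: missing values arrive as None. Passing None to
            # re.sub raises a TypeError (expected string or bytes-like object).
            cleaned.append(None)
            continue
        text = re.sub('[^A-Za-z]', ' ', value)  # non-letters become spaces
        text = re.sub(' +', ' ', text.lower())  # collapse runs of spaces
        cleaned.append(text)
    return cleaned


print(clean_chunk(['Naruto: Shippuuden', None, '07-Ghost']))
```

Inside a function registered with vaex, a loop like this could stand in for the list comprehensions over x.tolist(), with the result wrapped in pa.array(..., pa.string()) as in the question.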
For your particular case, I would propose just using vaex string methods to achieve what you want. They are super fast and convenient. Consider this example, which is similar to yours:
import vaex
# Just as an example dataset, comes with vaex
df = vaex.datasets.titanic()
# Remove numeric characters from column cabin (adjust the regex as needed)
df['cabin'] = df['cabin'].str.replace(r'\d+', '', regex=True)
# Remove white spaces from column cabin
df['cabin'] = df['cabin'].str.replace(r'\s+', '', regex=True)
# See the outcome
print(df)
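To adapt the patterns above to the question, it can help to preview them on a few plain strings first. The sketch below uses only Python's re module, with made-up sample titles and a hypothetical helper named preview. It assumes vaex's regex handling agrees with re for these simple patterns, which is worth checking on real data.

```python
import re


def preview(name):
    """Apply the question's two substitutions to a single string."""
    letters_only = re.sub(r'[^A-Za-z]', ' ', name)   # same pattern as in the question
    return re.sub(r' +', ' ', letters_only.lower())  # collapse repeated spaces


# Made-up sample titles, only for illustration
samples = ['Fullmetal Alchemist: Brotherhood', 'Steins;Gate', '!!!']
for name in samples:
    print(repr(name), '->', repr(preview(name)))
```

If the output looks right, the same two patterns could be passed to str.replace(..., regex=True) as in the titanic example, with a str.lower() call in between. Note that a title with no letters collapses to a single space, which is what the filter in the question checks for.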
Hope this helps.
Upvotes: 0