Reputation: 21
I have been trying to clean a particular column from a dataset. I am using the function .apply() multiple times in order to throw out any symbol that could be in in the string values of the column.
For each symbol, here's the function : .apply(lambda x: x.replace("", ""))
Although my code works, it is quite long and not that clean. I would like to know if there is a shorter and/or better manner of cleaning a column.
Here is my code:
df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
df_reviews['name'] = df_reviews['name'].apply(lambda x: x.replace("Review", "")).apply(lambda x: x.replace(":", "")).apply(lambda x: x.replace("'", "")).apply(lambda x: x.replace('"', "")).apply(lambda x: x.replace("#", ""))\
.apply(lambda x: x.replace("{", "")).apply(lambda x: x.replace("}", "")).apply(lambda x: x.replace("_", "")).apply(lambda x: x.replace(":", ""))
df_reviews['name'] = df_reviews['name'].str.strip()
As you can see, the many .apply() functions makes it difficult to clearly see what is getting removed from the "name" column.
Could someone help me ?
Kind regards
Upvotes: 2
Views: 1419
Reputation: 1
import pandas as pd
REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']
def process_name(name: str) -> str:
for removal_char in REMOVE_CHARS:
try:
print(f"removal char {removal_char}", name.index(removal_char))
name = name.replace(removal_char,"")
except ValueError:
continue
return name
def clean_code(df_reviews: pd.DataFrame):
# Renaming `Unnamed: 0` as `index` ; `0` as `Name`
df_reviews = df_reviews.rename(columns={"Unnamed: 0": "index", "0": "Name"})
# todo: clarification needed
# Here Name col contains a words separated by : so `expand=True` separate it into different columns
# then we just read the zeroth column
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
# Preprocessing of name column
# >> if `name` contains ["Review",":","#","{","}","_","'",'"'] remove/replace it
df_reviews['name'] = df_reviews['name'].apply(lambda x: process_name(x))
df_reviews['name'] = df_reviews['name'].str.strip()
if __name__ == "__main__":
df_reviews = pd.read_csv("reviews.csv")
Upvotes: 0
Reputation: 4105
You can also use regex:
df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)
'Review|[:\'"#{}_]'
Review
: replace the word "Review"|
: or[:\'"#{}_]
- any of these characters within the square brackets []
If you are looking to remove ALL punctuation: you can use this instead
import string
df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)
Which will remove the following characters:
!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~
Upvotes: 3
Reputation: 1097
Try this one:
df['name'] = df['name'].str.replace('Review| \:| \'|\"|\#| \_', "").str.strip()
Upvotes: 2