sewalevrai
sewalevrai

Reputation: 21

Best way to clean column in pandas

I have been trying to clean a particular column from a dataset. I am using the function .apply() multiple times in order to throw out any symbol that could be in in the string values of the column.

For each symbol, here's the function : .apply(lambda x: x.replace("", ""))

Although my code works, it is quite long and not that clean. I would like to know if there is a shorter and/or better manner of cleaning a column.

Here is my code:

df_reviews = pd.read_csv("reviews.csv")
df_reviews = df_reviews.rename(columns={"Unnamed: 0" : "index", "0" : "Name"})
df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]


df_reviews['name'] = df_reviews['name'].apply(lambda x: x.replace("Review", "")).apply(lambda x: x.replace(":", "")).apply(lambda x: x.replace("'", "")).apply(lambda x: x.replace('"', "")).apply(lambda x: x.replace("#", ""))\
                                .apply(lambda x: x.replace("{", "")).apply(lambda x: x.replace("}", "")).apply(lambda x: x.replace("_", "")).apply(lambda x: x.replace(":", ""))



df_reviews['name'] = df_reviews['name'].str.strip()

As you can see, the many .apply() functions makes it difficult to clearly see what is getting removed from the "name" column.

Could someone help me ?

Kind regards

Upvotes: 2

Views: 1419

Answers (3)

LAXMI SARKI
LAXMI SARKI

Reputation: 1

import pandas as pd

REMOVE_CHARS = ["Review", ":", "#", "{", "}", "_", "'", '"']
def process_name(name: str) -> str:
    for removal_char in REMOVE_CHARS:
        try:
            print(f"removal char {removal_char}", name.index(removal_char))
            name = name.replace(removal_char,"")
        except ValueError:
            continue
    return name

def clean_code(df_reviews: pd.DataFrame):
    # Renaming `Unnamed: 0` as `index` ; `0` as `Name`
    df_reviews = df_reviews.rename(columns={"Unnamed: 0": "index", "0": "Name"})
    # todo: clarification needed
    # Here Name col contains a words separated by : so `expand=True` separate it into different columns
    # then we just read the zeroth column
    df_reviews['name'] = df_reviews["Name"].str.split(':', expand=True)[0]
    # Preprocessing of name column
    # >> if `name` contains ["Review",":","#","{","}","_","'",'"'] remove/replace it
    df_reviews['name'] = df_reviews['name'].apply(lambda x: process_name(x))
    df_reviews['name'] = df_reviews['name'].str.strip()


if __name__ == "__main__":
    df_reviews = pd.read_csv("reviews.csv")

Upvotes: 0

ScottC
ScottC

Reputation: 4105

You can also use regex:

df_reviews['name'] = df_reviews['name'].str.replace('Review|[:\'"#{}_]', "", regex=True)


Regex pattern:

'Review|[:\'"#{}_]'
  • Review : replace the word "Review"
  • | : or
  • [:\'"#{}_] - any of these characters within the square brackets []


Note:

If you are looking to remove ALL punctuation: you can use this instead

import string

df_reviews['name'] = df_reviews['name'].str.replace(f'Review|[{string.punctuation}]', "", regex=True)

Which will remove the following characters:

!"#$%&\'()*+,-./:;<=>?@[\\]^_`{|}~

Upvotes: 3

ali bakhtiari
ali bakhtiari

Reputation: 1097

Try this one:

df['name'] = df['name'].str.replace('Review| \:| \'|\"|\#| \_', "").str.strip()

Upvotes: 2

Related Questions