plshelpme_

Reputation: 57

How to remove special characters from csv using pandas

I'm currently cleaning data from a CSV file. I've successfully made everything lowercase and removed stopwords, punctuation, etc., but I still need to remove special characters. For example, the CSV file contains things such as 'César' '‘disgrace’'. If there is a way to replace these characters, even better, but I am fine with removing them. Below is the code I have so far.

import pandas as pd
from nltk.corpus import stopwords
import string
from nltk.stem import WordNetLemmatizer

lemma = WordNetLemmatizer()

df = pd.read_csv('soccer.csv', encoding='utf-8')

df.columns = ['post_id', 'post_title', 'subreddit']
df['post_title'] = df['post_title'].str.lower().str.replace(r'[^\w\s]+', '', regex=True).str.split()


stop = stopwords.words('english')

df['post_title'] = df['post_title'].apply(lambda x: [item for item in x if item not in stop])

df['post_title']= df['post_title'].apply(lambda x : [lemma.lemmatize(y) for y in x])


df.to_csv('clean_soccer.csv')

Upvotes: 3

Views: 9358

Answers (3)

VnC

Reputation: 2026

When saving the file try:

df.to_csv('clean_soccer.csv', encoding='utf-8-sig')

or simply

df.to_csv('clean_soccer.csv', encoding='utf-8')
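For context, here is a small sketch of what the two encodings do, assuming the odd characters come from an encoding mismatch rather than from the data itself. 'utf-8-sig' writes the same bytes as 'utf-8' but prepends a byte order mark, which some spreadsheet programs use to detect UTF-8. The last line shows how UTF-8 bytes look when a reader wrongly assumes cp1252 (that reader encoding is just an illustrative assumption).

```python
text = 'César'

# Plain UTF-8: 'é' becomes the two bytes 0xC3 0xA9.
utf8_bytes = text.encode('utf-8')

# UTF-8 with signature: same bytes, prefixed with the BOM 0xEF 0xBB 0xBF.
bom_bytes = text.encode('utf-8-sig')

# What a program sees if it decodes the UTF-8 bytes as cp1252 instead.
misread = utf8_bytes.decode('cp1252')

print(utf8_bytes)  # b'C\xc3\xa9sar'
print(bom_bytes)   # b'\xef\xbb\xbfC\xc3\xa9sar'
print(misread)     # CÃ©sar
```

If the saved file shows text like 'CÃ©sar' when opened, the encoding on write or read is the likely culprit, and nothing needs to be removed from the data.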

Upvotes: 2

xibalba1

Reputation: 544

As an alternative to other answers, you could use string.printable:

import string

printable = set(string.printable)

def remove_spec_chars(in_str):
    return ''.join([c for c in in_str if c in printable])

# apply() returns a new Series, so assign it back to keep the result
df['post_title'] = df['post_title'].apply(remove_spec_chars)

For reference, string.printable is a fixed, ASCII-only constant: the combination of digits, ascii_letters, punctuation, and whitespace.

For your example strings 'César' and '‘disgrace’', this function returns 'Csar' and 'disgrace'.
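A quick standalone check of that behaviour, repeating the function so the snippet runs on its own. The third sample string is made up for illustration; it shows that ordinary ASCII punctuation and spaces are kept, since they are part of string.printable.

```python
import string

# Fixed ASCII-only set: digits, ascii_letters, punctuation, whitespace.
printable = set(string.printable)

def remove_spec_chars(in_str):
    # Keep only the characters that appear in string.printable.
    return ''.join(c for c in in_str if c in printable)

print(remove_spec_chars('César'))               # Csar
print(remove_spec_chars('‘disgrace’'))          # disgrace
print(remove_spec_chars("it's a 'test', ok?"))  # it's a 'test', ok?
```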

https://docs.python.org/3/library/string.html
How can I remove non-ASCII characters but leave periods and spaces using Python?

Upvotes: 1

charlzee

Reputation: 26

I'm not sure if there's an easy way to replace the special characters, but I know how you can remove them. Try using:

df['post_title'] = df['post_title'].str.replace(r'[^A-Za-z0-9]+', '', regex=True)

That should replace 'César' '‘disgrace’' with 'Csardisgrace'. Hope this helps.
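On replacing rather than removing: one common approach, offered here as a sketch and not as part of the answer above, is Unicode normalization from the standard library. NFKD splits an accented letter into its base letter plus a combining mark, and encoding to ASCII with errors='ignore' then drops the mark. The helper name to_ascii is hypothetical.

```python
import unicodedata

def to_ascii(text):
    """Map accented letters to their base letter; drop other non-ASCII characters."""
    # NFKD decomposes 'é' into 'e' followed by a combining acute accent.
    decomposed = unicodedata.normalize('NFKD', text)
    # errors='ignore' discards the combining marks and any character that has
    # no ASCII decomposition, such as curly quotes.
    return decomposed.encode('ascii', errors='ignore').decode('ascii')

print(to_ascii('César'))       # Cesar
print(to_ascii('‘disgrace’'))  # disgrace
```

On a column of strings this could be applied with df['post_title'].apply(to_ascii). Unlike the regex above, it leaves spaces and ASCII punctuation in place, so it would need to run before the title is split into words.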

Upvotes: 0
