324
324

Reputation: 724

Removing _x000D_ from Text Records in Pandas Dataframe

I have a Pandas data frame that I imported from an Excel file. One of the columns looks like the following:

Link
=========
A-324
A-76_x000D_\nA-676
A-95
A-95_x00D_n\nA-495
...

I was able to use regex to remove the \n characters, but I am unable to remove _x000D_. Does anyone know what this is? Why am I unable to use traditional remove methods?

Here is what I've done:

dat['Link'] = dat['Link'].replace("_x000D_", " ")
dat['Link'] = dat['Link'].replace(r'\s+|\\n', ' ', regex=True) 

Upvotes: 2

Views: 8290

Answers (4)

Archimedes Trajano
Archimedes Trajano

Reputation: 41560

Because I don't know where in the sheet this may crop up

df = pd.read_excel(xlsx, sheet_name=sheet)
# Replace all occurrences of '_x000D_' substring in the entire DataFrame
df.replace(to_replace=r"_x000D_", value="", regex=True, inplace=True)

This is Pandas 2

Upvotes: 0

Lukas
Lukas

Reputation: 2312

In case you need to handle all string columns in dataframe (like I needed):

for str_col in df.select_dtypes(include=['object']).columns:
    df[str_col] = df[str_col].astype(str).apply(openpyxl.utils.escape.unescape)

or the same with the replace() function

Upvotes: 3

BenCaldwell
BenCaldwell

Reputation: 443

I also came across this problem when loading text columns containing carriage returns from xlsx files.

This solved it for me:

df['my_col'] = df['my_col'].astype(str).apply(openpyxl.utils.escape.unescape)

Upvotes: 4

imburningbabe
imburningbabe

Reputation: 792

You have to add .str because you are not working directly with a string, but with a Series:

dat = pd.DataFrame({'Link':['A-324','A-76_x000D_\nA-676','A-95','A-95_x000D_n\nA-495']})

dat['Link'] = dat['Link'].str.replace("_x000D_", " ")
dat['Link'] = dat['Link'].replace(r'\s+|\\n', ' ', regex=True) 

Output:

           Link
0         A-324
1    A-76 A-676
2          A-95
3  A-95 n A-495

Upvotes: 2

Related Questions