Reputation: 724
I have a Pandas data frame that I imported from an Excel file. One of the columns looks like the following:
Link
=========
A-324
A-76_x000D_\nA-676
A-95
A-95_x000D_n\nA-495
...
I was able to use regex to remove the \n characters, but I am unable to remove _x000D_. Does anyone know what this is? Why am I unable to use traditional remove methods?
Here is what I've done:
dat['Link'] = dat['Link'].replace("_x000D_", " ")
dat['Link'] = dat['Link'].replace(r'\s+|\\n', ' ', regex=True)
Upvotes: 2
Views: 8290
Reputation: 41560
Because I don't know where in the sheet this may crop up, I replace it across the entire DataFrame:
df = pd.read_excel(xlsx, sheet_name=sheet)
# Replace all occurrences of '_x000D_' substring in the entire DataFrame
df.replace(to_replace=r"_x000D_", value="", regex=True, inplace=True)
This works with Pandas 2.
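A minimal, self-contained sketch of this approach (the column name and sample values are made up to mimic the question's data):

```python
import pandas as pd

# Toy stand-in for the imported sheet
df = pd.DataFrame({'Link': ['A-324', 'A-76_x000D_\nA-676']})

# Strip the '_x000D_' escape everywhere, then collapse leftover whitespace
df.replace(to_replace=r'_x000D_', value='', regex=True, inplace=True)
df.replace(to_replace=r'\s+', value=' ', regex=True, inplace=True)

print(df['Link'].tolist())  # ['A-324', 'A-76 A-676']
```

With regex=True, DataFrame.replace does substring replacement across every cell, so you don't need to know which columns are affected.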
Upvotes: 0
Reputation: 2312
In case you need to handle all string columns in the dataframe (like I needed):
import openpyxl.utils.escape

for str_col in df.select_dtypes(include=['object']).columns:
    df[str_col] = df[str_col].astype(str).apply(openpyxl.utils.escape.unescape)
or do the same with the replace() function.
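A sketch of the replace() variant: it reproduces what openpyxl's unescape does, decoding any OOXML _xHHHH_ escape back to the character with code point 0xHHHH (the column names here are made up):

```python
import pandas as pd

df = pd.DataFrame({'Link': ['A-76_x000D_\nA-676'], 'count': [1]})

# Decode every OOXML escape _xHHHH_ to chr(0xHHHH); for '_x000D_'
# that restores the original carriage return '\r'
for str_col in df.select_dtypes(include=['object']).columns:
    df[str_col] = df[str_col].str.replace(
        r'_x([0-9A-Fa-f]{4})_',
        lambda m: chr(int(m.group(1), 16)),
        regex=True,
    )

print(repr(df['Link'].iloc[0]))  # 'A-76\r\nA-676'
```

Series.str.replace accepts a callable replacement when regex=True, which is what makes the hex decoding possible in one pass.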
Upvotes: 3
Reputation: 443
I also came across this problem when loading text columns containing carriage returns from xlsx files.
This solved it for me:
import openpyxl.utils.escape

df['my_col'] = df['my_col'].astype(str).apply(openpyxl.utils.escape.unescape)
Upvotes: 4
Reputation: 792
You have to add .str because you are not working directly with a string, but with a Series:
dat = pd.DataFrame({'Link':['A-324','A-76_x000D_\nA-676','A-95','A-95_x000D_n\nA-495']})
dat['Link'] = dat['Link'].str.replace("_x000D_", " ")
dat['Link'] = dat['Link'].replace(r'\s+|\\n', ' ', regex=True)
Output:
Link
0 A-324
1 A-76 A-676
2 A-95
3 A-95 n A-495
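As for what _x000D_ is: it is the OOXML escape that .xlsx files use for the carriage return character (code point 0x0D), so Windows '\r\n' line endings come through as '_x000D_' followed by '\n'. That also means the two replace steps can be collapsed into a single regex pass (a sketch, reusing sample values like those above):

```python
import pandas as pd

dat = pd.DataFrame({'Link': ['A-324', 'A-76_x000D_\nA-676', 'A-95']})

# Treat the '_x000D_' escape like any other whitespace and collapse
# each run of escapes/whitespace into a single space
dat['Link'] = dat['Link'].str.replace(r'(?:_x000D_|\s)+', ' ', regex=True)

print(dat['Link'].tolist())  # ['A-324', 'A-76 A-676', 'A-95']
```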
Upvotes: 2