Salma Kastalli

Reputation: 31

Problem after using the pd.concat() function - Encoding latin-1 - German Characters

I have a dataset containing German characters. When importing 11 datasets with encoding 'ISO-8859-1' (or 'latin-1'), the column of German city names looks fine, but after combining them with pd.concat() using the same encoding, the city names change.

encoding='utf-8-sig' does not work either, and the output of

import sys
print(sys.stdout.encoding)

is UTF-8.

import pandas as pd

# Importing the dataset
skill1 = pd.read_csv('./SkillWizardCityAerospaceEngineering26april.csv', encoding='ISO-8859-1')
skill2 = pd.read_csv('./SkillWizardCityBeautyandCosmetics26april.csv', encoding='ISO-8859-1')
skill3 = pd.read_csv('./SkillWizardCityBusinessSuportFunction26april.csv', encoding='ISO-8859-1')

For each of these CSVs,

print(skill1.CityName.unique())

it gives ['Berlin' 'Cologne' 'Dortmund' 'Düsseldorf' 'Frankfurt' 'Halle' 'Hamburg' 'Hannover' 'Heidelberg' 'Ingolstadt' 'Kiel' 'Mannheim' 'Munich' 'Norderstedt' 'Regensburg' 'Stuttgart']

After the merge:

import glob

extension = 'csv'
all_filenames = [i for i in glob.glob('*.{}'.format(extension))]
# Combine all files in the list
combined_csv = pd.concat([pd.read_csv(f, encoding='ISO-8859-1') for f in all_filenames], sort=True)
combined_csv.CityName.unique()

the result is: array(['Berlin', 'Cologne', 'Dortmund', 'Düsseldorf', 'Frankfurt', 'Halle', 'Hamburg', 'Hannover', 'Heidelberg', 'Ingolstadt', 'Kiel', 'Mannheim', 'Munich', 'Norderstedt', 'Regensburg', 'Stuttgart', 'Augsburg', 'Bielefeld', 'Bonn', 'Bremen', 'Chemnitz', 'Darmstadt', 'Dresden', 'Erfurt', 'Essen', 'Göttingen', 'Leipzig', 'Mainz',

I expect it to be ['Berlin' 'Cologne' 'Dortmund' 'Düsseldorf' 'Frankfurt' 'Halle' 'Hamburg' 'Hannover' 'Heidelberg' 'Ingolstadt' 'Kiel' 'Mannheim' 'Munich' 'Norderstedt' 'Regensburg' 'Stuttgart']. I don't want to change the names manually; I want a solution that preserves all special characters, since other columns were affected as well.

Upvotes: 2

Views: 1314

Answers (2)

I had the same issue with pd.concat() when concatenating 25 large CSV files in a similar loop. In my case they contained Spanish characters ('ñ' and 'ón'). As Serge implied, this happens because some of your CSVs (or parts of them) do not follow the encoding you are passing to pandas' read_csv option (encoding='ISO-8859-1').

This worked for me (a not-so-quick-and-dirty solution, I guess):

After creating the "corrupted" concatenated file, I reopened it with Python's built-in open function, passing errors='backslashreplace':

file_name = open(r'D:\combined.csv', encoding='utf-8', errors='backslashreplace')
combined = pd.read_csv(file_name)

I then saved the new 'combined' csv file with:

combined.to_csv(r"D:\combined.csv", encoding='utf-8', index=False)

The following snippet prints the encoding Python used to open the file (note: this is the platform default, not a detection of the file's actual encoding, but it can still help spot mismatches):

with open(r'D:\one.csv') as f:
    print(f)

Alternatively, you could find out which files do not respond well to pd.read_csv's 'ISO-8859-1' (Latin-1) option and then correct them individually before the concat. But if you have many files, that option may be punitive.
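A stdlib-only sketch of that per-file check might look like the following (the helper name is made up, and note the caveat: ISO-8859-1 decodes any byte sequence without error, so it must come last and only acts as a fallback when UTF-8 fails):

```python
import glob

def guess_encoding(path, candidates=("utf-8", "ISO-8859-1")):
    # Return the first candidate encoding that decodes the whole file cleanly.
    # ISO-8859-1 maps every byte to some character and never raises, so it
    # only "wins" when the stricter UTF-8 decode has already failed.
    raw = open(path, "rb").read()
    for enc in candidates:
        try:
            raw.decode(enc)
            return enc
        except UnicodeDecodeError:
            continue
    return None

for name in glob.glob("*.csv"):
    print(name, guess_encoding(name))
```

Files reported as 'utf-8' here are the ones being mangled by a blanket encoding='ISO-8859-1' in read_csv.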

Upvotes: 0

Serge Ballesta

Reputation: 149145

This is not a direct answer, but it may help to better identify the problem, and since it contains code it is posted as an answer.

My advice would be to try to identify the origin of the badly encoded names. Assuming you have no column named "FILE_ORIGIN", I would try something like:

def build_csv(file):
    df = pd.read_csv(file, encoding='ISO-8859-1')
    df['FILE_ORIGIN'] = file
    return df

combined_csv = pd.concat([build_csv(f) for f in all_filenames], sort=True)
combined_csv.loc[:, ['CityName', 'FILE_ORIGIN']].groupby('CityName').agg(lambda x: ','.join(x))

That should show, for each city name, which files it comes from, and hence where the badly encoded names originate.

Upvotes: 1
