user2772056

Reputation: 205

Identify in list what data is not in UTF-8 format

I am ingesting data from CSV files into my Postgres DB through AWS, and I am running into problems because some of the data is not valid UTF-8. I want to identify which rows in my data are causing the issue so that I can fix them at the source.

I have been trying to use chardet to get what I need, but I can't seem to get it to output the encoding row by row. I have also tried the code below which, like chardet, tells me whether the whole file decodes with a particular encoding, but not which rows are causing the issue:

import codecs

# Try each candidate encoding until one decodes the whole file
encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        fh = codecs.open('filename.csv', 'r', encoding=e)
        fh.readlines()  # forces a full decode; raises on the first bad byte
        fh.seek(0)      # rewind so fh can be reused after the loop
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding:  %s ' % e)
        break

Any help appreciated!

Example text causing issue below:

Aĺi
Hùssaini
Buķar
Falmatàmi
Mohammad  Bùlama
Mùstapaha
Maiďugu
Shu"ibu
Aĺi
Àdamu
Ja"o
Khaìcalla
Makanikì
Alì
Mòhammad
Zaŕami
Dànkabo
Kelĺumi
Umàra
Goŕoma
Gaptomì
Àli
Àhmed
Àbdullahi
Shafi"u
Mohammed!
Hassañ
Aùwal
Usaìni
Mohàmmed
Goniķaka
Abdullé
ÑGABGL17401598
MUSA ĶAUMI
NGAMCH17051ĺ535
NGBOJEGB1708ààaaÅQààp
NGYOGJBY3215`
NGBOJEAG1709Ź
ÙNGBOKD1T17090240
ÑGBOMDMK17100381
ÑGBOMDMK17100382
ÑGBOMDMK17100383
ÑGBOMDMK17100384
ÑGBOMDMK17100385
ÑGBOMDMK17100387
ÑGBOMDMK17100388
ÑGBOMDMK17100389
ÑGBOMDMK17100390
ÑGBOMDMK17100392
ÑGBOMDMK17100393
ÑGBOMDMK17100394
ÑGBOMDMK17100395
ÑGBOMDMK17100396
ÑGBOMDMK17100397
ÑGBOMDMK17100398
ÑGBOMDMK17100399
ÑGBOMDMK17100400
ÑGBOMDMK17100401
ÑGBOMDMK17100402
ÑGBOMDMK17100403
ÑGBOMDMK17100419
Yyģggghyuuiiiuyttttrrrrrrŕ
NĢBÒJEGM17100245
NĢBÒJEGM17100479
NĢBÒJEGM17100493
NĢBÒJEGM17100495
NĢBÒJEGM17100524
NĢBÒJEGM17100525
ÑGYOGJGJ122112
ÑGYOYFKG3824
Ngyoýfmy4736
NGBOJEFC1804aaà
NGBOJEFC1804à8131
NGYÒGDAKW0717
NGYÒGDAK20609
NGBÒMMST19056545
NGBOMDNY88J00233!!
ÀNGYOGDAK21907436
NGBODAAC19110390]
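
For reference, the row-by-row check I am after could be sketched roughly like this. It is only a minimal sketch under some assumptions: the function name `find_non_utf8_lines` is made up for illustration, the file is read in binary mode so nothing is decoded on read, and each physical line is treated as one row (a quoted CSV field containing a newline would span several lines, and a non-ASCII-compatible encoding such as UTF-16 would not split correctly this way).

```python
def find_non_utf8_lines(path):
    """Return (line_number, raw_bytes) for every line that is not valid UTF-8.

    Hypothetical helper: assumes one CSV row per physical line.
    """
    bad = []
    with open(path, 'rb') as fh:  # binary mode: bytes are read undecoded
        for lineno, raw in enumerate(fh, start=1):
            try:
                raw.decode('utf-8')  # strict decode of this line only
            except UnicodeDecodeError:
                bad.append((lineno, raw))
    return bad


# Example usage (the path is a placeholder):
# for lineno, raw in find_non_utf8_lines('filename.csv'):
#     print(lineno, raw)
```

Printing the raw bytes rather than a decoded string keeps the offending byte values visible, which should help when tracing them back to the source system.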

Upvotes: 0

Views: 249

Answers (0)
