Reputation: 205
I am ingesting data from CSV files into my Postgres DB through AWS, and I am running into issues because some of the data is not valid UTF-8. I want to identify which rows are causing the problem so that I can fix them at the source.
I have been trying to use chardet, but I can't get it to report the encoding row by row. I have also tried the code below which, like chardet, tells me whether the whole file can be read with a particular encoding, but not which rows are causing the issue:
import codecs

encodings = ['utf-8', 'windows-1250', 'windows-1252']  # add more
for e in encodings:
    try:
        # Read the whole file so that any decoding error surfaces here
        fh = codecs.open('filename.csv', 'r', encoding=e)
        fh.readlines()
        fh.seek(0)
    except UnicodeDecodeError:
        print('got unicode error with %s , trying different encoding' % e)
    else:
        print('opening the file with encoding: %s ' % e)
        break
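What I am after is roughly a row-by-row version of that check. Here is a minimal sketch of the idea, assuming the file is opened in binary mode and each physical line is decoded on its own. The function name `find_non_utf8_rows` is my own placeholder, not something from chardet or any other library:

```python
def find_non_utf8_rows(path):
    """Return (line_number, raw_bytes, byte_offset) for each line that is not valid UTF-8.

    Hypothetical helper for illustration; line numbers are 1-based and count
    physical lines, so a quoted CSV field with embedded newlines spans several.
    """
    bad_rows = []
    # Binary mode: nothing is decoded while reading, so no error is raised here
    with open(path, 'rb') as fh:
        for line_number, raw in enumerate(fh, start=1):
            try:
                raw.decode('utf-8')
            except UnicodeDecodeError as err:
                # err.start is the offset of the first offending byte in this line
                bad_rows.append((line_number, raw, err.start))
    return bad_rows
```

Looping over the result of `find_non_utf8_rows('filename.csv')` and printing each tuple would list the offending line numbers together with their raw bytes. As far as I can tell, this only catches byte sequences that are invalid UTF-8; it would not flag text that decodes cleanly but contains the wrong characters.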
Any help appreciated!
Example text causing issue below:
Aĺi Hùssaini Buķar Falmatàmi Mohammad Bùlama Mùstapaha Maiďugu Shu"ibu Aĺi Àdamu Ja"o Khaìcalla Makanikì Alì Mòhammad Zaŕami Dànkabo Kelĺumi Umàra Goŕoma Gaptomì Àli Àhmed Àbdullahi Shafi"u Mohammed! Hassañ Aùwal Usaìni Mohàmmed Goniķaka Abdullé ÑGABGL17401598 MUSA ĶAUMI NGAMCH17051ĺ535 NGBOJEGB1708ààaaÅQààp NGYOGJBY3215` NGBOJEAG1709Ź ÙNGBOKD1T17090240 ÑGBOMDMK17100381 ÑGBOMDMK17100382 ÑGBOMDMK17100383 ÑGBOMDMK17100384 ÑGBOMDMK17100385 ÑGBOMDMK17100387 ÑGBOMDMK17100388 ÑGBOMDMK17100389 ÑGBOMDMK17100390 ÑGBOMDMK17100392 ÑGBOMDMK17100393 ÑGBOMDMK17100394 ÑGBOMDMK17100395 ÑGBOMDMK17100396 ÑGBOMDMK17100397 ÑGBOMDMK17100398 ÑGBOMDMK17100399 ÑGBOMDMK17100400 ÑGBOMDMK17100401 ÑGBOMDMK17100402 ÑGBOMDMK17100403 ÑGBOMDMK17100419 Yyģggghyuuiiiuyttttrrrrrrŕ NĢBÒJEGM17100245 NĢBÒJEGM17100479 NĢBÒJEGM17100493 NĢBÒJEGM17100495 NĢBÒJEGM17100524 NĢBÒJEGM17100525 ÑGYOGJGJ122112 ÑGYOYFKG3824 Ngyoýfmy4736 NGBOJEFC1804aaà NGBOJEFC1804à8131 NGYÒGDAKW0717 NGYÒGDAK20609 NGBÒMMST19056545 NGBOMDNY88J00233!! ÀNGYOGDAK21907436 NGBODAAC19110390]
Upvotes: 0
Views: 249