Reputation: 189
I am trying to open en transform several DBF files to a dataframe. Most of them worked fine, but for one of the files I receive the error: "UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 15: invalid start byte"
I have read this error on some other topics such as opening csv and xlsx and other files. The proposed solution was to include encoding = 'utf-8'
in the reading the file part. I haven't found a solution for DBF files unfortunately and I have very limited knowledge on DBF files.
What I have tried so far:
1)
from dbfread import DBF
dbf = DBF('file.DBF')
dbf = pd.DataFrame(dbf)
UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 8: character maps to <undefined>
2)
from simpledbf import Dbf5
dbf = Dbf5('file.DBF')
dbf = dbf.to_dataframe()
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xf6 in position 15: invalid start byte
3)
# this block of code copied from https://gist.github.com/ryan-hill/f90b1c68f60d12baea81
import pysal as ps
def dbf2DF(dbfile, upper=True): #Reads in DBF files and returns Pandas DF
db = ps.table(dbfile) #Pysal to open DBF
d = {col: db.by_col(col) for col in db.header} #Convert dbf to dictionary
#pandasDF = pd.DataFrame(db[:]) #Convert to Pandas DF
pandasDF = pd.DataFrame(d) #Convert to Pandas DF
if upper == True: #Make columns uppercase if wanted
pandasDF.columns = map(str.upper, db.header)
db.close()
return pandasDF
dfb = dbf2DF('file.DBF')
AttributeError: module 'pysal' has no attribute 'open'
And last, if I try to install the dbfpy
module, I receive:
SyntaxError: invalid syntax
Any suggestions on how to solve this?
Upvotes: 5
Views: 19008
Reputation: 1
import dbfread
import pandas as pd
dbf = dbfread.DBF('file.dbf')
print(dbf.encoding) #this will print the dbf encoding
dbf.encoding = 'utf8' #just change it to utf-8
pd.DataFrame(dbf)
It's possible to access a dbf encoding and change it to utf-8.
Upvotes: 0
Reputation: 1
For all those who helped me on this issue for myself where I had to fix a corrupt .dbf file (so came from a .dbf and had to be returned to a .dbf). My particular issue was dates throughout the .dbf were... just very wrong... and tried and failed via many methods, with many errors, to crack and reassemble it... before succeeding with the below:
#Modify dbase3 file to recast null date fields as a default date and
#reimport back into dbase3 file
import collections
import datetime
from typing import OrderedDict
import dbf as dbf1
from simpledbf import Dbf5
from dbfread import DBF, FieldParser
import pandas as pd
import numpy as np
#Default date to overwrite NaN values
blank_date = datetime.date(1900, 1, 1)
#Read in dbase file from Old Path and point to new Path
old_path = r"C:\...\ex.dbf"
new_path = r"C:\...\newex.dbf"
#Establish 1st rule for resolving corrupted dates
class MyFieldParser(FieldParser):
def parse(self, field, data):
try:
return FieldParser.parse(self, field, data)
except ValueError:
return blank_date
#Collect the original .DBF data while stepping over any errors
table = DBF(old_path, None, True, False, MyFieldParser, collections.OrderedDict, False, False, False,'ignore')
#Grab the Header Name, Old School Variable Format, and number of characters/length for each variable
dbfh = Dbf5(old_path, codec='utf-8')
headers = dbfh.fields
hdct = {x[0]: x[1:] for x in headers}
hdct.pop('DeletionFlag')
keys = hdct.keys()
#Position of Type and Length relative to field name
ftype = 0
characters = 1
# Reformat and join all old school DBF Header fields in required format
fields = list()
for key in keys:
ftemp = hdct.get(key)
k1 = str(key)
res1 = ftemp[ftype]
res2 = ftemp[characters]
if k1 == "decimal_field_name":
fields.append(k1 + " " + res1 + "(" + str(res2) + ",2)")
elif res1 == 'N':
fields.append(k1 + " " + res1 + "(" + str(res2) + ",0)")
elif res1 == 'D':
fields.append(k1 + " " + res1)
elif res1 == 'L':
fields.append(k1 + " " + res1)
else:
fields.append(k1 + " " + res1 + "(" + str(res2) + ")")
addfields = '; '.join(str(f) for f in fields)
#load the records of the.dbf into a dataframe
df = pd.DataFrame(iter(table))
#go ham reformatting date fields to ensure they are in the correct format
df['DATE_FIELD1'] = df['DATE_FIELD1'].replace(np.nan, blank_date)
df['DATE_FIELD1'] = pd.to_datetime(df['DATE_FIELD1'])
# eliminate further errors in the dataframe
df = df.fillna('0')
#drop added "record index" field from dataframe
df.set_index('existing_primary_key', inplace=False)
#initialize defaulttdict and convert the dataframe into a .DBF appendable format
dd = collections.defaultdict(list)
records = df.to_dict('records',into=dd)
#create the new .DBF file
new_table = dbf1.Table(new_path, addfields)
#append the dataframe to the new .DBF file
new_table.open(mode=dbf1.READ_WRITE)
for record in records:
new_table.append(record)
new_table.close()
Upvotes: 0
Reputation: 502
install library DBF
conda install DBF
from dbfread import DBF
db_in_dbf = DBF('paht/database.dbf)
this line uplodad the database
df = pd.DataFrame(db_in_dbf )
this line converts a dataframe of pandas
Upvotes: 0
Reputation: 502
from simpledbf import Dbf5
dbf2 = Dbf5('/Users/.../TCAT_MUNICIPIOS.dbf', codec='latin')
df2 = dbf2.to_dataframe()
df2.head(3)
Upvotes: 0
Reputation: 69288
Try using my dbf
library:
import dbf
table = dbf.Table('file.DBF')
Print it to see if an encoding is present in the file:
print table # print(table) in Python 3
One of my test tables looks like this:
Table: tempy.dbf
Type: dBase III Plus
Codepage: ascii (plain ol ascii)
Status: DbfStatus.CLOSED
Last updated: 2019-07-26
Record count: 1
Field count: 2
Record length: 31
--Fields--
0) name C(20)
1) desc M
The important line being the Codepage
line -- it sounds like that is not properly set for your DBF
file. If you know what it should be, you can either open it with that codepage (temporarily) with:
table = dbf.Table('file.DBF', codepage='...')
Or you can change it permanently (updates the DBF
file) with:
table.open()
table.codepage = dbf.CodePage('cp1252') # for example
table.close()
Upvotes: 4