I am new to Python and am writing a program that reads values from a .csv file, then displays a graph comparing the test results with the expected distribution under Benford's Law.
The .csv file has loan values in the first column, which is the column I need to read, like below:
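For reference, Benford's Law gives the expected fraction of values with leading digit d as log10(1 + 1/d). A minimal sketch of that expected distribution follows; the function name `benford_expected` is hypothetical and is not part of the program described here.

```python
import math


def benford_expected():
    """Return the expected leading-digit fractions for digits 1..9."""
    # P(d) = log10(1 + 1/d); the nine fractions sum to 1
    return [math.log10(1 + 1 / d) for d in range(1, 10)]


expected = benford_expected()
for digit, fraction in enumerate(expected, start=1):
    print(digit, round(fraction, 3))
```

The list is ordered by digit, so `expected[0]` is the fraction for a leading digit of 1.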
    Values    Leading Digit    Number of occurrences
    170       1                88
    900       9                62
    250       2                44
    450       4                51
    125       1                19
    .....
The main file, app.py:
    ...
    filename = filedialog.askopenfilename(filetypes=(
        ("Excel files", "*.csv"), ("All files", "*.*")))
    print(filename)
    try:
        with open(filename, 'rt') as csvfile:
            reader = csv.reader(csvfile, delimiter=',')
            next(reader, None)  # skip the headers
            for row in reader:
                minutePriceCloses.append(row[0])
        # calculate the percentage distribution of leading digits
        benford_test_data_dist = calc.getBenfordDist(minutePriceChanges)
    ....
In calc.py:
    import numpy as np

    def getBenfordDist(data):
        # set initial dist to zero
        dist = [0, 0, 0, 0, 0, 0, 0, 0, 0]
        # for each figure, check what the first non-zero digit is, hacky multiply
        # by 1000000 to handle small values
        for d in data:
            # sneaky multiply by 1000000 to ensure that the leading digit is unlikely to be zero
            # since benfords law is assumed to relate somehow to scale invariance, this *SHOULDN'T* make a difference
            # but it might, so this might all be wrong :-)
            s = str(np.abs(d) * 1000000)
            for i in range(0, 8):
                if(s.startswith(str(i + 1))):
                    dist[i] = dist[i] + 1
                    break
        # return fractions of the total for each digit
        percentDist = []
        # convert to % - todo, start using numpy vectors that allow scalar mult/div
        for count in dist:
            percentDist.append(float(count) / len(data))
            # print(float(count))
        return percentDist
The problem is that the graph does not show the correct percentages. Each percentage should be the number of values with a given leading digit divided by the total number of rows that have values. For example, for values with a leading digit of 1, the graph should show 0.25, and so on. There are 352 rows.
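The calculation described above could be sketched as follows. This is a minimal sketch under stated assumptions, not the original code: the function name `leading_digit_fractions` and the sample list are hypothetical. It assumes the values arrive as strings, which is what `csv.reader` yields, and it checks all nine digits, whereas `range(0, 8)` in `getBenfordDist` covers only digits 1 through 8.

```python
def leading_digit_fractions(values):
    """Return the fraction of values whose first non-zero digit is 1..9."""
    counts = [0] * 9
    total = 0
    for raw in values:
        text = str(raw).strip()
        try:
            number = float(text)  # csv.reader yields strings, so validate first
        except ValueError:
            continue  # skip blank or non-numeric cells
        if number == 0:
            continue  # zero has no leading non-zero digit
        # The first character in 1..9 is the leading significant digit
        first = next((ch for ch in text if ch in "123456789"), None)
        if first is None:  # e.g. 'inf' or 'nan'
            continue
        counts[int(first) - 1] += 1
        total += 1
    if total == 0:
        return [0.0] * 9
    # Divide by the number of rows that actually held a usable value
    return [count / total for count in counts]


# Hypothetical sample taken from the first column shown in the question
sample = ["170", "900", "250", "450", "125"]
fractions = leading_digit_fractions(sample)
print(fractions)
```

Scanning the text for the first non-zero digit avoids the multiply-by-1000000 step, so small values such as 0.0045 are handled without rescaling.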
Please help. Thanks