Reputation: 53
I am trying to read everything in an Excel document that looks like this:
When we read the Excel file with
xls = pd.ExcelFile('media/No2.xls')
we get a problem in column 2. Instead of giving us:
code
444444444505
444444444506
444444444506
777777777001
000000000025
000000000029
000000000044
000000000045
code
111111111111
111111111127
000000001341
it gives us this:
code
444444444505
444444444506
444444444506
777777777001
25
29
44
45
code
111111111111
111111111127
1341
We need 000000001341 to stay as 000000001341, not 1341 (and the same for the other numbers with leading zeros). To fix this issue we tried, well, everything we could find:
dtype='str' \ str \ 'string' \ object \ 'object' \ whatever
converters={"2": object\string}, converters = object\string.
In the end? No changes. It still reads 000000001341 as an integer and outputs it as 1341.
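For completeness, a minimal sketch of the kind of calls we tried (the path and column index are the ones from above); none of them preserved the zeros:
import pandas as pd

# both attempts still return the codes as plain integers, e.g. 1341 instead of 000000001341
xls = pd.read_excel('media/No2.xls', dtype=str)
xls = pd.read_excel('media/No2.xls', converters={2: str})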
Upvotes: 1
Views: 4710
Reputation: 28699
One option is to read in the data as individual cells and manipulate it to your specification to get your adjusted values. This is one way of doing it; I believe you can optimise it further. I will be using xlsx_cells from pyjanitor to read in the data:
# pip install pyjanitor
import numpy as np
import pandas as pd
import janitor as jn

content = jn.xlsx_cells('Downloads/TestDocument.xlsx', include_blank_cells=False)
numbers = content.loc[content.number_format.ne('General') & content.data_type.eq('n')]
In [65]: numbers
Out[65]:
value internal_value coordinate row column data_type is_date number_format
38 1 1 B24 24 2 n False 0
39 2 2 C24 24 3 n False 0
40 3 3 G24 24 7 n False 0
41 9 9 T24 24 20 n False 0
42 10 10 W24 24 23 n False 0
43 11 11 Y24 24 25 n False 0
44 12 12 AB24 24 28 n False 0
45 13 13 AG24 24 33 n False 0
46 14 14 AI24 24 35 n False 0
47 15 15 AL24 24 38 n False 0
48 1 1 B25 25 2 n False 0
49 2 2 B26 26 2 n False 0
51 1341 1341 G26 26 7 n False 00000000000
52 6 6 W26 26 23 n False 0.000
53 81.9 81.9 Y26 26 25 n False 0.00
54 491.4 491.4 AL26 26 38 n False 0.00
55 3 3 B27 27 2 n False 0
56 4 4 B28 28 2 n False 0
57 5 5 B29 29 2 n False 0
58 6 6 B30 30 2 n False 0
60 3005 3005 G30 30 7 n False 00000000000
61 12 12 W30 30 23 n False 0.000
62 59.4 59.4 Y30 30 25 n False 0.00
63 712.8 712.8 AL30 30 38 n False 0.00
64 7 7 B31 31 2 n False 0
65 10 10 B34 34 2 n False 0
66 11 11 B35 35 2 n False 0
67 12 12 B36 36 2 n False 0
68 13 13 B37 37 2 n False 0
The number_format gives us what we need; let's tweak that to our advantage:
condition = numbers.number_format.str.contains('.', regex=False)
decimals = np.where(condition,
                    numbers.number_format.str.split(r'\.').str[-1].str.len(),
                    0)
lengths = numbers.value.astype(str).str.len()
if_no_decimals = [x[:-z] + str(y)
                  for x, y, z
                  in zip(numbers.number_format, numbers.value, lengths)]
decimals = [f"{x:.{y}f}"
            for x, y
            in zip(numbers.value, decimals)]
new_values = np.select([~condition, condition],
                       [if_no_decimals, decimals],
                       0)
numbers.assign(new_values=new_values)
value internal_value coordinate row column data_type is_date number_format new_values
38 1 1 B24 24 2 n False 0 1
39 2 2 C24 24 3 n False 0 2
40 3 3 G24 24 7 n False 0 3
41 9 9 T24 24 20 n False 0 9
42 10 10 W24 24 23 n False 0 10
43 11 11 Y24 24 25 n False 0 11
44 12 12 AB24 24 28 n False 0 12
45 13 13 AG24 24 33 n False 0 13
46 14 14 AI24 24 35 n False 0 14
47 15 15 AL24 24 38 n False 0 15
48 1 1 B25 25 2 n False 0 1
49 2 2 B26 26 2 n False 0 2
51 1341 1341 G26 26 7 n False 00000000000 00000001341
52 6 6 W26 26 23 n False 0.000 6.000
53 81.9 81.9 Y26 26 25 n False 0.00 81.90
54 491.4 491.4 AL26 26 38 n False 0.00 491.40
55 3 3 B27 27 2 n False 0 3
56 4 4 B28 28 2 n False 0 4
57 5 5 B29 29 2 n False 0 5
58 6 6 B30 30 2 n False 0 6
60 3005 3005 G30 30 7 n False 00000000000 00000003005
61 12 12 W30 30 23 n False 0.000 12.000
62 59.4 59.4 Y30 30 25 n False 0.00 59.40
63 712.8 712.8 AL30 30 38 n False 0.00 712.80
64 7 7 B31 31 2 n False 0 7
65 10 10 B34 34 2 n False 0 10
66 11 11 B35 35 2 n False 0 11
67 12 12 B36 36 2 n False 0 12
68 13 13 B37 37 2 n False 0 13
In short, you know best what logic is needed for munging or modifying the columns; hopefully this sets you in the right direction. Under the hood, xlsx_cells uses openpyxl to get the data and works only with xlsx files - I had to convert the shared data from xls to xlsx.
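If you want to script that conversion step, one option (a sketch, assuming LibreOffice is installed; the paths are just the ones from the question) is to call its headless converter from Python:
import subprocess

# convert media/No2.xls to media/No2.xlsx so that xlsx_cells/openpyxl can read it;
# unlike a pandas round-trip, this should keep the cells' number formats
subprocess.run(
    ['soffice', '--headless', '--convert-to', 'xlsx', '--outdir', 'media', 'media/No2.xls'],
    check=True,
)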
Upvotes: 0
Reputation: 14204
Ok. It turns out, based on the file you supplied, that the values in your file that look like strings are in fact simply integers. They look like strings because they have the "Special" number format applied. If we apply the "General" format instead, we see that it's just a number.
This is the reason why something like xls = pd.read_excel('media/No2.xls', converters={2:str})
has no effect, even if it is applied to the correct column index.
I've turned to xlrd for a different solution. With xlrd you can trace the number format of the cells in the sheets by adding formatting_info=True when loading the workbook. Basically, the code below iterates over all rows and columns in the first sheet (assuming that this is the correct one) and retrieves the number format of each cell.
E.g. for the cell mentioned above ("1341", formatted as "00000001341"), the returned value will be "00000000000". For all potential formats of this type ("0", "00", etc.), we check whether xls.iloc[row,col] contains an integer. If so, we change the integer to a string and add the required leading zeros by overwriting xls.iloc[row,col] with str(xls.iloc[row,col]).zfill(len(format_str)). I've added a print statement for the cells that will be changed, so that you can track in the console whether all changes make sense.
Code is as follows. Let me know if you still experience any difficulties.
import pandas as pd
import xlrd

# import xls, first sheet, with header=None
xls = pd.read_excel('media/No2.xls', header=None)

# add formatting_info=True to allow retrieving the number format
book = xlrd.open_workbook('media/No2.xls', formatting_info=True)
sh = book.sheet_by_index(0)
# xlrd uses 0-index for rows, cols, e.g.: sh.cell(0,0) == A1

# from the mock data I collected the following unique formats.
# I'm assuming we only need to deal with the long strings of zeros;
# the code below takes into account the possibility that the real data
# has more of these variants
# =============================================================================
# formats = ['0',
#            '0000000',      # this one
#            '0" "',
#            '00000000000',  # this one
#            '#,##0.00',
#            '0.000',
#            '0000000000',   # this one
#            '0.00',
#            'General']
# =============================================================================

# iterate over all rows (get length from xls DataFrame)
for row in range(xls.shape[0]):
    # iterate over all cols (get length from xls DataFrame)
    for col in range(xls.shape[1]):
        # get the number format of the cell
        cell = sh.cell(row, col)
        xf_index = cell.xf_index
        xf = book.xf_list[xf_index]
        format_key = xf.format_key
        format = book.format_map[format_key]
        format_str = format.format_str
        # if format like "0", "00", etc. and corresponding value in xls is an integer, define new string
        if len(format_str) == format_str.count('0') and isinstance(xls.iloc[row, col], int):
            temp = str(xls.iloc[row, col]).zfill(len(format_str))
            # if new string (== temp) is same as the one already in xls, do nothing
            # else: overwrite value
            if str(xls.iloc[row, col]) != temp:
                print(f'r{row} format: {format_str}; \timport: {xls.iloc[row, col]};' +
                      f'\tchanged to: {str(xls.iloc[row, col]).zfill(len(format_str))}')
                xls.iloc[row, col] = str(xls.iloc[row, col]).zfill(len(format_str))
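As a quick sanity check after the loop (purely illustrative; with header=None the columns are just integer positions, so column 2 is the 'code' column from the question):
# the affected cells should now hold zero-padded strings such as '00000001341'
print(xls[2])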
Upvotes: 2
Reputation: 14204
Try either:
xls = pd.read_excel('media/No2.xls', converters={'code':str})
or:
xls = pd.read_excel('media/No2.xls', converters={2:str})
By default pd.read_excel will use the first row in the sheet as the column names of the DataFrame. So, to set the conversion, you need to reference the correct column either by name ('code') or by index (2, but without quotes!).
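A small illustration of that name-versus-index point (same file name as in the question; whether the conversion actually helps depends on how the values are stored in the file):
import pandas as pd

# with the default header, the column can be addressed by its label ...
xls = pd.read_excel('media/No2.xls', converters={'code': str})

# ... but with header=None there are no labels, so only the positional index works
xls = pd.read_excel('media/No2.xls', header=None, converters={2: str})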
Upvotes: 1