DoomAngelBlade
DoomAngelBlade

Reputation: 53

Read Excel document's cells as string

I am trying to read everything in excel document, that looks like this:

When we read the excel file

xls = pd.ExcelFile('media/No2.xls')

We are getting a problem in column 2. Instead of getting us:

    code
    444444444505
    444444444506
    444444444506
    777777777001
    000000000025
    000000000029
    000000000044
    000000000045
    code
    111111111111
    111111111127
    000000001341

it gives us next:

    code
    444444444505
    444444444506
    444444444506
    777777777001
    25
    29
    44
    45
    code
    111111111111
    111111111127
    1341

We need that 000000001341 to stay as 000000001341, not 1341( and also other numbers with 0s infront of em). In order to fix this issue we tried, well, everything we could find!

dtype='str'\ str \ 'string' \ object \ 'object' \ whatever

converters={"2": object\string}, converters = object\string.

In end? No changes. It still reads 000000001341 as integer and outputs it as 1341.

Upvotes: 1

Views: 4710

Answers (3)

sammywemmy
sammywemmy

Reputation: 28699

One option is to read in the data as individual rows, manipulate the data to your specification to get your adjusted values. This is one way of doing it; I believe you can optimise it further. I will be using xlsx_cells from pyjanitor to read in the data:

# pip install pyjanitor
import pandas as pd
import janitor

content = jn.xlsx_cells('Downloads/TestDocument.xlsx', include_blank_cells=False)

numbers = content.loc[content.number_format.ne('General') & content.data_type.eq('n')]

In [65]: numbers
Out[65]:
    value internal_value coordinate  row  column data_type  is_date number_format
38      1              1        B24   24       2         n    False             0
39      2              2        C24   24       3         n    False             0
40      3              3        G24   24       7         n    False             0
41      9              9        T24   24      20         n    False             0
42     10             10        W24   24      23         n    False             0
43     11             11        Y24   24      25         n    False             0
44     12             12       AB24   24      28         n    False             0
45     13             13       AG24   24      33         n    False             0
46     14             14       AI24   24      35         n    False             0
47     15             15       AL24   24      38         n    False             0
48      1              1        B25   25       2         n    False             0
49      2              2        B26   26       2         n    False             0
51   1341           1341        G26   26       7         n    False   00000000000
52      6              6        W26   26      23         n    False         0.000
53   81.9           81.9        Y26   26      25         n    False          0.00
54  491.4          491.4       AL26   26      38         n    False          0.00
55      3              3        B27   27       2         n    False             0
56      4              4        B28   28       2         n    False             0
57      5              5        B29   29       2         n    False             0
58      6              6        B30   30       2         n    False             0
60   3005           3005        G30   30       7         n    False   00000000000
61     12             12        W30   30      23         n    False         0.000
62   59.4           59.4        Y30   30      25         n    False          0.00
63  712.8          712.8       AL30   30      38         n    False          0.00
64      7              7        B31   31       2         n    False             0
65     10             10        B34   34       2         n    False             0
66     11             11        B35   35       2         n    False             0
67     12             12        B36   36       2         n    False             0
68     13             13        B37   37       2         n    False             0

The number_format gives us what we need; let's tweak that to our advantage:

condition = numbers.number_format.str.contains('.', regex = False)
decimals = np.where(condition, 
                    numbers
                    .number_format
                    .str.split('\.')
                    .str[-1]
                    .str
                    .len(), 
                     0)

lengths = numbers.value.astype(str).str.len()
if_no_decimals = [x[:-z] + str(y) 
                  for x, y, z 
                  in zip(numbers.number_format, numbers.value, lengths)]

decimals = [f"{x:.{y}f}" 
            for x, y 
            in zip(numbers.value, decimals)]

new_values = np.select([~condition, condition], 
                       [if_no_decimals, decimals], 
                       0)

numbers.assign(new_values = new_values)

    value internal_value coordinate  row  column data_type  is_date number_format   new_values
38      1              1        B24   24       2         n    False             0            1
39      2              2        C24   24       3         n    False             0            2
40      3              3        G24   24       7         n    False             0            3
41      9              9        T24   24      20         n    False             0            9
42     10             10        W24   24      23         n    False             0           10
43     11             11        Y24   24      25         n    False             0           11
44     12             12       AB24   24      28         n    False             0           12
45     13             13       AG24   24      33         n    False             0           13
46     14             14       AI24   24      35         n    False             0           14
47     15             15       AL24   24      38         n    False             0           15
48      1              1        B25   25       2         n    False             0            1
49      2              2        B26   26       2         n    False             0            2
51   1341           1341        G26   26       7         n    False   00000000000  00000001341
52      6              6        W26   26      23         n    False         0.000        6.000
53   81.9           81.9        Y26   26      25         n    False          0.00        81.90
54  491.4          491.4       AL26   26      38         n    False          0.00       491.40
55      3              3        B27   27       2         n    False             0            3
56      4              4        B28   28       2         n    False             0            4
57      5              5        B29   29       2         n    False             0            5
58      6              6        B30   30       2         n    False             0            6
60   3005           3005        G30   30       7         n    False   00000000000  00000003005
61     12             12        W30   30      23         n    False         0.000       12.000
62   59.4           59.4        Y30   30      25         n    False          0.00        59.40
63  712.8          712.8       AL30   30      38         n    False          0.00       712.80
64      7              7        B31   31       2         n    False             0            7
65     10             10        B34   34       2         n    False             0           10
66     11             11        B35   35       2         n    False             0           11
67     12             12        B36   36       2         n    False             0           12
68     13             13        B37   37       2         n    False             0           13

In short, the logic for munging or modifying the columns, you know best. Hopefully this sets you in the right direction. Under the hood xlsx_cells uses openpyxl to get the data, and works only with xlsx files - I had to convert the shared data from xls to xlsx.

Upvotes: 0

ouroboros1
ouroboros1

Reputation: 14204

Ok. It turns out, based on the file you supplied, that the values in your file that look like strings are in fact simply integers. They look like string, because they have formatting "Special". E.g.:

enter image description here

But if we apply format "General", we see that it's just a number:

enter image description here

This is the reason why something like xls = pd.read_excel('media/No2.xls', converters={2:str}) has no effect, even if it is applied to the correct column index.

I've turned to xlrd for a different solution. With xlrd you can trace the number format of the cells in the sheets by adding formatting_info=True, when loading the workbook. Basically, the code below iterates over all rows and colums in the first sheet (assuming that this is the correct one) and retrieves the number format of each cell.

E.g. for the cell shown in the image above ("1341", formatted as "00000001341"), the returned value will be "00000000000". For all potential formats of this type ("0","00", etc.), we check if xls.iloc[row,col] contains an integer. If so, we change the integer to a string and add the required leading zeros by overwriting xls.iloc[row,col] with str(xls.iloc[row,col]).zfill(len(format_str)), as required. I've added a print statement for the cells that will be changed, so that you can track in the console, whether all changes make sense.

Code is as follows. Let me know if you still experience any difficulties.

import pandas as pd

# import xls, first sheet with header == None
xls = pd.read_excel('media/No2.xls', header=None)

import xlrd

# add formatting_info=True to allow for retrieving number format
book = xlrd.open_workbook('media/No2.xls', formatting_info=True)

sh = book.sheet_by_index(0)

# xlrd uses 0-index for rows, cols, e.g.: sh.cell(0,0) == A1

# from the mock data I collected the following unique formats.
# I'm assuming we only need to deal with the long strings of zeros
# the code below takes into account the possibility that the real data 
# has more of these variants
# =============================================================================
# formats = ['0',
#            '0000000', # this one
#            '0" "',
#            '00000000000', # this one
#            '#,##0.00',
#            '0.000',
#            '0000000000', # this one
#            '0.00',
#            'General']
# =============================================================================

# iterate over all rows (get length from xls DataFrame)
for row in range(xls.shape[0]):
    
    # iterate over all cols (get length from xls DataFrame)
    for col in range(xls.shape[1]):
        
        # get format cell
        cell = sh.cell(row,col)   
        xf_index = cell.xf_index
        xf = book.xf_list[xf_index]
        format_key = xf.format_key
        format = book.format_map[format_key]
        format_str = format.format_str
        
        # if format like "0", "00", etc. and corresponding value in xls is an integer, define new string
        if len(format_str) == format_str.count('0') and isinstance(xls.iloc[row, col],int):
            temp = str(xls.iloc[row,col]).zfill(len(format_str))
            
            # if new string (== temp) is same as the one already in xls, do nothing
            # else: overwrite value
            if str(xls.iloc[row,col]) != temp:
                print(f'r{row} format: {format_str}; \timport: {xls.iloc[row,col]};' +
                      f'\tchanged to: {str(xls.iloc[row,col]).zfill(len(format_str))}')
                xls.iloc[row,col] = str(xls.iloc[row,col]).zfill(len(format_str))

Upvotes: 2

ouroboros1
ouroboros1

Reputation: 14204

Try either:

xls = pd.read_excel('media/No2.xls', converters={'code':str})

or:

xls = pd.read_excel('media/No2.xls', converters={2:str})

By default pd.read_excel will use the first row in the sheet as the column names of the DataFrame. So, to set the conversion, you need to reference the correct column either by name ('code') or by index (2, but without qoutes!).

Upvotes: 1

Related Questions