A. Blackmagic
A. Blackmagic

Reputation: 233

After converting a CSV file to Excel, integers are stored as strings - how to convert them back?

In this project I've converted a csv file to an xls file and a txt file to an xls file. The objective is to then compare both xls files for differences and print out any differences to a third excel file.

However, when the differences are printed they include any entry with an integer above 999, as any integer from my converted csv file is treated as a string instead of an integer. Therefore it treats a value such as 1,200 (in my converted xls file) differently from 1200 (in my converted txt file) due to the comma in the converted csv excel file.

My question is: Is there a way to convert the string interpreted integers, back to being interpreted as integers? Otherwise, is there a way to delete all commas from my xls files? I have tried the usual dataframe.replace methodology and it is ineffective.

Below is my code:

#import required libraries
import datetime
import xlrd
import pandas as pd

#define the time_handle function to name the outputted excel files
time_handle = datetime.datetime.now().strftime("%Y%m%d_%H%M")

#identify XM1 file paths (for both csv origin and excel destination)
XM1_csv = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"

#identify XM2 file paths (for both txt origin and excel destination)
XM2_txt = r"filepath"
XM2_excel = r"filepath" + time_handle + ".xlsx"

#remove commas from XM1 excel - failed attempts
#XM1_excel = [col.replace(',', '') for col in XM1_excel]
#XM1_excel = XM1_excel.replace(",", "")
#for line in XM1_excel:
        #XM1_excel.write(line.replace(",", ""))

#remove commas from XM1 CSV - failed attempts
#XM1_csv = [col.replace(',', '') for col in XM1_csv]
#XM1_csv = XM1_csv.replace(",", "")
#for line in XM1_csv:
        #XM1_excel.write(line.replace(",", ""))

#convert the csv XM1 file to an excel file, in the same folder
pd.read_csv(XM1_csv).to_excel(XM1_excel)

#convert the txt XM2 file to an excel file in the same folder
pd.read_csv(XM2_txt, sep="|").to_excel(XM2_excel)



#confirm XM1 filepath
filepath_XM1 = XM1_excel

#confirm XM2 filepath
filepath_XM2 = XM2_excel
#read relevant columns from the excel files
df1 = pd.read_excel(filepath_XM2, sheetname="Sheet1", parse_cols= "H, J, M, U")
df2 = pd.read_excel(filepath_XM1, sheetname="Sheet1", parse_cols= "C, E, G, K")

#remove all commas from XM1 - failed attempts
#df2 = [col.replace(',', '') for col in df2]
#df2 = df2.replace(",", "")
#for line in df2:
        #df2.write(line.replace(",", ""))

#merge the columns from both excel files into one column each respectively
df4 = df1["Exchange Code"] + df1["Product Type"] + df1["Product Description"] + df1["Quantity"].apply(str)
df5 = df2["Exchange"] + df2["Product Type"] + df2["Product Description"] + df2["Quantity"].apply(str)

#concatenate both columns from each excel file, to make one big column containing all the data
df = pd.concat([df4, df5])

#remove all whitespace from each row of the column of data
df=df.str.strip()
df=["".join(x.split()) for x in df]

#convert the data to a dataframe from a series
df = pd.DataFrame({'Value': df})

#remove any duplicates
df.drop_duplicates(subset=None, keep=False, inplace=True)

#print to the console just as a visual aid
print(df)
#output_path = r"filepath"
#print the erroneous entries to an excel file
df.to_excel("XM1_XM2Comparison" + time_handle + ".xls")

Also, I realize the XM1 and XM2 file names with regards to df1 and df2 is a bit confusing, but I simply renamed my files. It makes sense in terms of the files and where they belong in the code!

Thank You

Upvotes: 1

Views: 436

Answers (2)

A. Blackmagic
A. Blackmagic

Reputation: 233

I actually solved this issue with a simple fix for future reference. when reading the csv using pd.read_csv, I added the thousands method so it looks like this:

pd.read_csv(XM1, thousands = ",").to_excel(XM1_excel)

Upvotes: 0

MattR
MattR

Reputation: 5126

You can try an argument called converters on the read-end of the dataframe where you can specify the datatype. Example:

df= pd.read_excel(file, sheetname=YOUR_SHEET_HERE, converters={'FIELD_NAME': str})

converters is both in read_csv and read_excel

Upvotes: 1

Related Questions