Tan
Tan

Reputation: 17

Converting Multiple .xlsx Files to .csv - Pandas reading only 1 column

`` Hello everyone, I am working on a deep learning project. The data I will use for the project consists of multiple excel files. Since I will be using the pd.read_csv command of the Pandas library, I used a VBA code that automatically converts all excel files to csv format.

Here is the VBA CODE: (xlsx to csv)

Sub WorkbooksSaveAsCsvToFolder()

'UpdatebyExtendoffice20181031

Dim xObjWB As Workbook

Dim xObjWS As Worksheet

Dim xStrEFPath As String

Dim xStrEFFile As String

Dim xObjFD As FileDialog

Dim xObjSFD As FileDialog

Dim xStrSPath As String

Dim xStrCSVFName As String

Dim xS  As String

    Application.ScreenUpdating = False

    Application.EnableEvents = False

    Application.Calculation = xlCalculationManual

    Application.DisplayAlerts = False

    On Error Resume Next

Set xObjFD = Application.FileDialog(msoFileDialogFolderPicker)

    xObjFD.AllowMultiSelect = False

    xObjFD.Title = "Kutools for Excel - Select a folder which contains Excel files"

    If xObjFD.Show <> -1 Then Exit Sub

    xStrEFPath = xObjFD.SelectedItems(1) & "\"

    Set xObjSFD = Application.FileDialog(msoFileDialogFolderPicker)

 
    xObjSFD.AllowMultiSelect = False

    xObjSFD.Title = "Kutools for Excel - Select a folder to locate CSV files"

    If xObjSFD.Show <> -1 Then Exit Sub

    xStrSPath = xObjSFD.SelectedItems(1) & "\"


    xStrEFFile = Dir(xStrEFPath & "*.xlsx*")


    Do While xStrEFFile <> ""

       xS = xStrEFPath & xStrEFFile

        Set xObjWB = Application.Workbooks.Open(xS)

        xStrCSVFName = xStrSPath & Left(xStrEFFile, InStr(1, xStrEFFile, ".") - 1) & ".csv"

        xObjWB.SaveAs Filename:=xStrCSVFName, FileFormat:=xlCSV

        xObjWB.Close savechanges:=False

        xStrEFFile = Dir

  Loop

    Application.Calculation = xlCalculationAutomatic

    Application.EnableEvents = True

    Application.ScreenUpdating = True

    Application.DisplayAlerts = True

End Sub

With this code, thousands of .xlsx files become .csv. The problem here is that although the conversion happens correctly, when I use the pd.read_csv command, it only reads 1 column.

As it seems:

    0
0   PlatformData,2,0.020000,43.000000,33.000000,32...
1   PlatformData,1,0.020000,42.730087,33.000000,25...
2   PlatformData,2,0.040000,43.000000,33.000000,32...
3   PlatformData,1,0.040000,42.730141,33.000006,25...
4   PlatformData,2,0.060000,43.000000,33.000000,32...
... ...
9520    PlatformData,1,119.520000,42.931132,33.056849,...
9521    PlatformData,1,119.540000,42.931184,33.056868,...
9522    PlatformData,1,119.560000,42.931184,33.056868,...
9523    PlatformData,1,119.580000,42.931237,33.056887,...
9524    PlatformData,1,119.600000,42.931237,33.056887,...

Because the column part is not correct, it combines the data and prevents me from training the model.

Afterwards, in order to understand what the problem was, I saw that the problem disappeared when I converted only 1 excel file to .csv format manually using the "Save as" command and read it using the pandas library.

Which looks like this:

0   1   2   3   4   5   6   7   8   9   10  11
0   PlatformData    2   0.02    43.000000   33.000000   3200.0  0.000000    0.0 0.0 0.000000    0.000000    -0.0
1   PlatformData    1   0.02    42.730087   33.000000   3050.0  60.000029   0.0 0.0 74.999931   129.903854  -0.0
2   PlatformData    2   0.04    43.000000   33.000000   3200.0  0.000000    -0.0    0.0 0.000000    0.000000    -0.0
3   PlatformData    1   0.04    42.730114   33.000064   3050.0  60.000029   0.0 0.0 74.999931   129.903854  -0.0
4   PlatformData    2   0.06    43.000000   33.000000   3200.0  0.000000    -0.0    0.0 0.000000    0.000000    -0.0
... ... ... ... ... ... ... ... ... ... ... ... ...
57867   PlatformData    1   119.72  42.891333   33.019166   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57868   PlatformData    1   119.74  42.891333   33.019166   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57869   PlatformData    1   119.76  42.891387   33.019172   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57870   PlatformData    1   119.78  42.891387   33.019172   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0
57871   PlatformData    1   119.80  42.891441   33.019178   2550.0  5.000000    0.0 0.0 149.429214  13.073360   -0.0

As seen here, each comma is separated as a separate column.

I need to convert multiple files using VBA or some other convert technique because I have so many excel files. But as you can see, even though the format of the files is translated correctly, it is read incorrectly by pandas.

I've tried converting with a bunch of different VBA codes so far. Then I tried to read it with the read_excel command on python and then convert it with to_csv, but I encountered the same problem again. (Reading only 1 column)

What do I need to do to make it look like it was when I changed the format manually? Is there an error in the VBA code or do I need to implement another method for this operation?

Thank you for your interest. Thanks in advance for any help

Upvotes: 1

Views: 339

Answers (2)

wjkw1
wjkw1

Reputation: 144

I'm not sure how to change your OS separator like @FunThomas suggested, perhaps you could instead specify the delimiter used for read_csv() or writing out to_csv()

Have you tried specifying a delimiter? i.e.

import pandas as pd
df = pd.read_csv('Book1.csv', sep='\t')
print(df)

See more here: https://www.geeksforgeeks.org/pandas-dataframe-to-csv-file-using-tab-separator/

Note the link above shows to_csv, but the param sep exists for read_csv too. See docs here.

Upvotes: 1

FunThomas
FunThomas

Reputation: 29286

Dealing with CSV is a tricky thing (not only in Excel). "CSV" stands for "comma separated values", and Excel takes this literally: When you use SaveAs FileFormat:=xlCSV, it will put a comma between your values. Except if you are using local setting on your computer that have a different separator defined, then Excel is using that separator (on my computer, for example, a semicolon).

Your Pandas seems to expect tab characters as separator. You could try SaveAs FileFormat:=xlText or xlTextWindows - on my computer that generated tab separated files, but I couldn't find a documentation telling that this is always the case. The alternative is to use a small routine that writes the file manually - see for example VBA code to save Excel sheet as tab-delimited text file

However, I doubt that you cannot bring Pandas to read comma separated files. According to https://pandas.pydata.org/docs/user_guide/io.html#io-read-csv-table, you should be able to define the separation character.

Upvotes: 1

Related Questions