João Moço

Reputation: 109

Pandas DataFrame - Issue regarding column formatting

I have a .txt file listing all the queries that have valid names. Its contents are the output of a SQL Server 19 query. The database holds the results of an algorithm that retrieves the brands most similar to the query entered. The file looks something like this:

2                    16, 42, 44                                                                                                                                                                                                A MINHA SAÚDE 
3                    34                                                                                                                                                                                                       !D D DUNHILL
4                    33                                                                                                                                                                                                       #MEGA
5                    09                                                                                                                                                                                                       (michelin man)
5                    12                                                                                                                                                                                                       (michelin man)
6                    33                                                                                                                                                                                                       *MONTE DA PEDRA*
7                    35                                                                                                                                                                                                       .FOX
8                    33                                                                                                                                                                                                       @BATISTA'S BY PITADA VERDE
9                    12                                                                                                                                                                                                       @COM
10                   41                                                                                                                                                                                                       + NATUREZA HUMANA
11                   12                                                                                                                                                                                                       001
12                   12                                                                                                                                                                                                       002
13                   12                                                                                                                                                                                                       1007
14                   12                                                                                                                                                                                                       101
15                   12                                                                                                                                                                                                       102
16                   12                                                                                                                                                                                                       104
17                   37                                                                                                                                                                                                       112 PC
18                   33                                                                                                                                                                                                       1128
19                   41                                                                                                                                                                                                       123 PILATES

The 1st column holds the query identifier, the 2nd holds the brand classes in which the query can appear, and the 3rd is the query itself (the padding spaces come from the SQL Server output formatting).

I then built a pandas DataFrame in Google Colaboratory, expecting its columns to match the ones in the text file. However, when I ran the code, it gave me this:

[Screenshot of the resulting DataFrame; image not available]

The code that I wrote is here:

# Dataframe with the total number of queries with valid names:
df = pd.DataFrame(pd.read_table("/content/drive/MyDrive/data/classes/100/queries100.txt", header=None, names=["Query ID", "Query Name", "Classes Where Query is Present"]))
df

I think this happens because of the commas in the 2nd column, but I'm not sure. Any suggestions as to why this is happening? I already tried read_csv and read_fwf, and the formatting was even worse.

Upvotes: 0

Views: 75

Answers (1)

AlexK

Reputation: 3011

You can use pd.read_fwf() with explicit colspecs in this case, as your columns have fixed widths:

import pandas as pd

df = pd.read_fwf(
    "/content/drive/MyDrive/data/classes/100/queries100.txt",
    # Each colspec is a half-open [start, end) range of character positions.
    colspecs=[(0, 20), (21, 40), (40, 1000)],
    header=None,
    # Names follow the file's column order: ID, classes, then the query itself.
    names=["Query ID", "Classes Where Query is Present", "Query Name"],
)
df.head()
#    Query ID  Classes Where Query is Present      Query Name
# 0         2                      16, 42, 44   A MINHA SAÚDE
# 1         3                              34    !D D DUNHILL
# 2         4                              33           #MEGA
# 3         5                              09  (michelin man)
# 4         5                              12  (michelin man)
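
As for why read_table gave an unexpected layout: it splits on tab characters by default, so if the file is padded with spaces only (an assumption, based on the sample shown), each whole line would end up in a single field. Below is a minimal, self-contained sketch of that behaviour and of the fixed-width fix. The in-memory sample, its column widths (21 and 40 characters), and the shortened column names are made up for illustration; adjust colspecs to match your real file.

```python
import io

import pandas as pd

# Hypothetical stand-in for queries100.txt: every field is padded with
# spaces to a fixed width (the widths 21 and 40 are illustrative only).
rows = [
    ("2", "16, 42, 44", "A MINHA SAÚDE"),
    ("3", "34", "!D D DUNHILL"),
    ("5", "09", "(michelin man)"),
]
sample = "\n".join(
    f"{qid:<21}{classes:<40}{name}" for qid, classes, name in rows
)

# read_table uses sep="\t" by default. The sample contains no tabs,
# so each line is read as one field and the frame has a single column.
as_table = pd.read_table(io.StringIO(sample), header=None)
print(as_table.shape)  # (3, 1)

# read_fwf slices each line by character position instead.
# The colspecs mirror the padding used to build the sample above.
as_fwf = pd.read_fwf(
    io.StringIO(sample),
    colspecs=[(0, 21), (21, 61), (61, 1000)],
    header=None,
    names=["Query ID", "Classes", "Query Name"],
    dtype=str,  # keep every field as text, e.g. "09" rather than 9
)
print(as_fwf)
```

Passing dtype=str is optional. Without it, pandas may parse an all-numeric column as integers, which would drop leading zeros in class codes such as 09.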

Upvotes: 1
