NimbleTortoise
NimbleTortoise

Reputation: 365

Numpy Separating CSV into columns

I'm trying to use a CSV imported from bballreference.com. But as you can see, the separated values are all in one row rather than separated by columns. On NumPy Pandas, what would be the easiest way to fix this? I've googled to no avail.

csv on jupyter

I don't know how to post CSV file in a clean way but here it is:

",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"

Upvotes: 0

Views: 124

Answers (2)

Parfait
Parfait

Reputation: 107652

Consider reading in csv as a text file to be stripped of the beginning/end quotes per line on a text file read which tell the parser all data between is a singular value. And use built-in StringIO to read text string into dataframe instead of saving to disk for import.

Additionally, skip the first row of repeated Totals and Per Game and even the last row that aggregates since you can do that with pandas.

from io import StringIO
import pandas as pd

with open('BasketballCSVQuotes.csv') as f:
    csvdata = f.read().replace('"', '')

df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')

print(df)

Output

      Rk                      Player   Age  G   GS    MP   FG  FGA  3P  3PA  ...    PTS    FG%    3P%    FT%   MP.1  PTS.1  TRB.1  AST.1  STL.1  BLK.1
0    1.0      Kevin Durant\duranke01  29.0  5  5.0   182   54  107   9   28  ...    139  0.505  0.321  0.815   36.5   27.8    7.4    4.8    1.4    1.2
1    2.0     Klay Thompson\thompkl01  27.0  5  5.0   183   38   99  12   43  ...     99  0.384  0.279  1.000   36.7   19.8    6.4    1.8    0.2    0.4
2    3.0     Stephen Curry\curryst01  29.0  4  3.0   125   32   67  15   34  ...     98  0.478  0.441  1.000   31.2   24.5    5.3    3.5    2.0    0.5
3    4.0    Draymond Green\greendr01  27.0  5  5.0   186   27   55   8   20  ...     74  0.491  0.400  0.800   37.1   14.8   11.8   10.0    2.4    1.6
4    5.0    Andre Iguodala\iguodan01  34.0  5  4.0   140   14   29   4   12  ...     39  0.483  0.333  0.583   27.9    7.8    5.0    3.4    2.0    0.4
5    6.0         Quinn Cook\cookqu01  24.0  4  0.0    58   12   27   0   10  ...     30  0.444  0.000  0.750   14.4    7.5    2.3    1.0    0.3    0.0
6    7.0      Kevon Looney\looneke01  21.0  5  0.0   113   12   17   0    0  ...     28  0.706    NaN  0.500   22.6    5.6    5.8    1.0    0.8    0.2
7    8.0  Shaun Livingston\livinsh01  32.0  5  0.0    79   11   27   0    0  ...     26  0.407    NaN  1.000   15.9    5.2    1.2    2.4    0.0    0.2
8    9.0         David West\westda01  37.0  5  0.0    40    8   14   0    0  ...     16  0.571    NaN    NaN    7.9    3.2    1.4    2.6    0.4    0.8
9   10.0        Nick Young\youngni01  32.0  4  2.0    41    3   11   3   10  ...     11  0.273  0.300  0.667   10.2    2.8    1.0    0.3    0.3    0.0
10  11.0      JaVale McGee\mcgeeja01  30.0  3  1.0    19    3    8   0    1  ...      6  0.375  0.000    NaN    6.2    2.0    2.0    0.0    0.0    0.3
11  12.0     Zaza Pachulia\pachuza01  33.0  2  0.0     8    1    2   0    0  ...      4  0.500    NaN  0.500    4.2    2.0    3.0    0.0    1.0    0.0
12  13.0           Jordan Belelljo01  23.0  4  0.0    23    1    4   0    0  ...      3  0.250    NaN  0.500    5.8    0.8    1.5    1.3    0.5    0.5
13  14.0      Damian Jones\jonesda03  22.0  1  0.0     3    0    1   0    0  ...      2  0.000    NaN  1.000    3.2    2.0    0.0    0.0    0.0    0.0

[14 rows x 30 columns]

Upvotes: 0

Logstar
Logstar

Reputation: 472

It seems that the first two rows of your CSV file are headers, but the default behavior of pd.read_csv thinks that only the first row is header.

Also, the beginning and trailing quotes make pd.read_csv think the text in between is a single field/column.

You could try the following:

Remove the beginning and trailing quotes, and

bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')

Following is how you could use Python to remove the beginning and trailing quotes:

# open 'quotes.csv' in read mode with variable in_file as handle
# open 'no_quotes.csv' in write mode with variable out_file as handle
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
    # read in_file line by line
    # the variable line stores each line as string
    for line in in_file:
        # line[1:-1] slices the string to omit the first and last character
        # append a newline character '\n' to the sliced line
        # write the string with newline to out_file
        out_file.write(line[1:-1] + '\n')

# read_csv on 'no_quotes.csv'
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()

Upvotes: 2

Related Questions