Reputation: 365
I'm trying to use a CSV imported from bballreference.com. But as you can see, the separated values are all in one row rather than separated by columns. On NumPy Pandas, what would be the easiest way to fix this? I've googled to no avail.
I don't know how to post CSV file in a clean way but here it is:
",,,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Totals,Shooting,Shooting,Shooting,Per Game,Per Game,Per Game,Per Game,Per Game,Per Game"
"Rk,Player,Age,G,GS,MP,FG,FGA,3P,3PA,FT,FTA,ORB,DRB,TRB,AST,STL,BLK,TOV,PF,PTS,FG%,3P%,FT%,MP,PTS,TRB,AST,STL,BLK"
"1,Kevin Durant\duranke01,29,5,5,182,54,107,9,28,22,27,3,34,37,24,7,6,10,7,139,.505,.321,.815,36.5,27.8,7.4,4.8,1.4,1.2"
"2,Klay Thompson\thompkl01,27,5,5,183,38,99,12,43,11,11,3,29,32,9,1,2,6,11,99,.384,.279,1.000,36.7,19.8,6.4,1.8,0.2,0.4"
"3,Stephen Curry\curryst01,29,4,3,125,32,67,15,34,19,19,2,19,21,14,8,2,15,6,98,.478,.441,1.000,31.2,24.5,5.3,3.5,2.0,0.5"
"4,Draymond Green\greendr01,27,5,5,186,27,55,8,20,12,15,12,47,59,50,12,8,18,16,74,.491,.400,.800,37.1,14.8,11.8,10.0,2.4,1.6"
"5,Andre Iguodala\iguodan01,34,5,4,140,14,29,4,12,7,12,4,21,25,17,10,2,3,7,39,.483,.333,.583,27.9,7.8,5.0,3.4,2.0,0.4"
"6,Quinn Cook\cookqu01,24,4,0,58,12,27,0,10,6,8,1,8,9,4,1,0,2,4,30,.444,.000,.750,14.4,7.5,2.3,1.0,0.3,0.0"
"7,Kevon Looney\looneke01,21,5,0,113,12,17,0,0,4,8,10,19,29,5,4,1,2,17,28,.706,,.500,22.6,5.6,5.8,1.0,0.8,0.2"
"8,Shaun Livingston\livinsh01,32,5,0,79,11,27,0,0,4,4,0,6,6,12,0,1,3,9,26,.407,,1.000,15.9,5.2,1.2,2.4,0.0,0.2"
"9,David West\westda01,37,5,0,40,8,14,0,0,0,0,2,5,7,13,2,4,3,4,16,.571,,,7.9,3.2,1.4,2.6,0.4,0.8"
"10,Nick Young\youngni01,32,4,2,41,3,11,3,10,2,3,0,4,4,1,1,0,1,3,11,.273,.300,.667,10.2,2.8,1.0,0.3,0.3,0.0"
"11,JaVale McGee\mcgeeja01,30,3,1,19,3,8,0,1,0,0,4,2,6,0,0,1,0,2,6,.375,.000,,6.2,2.0,2.0,0.0,0.0,0.3"
"12,Zaza Pachulia\pachuza01,33,2,0,8,1,2,0,0,2,4,4,2,6,0,2,0,1,1,4,.500,,.500,4.2,2.0,3.0,0.0,1.0,0.0"
"13,Jordan Bell\belljo01,23,4,0,23,1,4,0,0,1,2,1,5,6,5,2,2,0,2,3,.250,,.500,5.8,0.8,1.5,1.3,0.5,0.5"
"14,Damian Jones\jonesda03,22,1,0,3,0,1,0,0,2,2,0,0,0,0,0,0,0,0,2,.000,,1.000,3.2,2.0,0.0,0.0,0.0,0.0"
",Team Totals,26.5,5,,1200,216,468,51,158,92,115,46,201,247,154,50,29,64,89,575,.462,.323,.800,240.0,115.0,49.4,30.8,10.0,5.8"
Upvotes: 0
Views: 124
Reputation: 107652
Consider reading in csv as a text file to be stripped of the beginning/end quotes per line on a text file read which tell the parser all data between is a singular value. And use built-in StringIO
to read text string into dataframe instead of saving to disk for import.
Additionally, skip the first row of repeated Totals and Per Game and even the last row that aggregates since you can do that with pandas.
from io import StringIO
import pandas as pd
with open('BasketballCSVQuotes.csv') as f:
csvdata = f.read().replace('"', '')
df = pd.read_csv(StringIO(csvdata), skiprows=1, skipfooter=1, engine='python')
print(df)
Output
Rk Player Age G GS MP FG FGA 3P 3PA ... PTS FG% 3P% FT% MP.1 PTS.1 TRB.1 AST.1 STL.1 BLK.1
0 1.0 Kevin Durant\duranke01 29.0 5 5.0 182 54 107 9 28 ... 139 0.505 0.321 0.815 36.5 27.8 7.4 4.8 1.4 1.2
1 2.0 Klay Thompson\thompkl01 27.0 5 5.0 183 38 99 12 43 ... 99 0.384 0.279 1.000 36.7 19.8 6.4 1.8 0.2 0.4
2 3.0 Stephen Curry\curryst01 29.0 4 3.0 125 32 67 15 34 ... 98 0.478 0.441 1.000 31.2 24.5 5.3 3.5 2.0 0.5
3 4.0 Draymond Green\greendr01 27.0 5 5.0 186 27 55 8 20 ... 74 0.491 0.400 0.800 37.1 14.8 11.8 10.0 2.4 1.6
4 5.0 Andre Iguodala\iguodan01 34.0 5 4.0 140 14 29 4 12 ... 39 0.483 0.333 0.583 27.9 7.8 5.0 3.4 2.0 0.4
5 6.0 Quinn Cook\cookqu01 24.0 4 0.0 58 12 27 0 10 ... 30 0.444 0.000 0.750 14.4 7.5 2.3 1.0 0.3 0.0
6 7.0 Kevon Looney\looneke01 21.0 5 0.0 113 12 17 0 0 ... 28 0.706 NaN 0.500 22.6 5.6 5.8 1.0 0.8 0.2
7 8.0 Shaun Livingston\livinsh01 32.0 5 0.0 79 11 27 0 0 ... 26 0.407 NaN 1.000 15.9 5.2 1.2 2.4 0.0 0.2
8 9.0 David West\westda01 37.0 5 0.0 40 8 14 0 0 ... 16 0.571 NaN NaN 7.9 3.2 1.4 2.6 0.4 0.8
9 10.0 Nick Young\youngni01 32.0 4 2.0 41 3 11 3 10 ... 11 0.273 0.300 0.667 10.2 2.8 1.0 0.3 0.3 0.0
10 11.0 JaVale McGee\mcgeeja01 30.0 3 1.0 19 3 8 0 1 ... 6 0.375 0.000 NaN 6.2 2.0 2.0 0.0 0.0 0.3
11 12.0 Zaza Pachulia\pachuza01 33.0 2 0.0 8 1 2 0 0 ... 4 0.500 NaN 0.500 4.2 2.0 3.0 0.0 1.0 0.0
12 13.0 Jordan Belelljo01 23.0 4 0.0 23 1 4 0 0 ... 3 0.250 NaN 0.500 5.8 0.8 1.5 1.3 0.5 0.5
13 14.0 Damian Jones\jonesda03 22.0 1 0.0 3 0 1 0 0 ... 2 0.000 NaN 1.000 3.2 2.0 0.0 0.0 0.0 0.0
[14 rows x 30 columns]
Upvotes: 0
Reputation: 472
It seems that the first two rows of your CSV file are headers, but the default behavior of pd.read_csv
thinks that only the first row is header.
Also, the beginning and trailing quotes make pd.read_csv
think the text in between is a single field/column.
You could try the following:
Remove the beginning and trailing quotes, and
bbal = pd.read_csv('some_file.csv', header=[0, 1], delimiter=',')
Following is how you could use Python to remove the beginning and trailing quotes:
# open 'quotes.csv' in read mode with variable in_file as handle
# open 'no_quotes.csv' in write mode with variable out_file as handle
with open('quotes.csv') as in_file, open('no_quotes.csv', 'w') as out_file:
# read in_file line by line
# the variable line stores each line as string
for line in in_file:
# line[1:-1] slices the string to omit the first and last character
# append a newline character '\n' to the sliced line
# write the string with newline to out_file
out_file.write(line[1:-1] + '\n')
# read_csv on 'no_quotes.csv'
bbal = pd.read_csv('no_quotes.csv', header=[0, 1], delimiter=',')
bbal.head()
Upvotes: 2