Reputation: 145
I'm doing a project with this data repository, and for each football season, on top of the CSV file with the game data, there is an extra README which has the final results. This is in some sort of table like structure, and I want to read it into a Pandas dataframe. I've tried both "read_csv" and "read_table", but I'm not sure what delimiter is being used and if it possibly uses a multi-index... The MD file is as follows:
- Home - - Away - - Total -
Pld W D L F:A W D L F:A F:A +/- Pts
1. Club Brugge 34 14 2 1 45:12 11 4 2 38:18 83:30 +53 81
2. RSC Anderlecht 34 14 1 2 44:14 8 4 5 39:23 83:37 +46 71
3. Germinal Beerschot 34 12 3 2 34:11 3 5 9 19:26 53:37 +16 53
4. RWD Molenbeek 34 9 6 2 24:9 4 8 5 15:20 39:29 +10 53
5. K Lierse SK 34 8 4 5 30:21 6 6 5 24:24 54:45 +9 52
6. Standard Liège 34 9 7 1 28:15 4 5 8 23:31 51:46 +5 51
7. Sporting Charleroi 34 8 7 2 37:21 5 4 8 22:32 59:53 +6 50
8. Cercle Brugge 34 7 5 5 27:23 6 5 6 24:24 51:47 +4 49
9. KFC Lommel SK 34 7 6 4 20:15 7 0 10 20:30 40:45 -5 48
10. SC Eendracht Aalst 34 8 6 3 37:21 4 4 9 18:29 55:50 +5 46
11. KV Mechelen 34 8 4 5 20:16 4 4 9 20:30 40:46 -6 44
12. KRC Harelbeke 34 8 1 8 26:26 5 3 9 14:22 40:48 -8 43
13. Royal Antwerp FC 34 7 4 6 26:23 4 5 8 12:23 38:46 -8 42
14. KAA Gent 34 8 2 7 21:22 2 9 6 18:27 39:49 -10 41
15. Sint-Truidense VV 34 7 4 6 29:28 4 3 10 13:32 42:60 -18 40
16. RFC Seraing 34 5 4 8 18:24 3 1 13 17:51 35:75 -40 29
17. KSK Beveren 34 4 7 6 24:25 2 2 13 14:32 38:57 -19 27
18. SV Zulte Waregem 34 3 4 10 18:36 1 5 11 12:34 30:70 -40 21
Pld = Matches; W = Matches won; D = Matches drawn; L = Matches lost; F = Goals for; A = Goals against; +/- = Goal differencence; Pts = Points
How can I best read this file? Cheers!
Upvotes: 0
Views: 3002
Reputation: 1784
A quick way could be:
df = pd.read_table('data.txt', sep='\s{2,}', header=1, engine='python')
Pld W D L F:A W.1 D.1 L.1 F:A.1 F:A.2 +/- Pts
1. Club Brugge 34 14 2 1 45:12 11 4 2 38:18 83:30 53 81
2. RSC Anderlecht 34 14 1 2 44:14 8 4 5 39:23 83:37 46 71
3. Germinal Beerschot 34 12 3 2 34:11 3 5 9 19:26 53:37 16 53
...
Explanation
sep='\s{2,}'
will split the data by 2 or more spacesengine='python'
is required for sep='\s{2,}'
to workTo clean up the column names you could set the prefix for each column manually:
# Manually set column prefix for each column
k = ['Pld'] + ['Home']*4 + ['Away']*4 + ['Total']*3
# Drop duplicate column name suffix (e.g. D.1 --> D)
cols = [re.sub('\.[0-9]*$', '', c) for c in df.columns]
# Update column names
df.columns = [f'{k}_{c}' for k, c in zip(k, cols)]
Upvotes: 0
Reputation: 11391
This file isn't in a particular format - it's meant to be human-readable rather than machine readable. So you'll probably need to do some conversion on your own first.
One simple way:
lines = text.split('\n')
df = pd.DataFrame([re.split(r'\s+', line[34:]) for line in lines])
You can name the columns directly:
df.columns = [['home_pld', 'home_w', 'home_d', 'home_l', 'home_fa',
'home_pld', 'home_w', 'home_d', 'home_l', 'home_fa',
'total_fa', 'total_plusminus', 'total_points']]
And add the club name:
df['club'] = [line[4:34].strip() for line in lines]
Upvotes: 1
Reputation: 379
This file is not formatted as a CSV (which can use both ',' and ';' delimiters. In your case you only have spaces to work with, so the approach would be for every line to split it by space char, get rid of empty entries and fetch them by index.
f = open("your.csv", 'r')
for l in f.readlines():
vals = [l for l in l.split(' ') if l]
index = vals[0]
name = vals[1]
goals_fa = tuple(vals[6].split(':'))
...
# fill dataframe
f.close()
Upvotes: 1