Jcmoney1010
Jcmoney1010

Reputation: 922

Importing XML Data into Database

Being the sports nerd that I am, I'm looking to take daily XML files that are produced by the Major League Baseball Website, and import them into either an Access or MySQL database. The issue I'm running into, is that almost every XML file they produce is just slightly different than the last. For example, one game file may have a field named batter23 that is next to event22 while the other file calls it batter24and is next to pitcher25. I know that XML files can be inconsistent, but I know there has to be a way to consistently get the data into a database. Is there anyway to standardize these XML files? Some code that will parse each file in a list, and organize them into a specific style and giving them consistent field names? Currently I import the XML file into a Excel sheet first, where I change the file type to a CSV, but from there the field names and column locations are still different from file to file.

My goal is to have all the files in a structure where I can quickly import them into a database each day, without having to manually change column locations, or field names. I'm open to any and all options, but my experience in most languages are rookie level at best, so forgive me for my lack of knowledge.

Upvotes: 1

Views: 636

Answers (1)

David K-J
David K-J

Reputation: 928

The files are pretty standard as far as XML goes.., you just have to figure what each file represents.

I did a quick look around a Red Sox v Royals game from September 14. (Go Sox!) In year_2014/month_09/day_14/gid_2014_09_14_bosmlb_kcamlb_1/players.xml I can see that Ortiz has an id of 120074.

If I look in batters for his player Id, I can see his stats for that game. (year_2014/month_09/day_14/gid_2014_09_14_bosmlb_kcamlb_1/batters/120074.xml)

It goes on. Basically, in order to load these files into a database, you will have perform some level of processing for them to make any sense. The IDs don't appear to change between games, but I only took a cursory glance.

As for loading the data, XML::Simple in perl can take an XML and spit out a perl data structure very easily. Unless you need something more heavy duty, this should cover you.

Loading the players.xml:

#!/bin/env perl
use strict; use warnings;

use Data::Dumper;
use XML::Simple;

my $players_xml = XMLin('players.xml');

print Dumper $xml;

Gives you something like:

$VAR1 = {
    'venue' => 'Kauffman Stadium',
    'date' => 'September 14, 2014',
    'team' => {
        'Boston Red Sox' => {
            'id' => 'BOS',
            'player' => {
                '605141' => {
                    'avg' => '.283',
                    'team_abbrev' => 'BOS',
                    'parent_team_id' => '111',
                    'hr' => '4',
                    'team_id' => '111',
                    'status' => 'A',
                    'last' => 'Betts',
                    'rl' => 'R',
                    'parent_team_abbrev' => 'BOS',
                    'first' => 'Mookie',
                    'rbi' => '12',
                    'game_position' => '2B',
                    'num' => '50',
                    'position' => '2B',
                    'current_position' => '2B',
                    'boxname' => 'Betts',
                    'bats' => 'R',
                    'bat_order' => '1'
                },
                ...

It's then trivial to navigate these hashes and insert DB rows as you like.

Upvotes: 2

Related Questions