hp.
hp.

Reputation: 513

Parsing file in php with no delimiter

I am working on a project that calls for parsing out some financial files into a database. The files have no delimiters, each record is essentially a huge string and they give me the map of what positions are what field.

Here is one record:

GCA0000000188T0544291021BW1079    021BW1079    JOE88TSUSD2015041420150414000000001053060070+000000001053060070+000000001053060070+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000000000000+000000000001053060+000000000001053060+000000000000263265+000000000000000000+021BW107904 042 000000000000000000+00000000000000000000000000000000000000000000PEPPERS AN034   04100003ALTERNATIVE FOCUS   BLUETREND L.P.      HOLDBACK                                                                           NUSD000000000010530600+0000000000+000000010000000000+000000000010530600+USD 

And the mapping goes something like this:

position 000-002 -> Field X

position 003-012 -> Field Y

position 013-023 -> Field Z

and so on.

Question here: whats the best way to do this? I am on a php stack. I was thinking about creating methods for each field, passing in the whole string, and returning the values. Is there a package that I can use to help?

Upvotes: 0

Views: 188

Answers (2)

user3942918
user3942918

Reputation: 26375

sscanf

The built-in function sscanf() works well for fixed-width fields like this - given the proper format it'll parse it out into an array. The format is described in the documentation for sprintf() (which is its counterpart) with only a few differences.

Here it's pretty simple:

  • %3s - 3 byte long string
  • %10s - 10 byte long string
  • %11s - 11 byte long string

Example 1:

$fields = sscanf($record, '%3s%10s%11s');
var_dump($fields);

Result:

array(3) {
  [0]=>
  string(3) "GCA"
  [1]=>
  string(10) "0000000188"
  [2]=>
  string(11) "T0544291021"
}

If instead you'd prefer for example to:

  • treat the second field as a signed decimal and return it as an integer
  • skip the following T
  • treat the third field as a signed decimal as well and also return it as an integer

You could use:

  • %3s - 3 byte long string
  • %10d - 10 character wide signed decimal
  • T - literal "T" (not returned)
  • %10d - 10 character wide signed decimal

Example 2:

$fields = sscanf($record, '%3s%10dT%10d');
var_dump($fields);

Result:

array(3) {
  [0]=>
  string(3) "GCA"
  [1]=>
  int(188)
  [2]=>
  int(544291021)
}

unpack

An alternative would be to use unpack(), whose format information you can find in the documentation for its counterpart pack()

  • A3Field X - Unpack 3 bytes as a string into "Field X"
  • A10Field Y - Unpack 10 bytes as a string into "Field Y"
  • A11Field Z - Unpack 11 bytes as a string into "Field Z"

Example 3:

$fields = unpack('A3Field X/A10Field Y/A11Field Z', $record);
var_dump($fields);

Result:

array(3) {
  ["Field X"]=>
  string(3) "GCA"
  ["Field Y"]=>
  string(10) "0000000188"
  ["Field Z"]=>
  string(11) "T0544291021"
}

Whether to use sscanf or unpack depends on the nature of your data and how exactly you need to treat it: sscanf does well interpreting characters in the string while unpack does well interpreting the bytes that make it up (it's more useful for binary data.)

For example given the string "1" with unpack you can get the value of the byte it consists of (i.e. its ASCII value):

$fields = unpack('c', "1");
var_dump($fields[1]); //int(49)

unpack is powerful but complex, you can find more information on it at http://perldoc.perl.org/perlpacktut.html

Upvotes: 2

MasterOdin
MasterOdin

Reputation: 7886

The easiest solution is probably to just use substr if you have what columns to use.

$string = /* the really long string */
$field_x = substr($string, 0, 3);
$field_y = substr($string, 3, 9);
/* etc */

Upvotes: 0

Related Questions