Parsing a NASDAQ .tip file

Question

Problem: I have a .tip file from NASDAQ that I need to parse. Official name: GENIUM CONSOLIDATED FEED
The file i a csv like file with semicolons and newline for new entries of different structure, and so no constant header. But it has a corresponding xsd schemafile which should describe the contents and structure, but I can see no clear way to go from the file to a structure result. Have tried with a list setup where messageType becomes a name in a list

x <- scan("cran_tasks/NOMX_Comm_Close2.tip", what="", sep="
")
y <- strsplit(x, ';')
names(y) <- sapply(y, `[[`, 1)
y <- sapply(y, `[`, -1, simplify = FALSE)
y <- sapply(y, as.list)

The file is structured like this:

messageType;key1Value;key2Value;...;..;/n  
messageType;key1Value;key2Value;.....;/n    

BDSr;i2;NAmGITS;  
BDx;i106;Si18;s2;SYmNC;NAmNASDAQ OMX Commodities;CNyNO;MIcNORX;  
BDm;i672;Si018171;s2;Ex106;NAmFuel Oil;SYmNCFO;TOTa+0200;LDa20141011;  
BDIs;i10142;SiNP;s2;ISsNP;NAmNord Pool ASA;  
m;i122745;t191500.001;Dt20170509;ISOcY;ISOtY;
m;i122745;t192808.721;Dt20170509;ISOcN;ISOtY;SEp275.45;  
Oi;i122745;t054425.600;OPi2840;

I have had a working sql code set to parse the file but it have shown to be to case specific to be robust against even minor changes in structure, like the order of the different keyValue pairs. So I'm looking for at way to exploit the the structure of the information to be able to make a robust and maintainable solution, preferably in R. I have tried with some regular expressions matching but still I end up with a lot of context specific code, so I hope the some structuring with a table or dataframe containing the Key information can make for a sustainable solution.

Any hints or suggestions are more than welcome.

link to the XML/XSD file and the html sheet specifying keys, and a .tip file

TIP Message Format The TIP protocol is a tagged text protocol. A TIP message is a sequence of tag and value pairs separated with semicolon. A tag is zero or more UPPERCASE characters followed by a lowercase character. The tag is followed immediately by the value. Examples of tags are "FLd", "STa". The first tag in a message is always the message type. The message type tag has no value. An example of a message type tag is "BDSh". IP messages are encoded with UTF-8 unless stated otherwise. The maximum length of a TIP message is indicated with the constant MAX_MESSAGE_LENGTH (2048 bytes). Any max field length excludes any escape characters '\'. No empty values will be sent; exceptions are message type tags and Boolean tags (the presence of the tag itself corresponds to a 'true' value). For a decimal field (i.e. the Float data type) the length is given as X,Y where X is the max number of digits in the integer part of the field (left of the separator). Y is the number of decimals (right of the separator). The order of the disseminated tags is not fixed, i.e. the client may not make any assumptions of the order of tags. The only fixed component of a message is the message type, which is always placed first in the message data. Note that new messages and fields may be added in future versions of the protocol. To ensure forward compatibility, clients should ignore unrecognized message types and field tags.

Uwe · Accepted Answer

The data.table solution below parses the given .tip file and returns a data.table with tag and value pairs. So, this is probably a good starting point for further extracting the relevant data.

library(data.table)

# read downloaded file from local disk
tip_wide <- fread(
  "NOMX_Comm_Close2.tip"
  , sep = "
"
  , header = FALSE
)

# split tip messages into tag and value pairs
# thereby rehaping from wide to long format
# and adding a row number
tip_long <- tip_wide[, unlist(strsplit(V1, ";")), 
                     by = .(rn = seq_len(nrow(tip_wide)))]

# get message type tag as the first entry of each message
msg_type <- tip_long[, .(msg.type = first(V1)), by = rn]
# make message type a separate column for each tag-value-pair using join
# remove unnecessary rows
tip_result <- msg_type[long, on = "rn"][msg.type != V1]

# split tag and value pairs
tip_result[, c("tag", "value") := 
             data.table(stringr::str_split_fixed(V1, "(?<=^[A-Z]{0,9}[a-z])", 2))]

tip_result
#           rn msg.type        V1 tag   value
#     1:     1     BDSr        i2   i       2
#     2:     1     BDSr   NAmGITS NAm    GITS
#     3:     2      BDx      i106   i     106
#     4:     2      BDx      Si18  Si      18
#     5:     2      BDx        s2   s       2
#    ---                                     
#905132: 95622     BDCl        s2   s       2
#905133: 95622     BDCl  i2368992   i 2368992
#905134: 95622     BDCl Il2368596  Il 2368596
#905135: 95622     BDCl       Op1  Op       1
#905136: 95622     BDCl       Ra1  Ra       1

Note that the value column is of type character.

The regular expression "(?<=^[A-Z]{0,9}[a-z])" uses a look-behind assertion (see ?"stringi-search-regex") to define the split pattern. Note that {0,9} is used here instead of * as the look-behind pattern must not be unbounded (no * or + operators.)

Parsing a NASDAQ .tip file

Answers (1)

Related Questions