Reputation: 87
Problem: I have a .tip file from NASDAQ that I need to parse. Official name: GENIUM CONSOLIDATED FEED
The file i a csv like file with semicolons and newline for new entries of different structure, and so no constant header. But it has a corresponding xsd schemafile which should describe the contents and structure, but I can see no clear way to go from the file to a structure result. Have tried with a list setup where messageType becomes a name in a list
x <- scan("cran_tasks/NOMX_Comm_Close2.tip", what="", sep="\n")
y <- strsplit(x, ';')
names(y) <- sapply(y, `[[`, 1)
y <- sapply(y, `[`, -1, simplify = FALSE)
y <- sapply(y, as.list)
The file is structured like this:
messageType;key1Value;key2Value;...;..;/n
messageType;key1Value;key2Value;.....;/n
BDSr;i2;NAmGITS;
BDx;i106;Si18;s2;SYmNC;NAmNASDAQ OMX Commodities;CNyNO;MIcNORX;
BDm;i672;Si018171;s2;Ex106;NAmFuel Oil;SYmNCFO;TOTa+0200;LDa20141011;
BDIs;i10142;SiNP;s2;ISsNP;NAmNord Pool ASA;
m;i122745;t191500.001;Dt20170509;ISOcY;ISOtY;
m;i122745;t192808.721;Dt20170509;ISOcN;ISOtY;SEp275.45;
Oi;i122745;t054425.600;OPi2840;
I have had a working sql code set to parse the file but it have shown to be to case specific to be robust against even minor changes in structure, like the order of the different keyValue pairs. So I'm looking for at way to exploit the the structure of the information to be able to make a robust and maintainable solution, preferably in R. I have tried with some regular expressions matching but still I end up with a lot of context specific code, so I hope the some structuring with a table or dataframe containing the Key information can make for a sustainable solution.
Any hints or suggestions are more than welcome.
link to the XML/XSD file and the html sheet specifying keys, and a .tip file
TIP Message Format The TIP protocol is a tagged text protocol. A TIP message is a sequence of tag and value pairs separated with semicolon. A tag is zero or more UPPERCASE characters followed by a lowercase character. The tag is followed immediately by the value. Examples of tags are "FLd", "STa". The first tag in a message is always the message type. The message type tag has no value. An example of a message type tag is "BDSh". IP messages are encoded with UTF-8 unless stated otherwise. The maximum length of a TIP message is indicated with the constant MAX_MESSAGE_LENGTH (2048 bytes). Any max field length excludes any escape characters '\'. No empty values will be sent; exceptions are message type tags and Boolean tags (the presence of the tag itself corresponds to a 'true' value). For a decimal field (i.e. the Float data type) the length is given as X,Y where X is the max number of digits in the integer part of the field (left of the separator). Y is the number of decimals (right of the separator). The order of the disseminated tags is not fixed, i.e. the client may not make any assumptions of the order of tags. The only fixed component of a message is the message type, which is always placed first in the message data. Note that new messages and fields may be added in future versions of the protocol. To ensure forward compatibility, clients should ignore unrecognized message types and field tags.
Upvotes: 0
Views: 526
Reputation: 42572
The data.table
solution below parses the given .tip file and returns a data.table with tag and value pairs. So, this is probably a good starting point for further extracting the relevant data.
library(data.table)
# read downloaded file from local disk
tip_wide <- fread(
"NOMX_Comm_Close2.tip"
, sep = "\n"
, header = FALSE
)
# split tip messages into tag and value pairs
# thereby rehaping from wide to long format
# and adding a row number
tip_long <- tip_wide[, unlist(strsplit(V1, ";")),
by = .(rn = seq_len(nrow(tip_wide)))]
# get message type tag as the first entry of each message
msg_type <- tip_long[, .(msg.type = first(V1)), by = rn]
# make message type a separate column for each tag-value-pair using join
# remove unnecessary rows
tip_result <- msg_type[long, on = "rn"][msg.type != V1]
# split tag and value pairs
tip_result[, c("tag", "value") :=
data.table(stringr::str_split_fixed(V1, "(?<=^[A-Z]{0,9}[a-z])", 2))]
tip_result
# rn msg.type V1 tag value
# 1: 1 BDSr i2 i 2
# 2: 1 BDSr NAmGITS NAm GITS
# 3: 2 BDx i106 i 106
# 4: 2 BDx Si18 Si 18
# 5: 2 BDx s2 s 2
# ---
#905132: 95622 BDCl s2 s 2
#905133: 95622 BDCl i2368992 i 2368992
#905134: 95622 BDCl Il2368596 Il 2368596
#905135: 95622 BDCl Op1 Op 1
#905136: 95622 BDCl Ra1 Ra 1
Note that the value
column is of type character.
The regular expression "(?<=^[A-Z]{0,9}[a-z])"
uses a look-behind assertion (see ?"stringi-search-regex"
) to define the split pattern. Note that {0,9}
is used here instead of *
as the look-behind pattern must not be unbounded (no * or + operators.)
Upvotes: 1