I spent some time last year trying to convince people not to rely on FIX logs when measuring the latency of FIX message exchange, especially when those logs are collected from different JVMs or even different hosts. That may sound silly, but eventually I gave up and wrote some scripts that help automate the process. Unexpectedly (for them, not for me), the results were inconsistent and unpredictable. But this post is not about that. I decided to pay tribute to a wonderful utility without which I simply can't imagine how I would have coped with that task. Let me write some warm words about AWK. It really stands out among other data-driven scripting engines thanks to its power and flexibility.
The FIX protocol is used for Financial Information eXchange in the form of short messages. Every financial institution that does electronic trading is armed with FIX. There are many kinds of messages, but the most frequent are the so-called quote messages — or, roughly speaking, prices. Obviously, everybody is keen to exchange them as fast as possible, and consequently it is extremely important to demonstrate that they really are "flying". The easiest way to do that is a log entry dumped each time a message is received or sent. That log entry is, effectively, a copy of the message and contains at least one timestamp that gives you an idea of when the message was created. I said at least one because usually each log entry is also timestamped immediately before the message is sent or right after it is received. It looks like this:
20100101-15:46:19.106: <FIX message>
A FIX message is, in fact, a sequence of key-value pairs delimited by <SOH>, a special character (0x01 in hex). And yes, otherwise we are dealing with ordinary ASCII text. The task looks pretty simple, doesn't it? We have a bunch of log files, and we have to extract the required information from them for further processing. So here we go with the first AWK action:
BEGIN {
    FS = "\1"    # set <SOH> as the field separator
}
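To see what that field separator buys us, here is a minimal sketch (with a fabricated log line, not real venue data) that splits one entry on <SOH> and prints each tag=value field on its own line:

```shell
# \001 is the <SOH> delimiter; the sample tags are made up for illustration
printf '20100101-15:46:19.106: 8=FIX.4.4\00135=W\001299=Q1\n' |
awk 'BEGIN { FS = "\1" } { for (i = 1; i <= NF; i++) print i, $i }'
# 1 20100101-15:46:19.106: 8=FIX.4.4
# 2 35=W
# 3 299=Q1
```

Note that the log timestamp and the first FIX field land together in $1, which is why the scripts below strip the ": ..." tail from it.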
The first problem emerges when you look at what exactly you have to extract. As mentioned above, we are interested in the creation timestamp of each price, and if we look at a log entry we may realize that each message may carry multiple prices.
I said may because there are a few kinds of quote messages. Without digging too deep into the abyss of FIX, just note that they can be identified by a special key-value pair (or "field", in FIX terminology) with the key "35". And some of them (i.e. quote messages) may include multiple entries, each representing a price. Each of those entries, in turn, has multiple fields that carry price attributes such as volume, identifier and so on. So far, we may add one more action:
$0 ~ /(\1)35=W(\1)/ {         # that's Market Data - Snapshot/Full Refresh (W)
    sub(/: .+/, "", $1)       # extract the log entry timestamp
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^299=/) {   # on each QuoteEntryID(299) print out quote ID and timestamp
            printf("%s, %s\n", substr($i, 5), $1)
        }
    }
}
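Here is a sketch of that action run on one fabricated snapshot line (tags trimmed down for brevity); a message with two quote entries should yield two rows:

```shell
# fabricated 35=W message with two QuoteEntryID(299) entries
printf '20100101-15:46:19.106: 8=FIX.4.4\00135=W\001299=ID1\001270=1.25\001299=ID2\001270=1.26\00110=123\n' |
awk 'BEGIN { FS = "\1" }
$0 ~ /(\1)35=W(\1)/ {
    sub(/: .+/, "", $1)
    for (i = 2; i <= NF; i++)
        if ($i ~ /^299=/)
            printf("%s, %s\n", substr($i, 5), $1)
}'
# ID1, 20100101-15:46:19.106
# ID2, 20100101-15:46:19.106
```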
In a similar way we can process all the required quote messages. Simple, isn't it? :)
Below is a slightly more complex script that also takes SendingTime(52) into account and can additionally handle Market Data - Incremental Refresh (X) messages.
#!/bin/awk -f
BEGIN { FS = "\1" }
$0 ~ /(\1)35=W(\1)/ {
    sub(/: .+/, "", $1)
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^52=/) {
            time = toTick(substr($i, 4))
        }
        if ($i ~ /^299=/) {
            printf("%s, %s, %s\n", substr($i, 5), time, toTick($1))
        }
    }
}
$0 ~ /(\1)35=X(\1)/ {
    sub(/: .+/, "", $1)
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^52=/) {
            time = toTick(substr($i, 4))
        }
        if ($i ~ /^282=/) {
            printf("%s, %s, %s\n", substr($i, 5), time, toTick($1))
        }
    }
}
function toTick(stamp) {
    return stamp
}
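A sketch of running the script against one fabricated incremental-refresh line (the file name fixlat.awk is assumed, and only the 35=X action is reproduced here); with the stub toTick() the timestamps pass through unchanged:

```shell
# write a trimmed copy of the script (assumed name) and feed it one 35=X line
cat > fixlat.awk <<'EOF'
BEGIN { FS = "\1" }
$0 ~ /(\1)35=X(\1)/ {
    sub(/: .+/, "", $1)
    for (i = 2; i <= NF; i++) {
        if ($i ~ /^52=/)  { time = toTick(substr($i, 4)) }
        if ($i ~ /^282=/) { printf("%s, %s, %s\n", substr($i, 5), time, toTick($1)) }
    }
}
function toTick(stamp) { return stamp }
EOF
printf '20100101-15:46:20.106: 8=FIX.4.4\00135=X\00152=20100101-15:46:20.100\001282=BANK1\001\n' |
awk -f fixlat.awk
# BANK1, 20100101-15:46:20.100, 20100101-15:46:20.106
```

The gap between the second and third columns is exactly the kind of per-message delta the whole exercise is after.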
The empty function toTick can be made responsible for converting the timestamp from a human-readable format into something more useful for processing: UNIX epoch time.
##
# Expects a timestamp in the form "20100806-09:09:51.348"
# Requires GAWK
function toTick(stamp) {
    epoch = mktime(substr(stamp, 1, 4)  " " substr(stamp, 5, 2)  " " \
                   substr(stamp, 7, 2)  " " substr(stamp, 10, 2) " " \
                   substr(stamp, 13, 2) " " substr(stamp, 16, 2))
    return (epoch + substr(stamp, 18)) * 1000
}
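The substr() offsets are easy to misread, so here is a quick sketch (plain awk, no gawk needed) showing how they carve up such a stamp into the pieces mktime() expects, plus the fractional tail that gets added back afterwards:

```shell
# split "YYYYMMDD-HH:MM:SS.mmm" into year month day hour min sec .frac
echo '20100806-09:09:51.348' |
awk '{ print substr($0, 1, 4), substr($0, 5, 2), substr($0, 7, 2),
       substr($0, 10, 2), substr($0, 13, 2), substr($0, 16, 2), substr($0, 18) }'
# 2010 08 06 09 09 51 .348
```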