From: John Conover <john@email.johncon.com>
Subject: An approximation for data set size risk
Date: 9 Jun 2003 03:16:21 -0000
Let avg be the average of the marginal increments of a time series, and rms be their root mean square. Then, from http://www.johncon.com/ntropix/:

        avg
        --- + 1
        rms
    P = -------
           2

which is the likelihood of an up movement in the time series, and:

    G = (1 + rms)^P * (1 - rms)^(1 - P)

where G is the gain in the time series. Note that rms - avg is the risk. (Also, note that avg is not the average gain! G is.)

Sanity check. Suppose we use the above formulas on a savings account. Then avg = rms, and P = 1, so G = 1 + rms = 1 + avg. Check, where avg = rms = the interest rate.

When doing metrics on a time series, the measurement of rms converges very quickly; avg very slowly. In point of fact, we will probably never have a good idea what avg is for equities. For example, for the daily closing median values of all equities on the US exchanges in the Twentieth Century, avg = 0.0004 and rms = 0.02:

    tsshannoneffective 0.0004 0.02 12000

    For P = (sqrt (avg) + 1) / 2:
        P = 0.510000
        Peff = 0.501674

    For P = (rms + 1) / 2:
        P = 0.510000
        Peff = 0.509410

    For P = (avg / rms + 1) / 2:
        P = 0.510000
        Peff = 0.499942

which means that even to determine that the typical stock had an even likelihood of increasing on any day would require about 12,000 / 253 = 47 calendar years of daily trading data! What this means is that there is a significant probability that, by serendipity, the metrics were taken while the stock was in a "bubble", and we would be misled.

Sanity check. What's the chance of a stock's value being in a "bubble" for at least 12,000 days? It's erf (1 / sqrt (12000)), which is about 1 / sqrt (t) for t >> 1, or about 0.00912870929, and 1 - erf (1 / sqrt (12000)) = 0.990871291. Or, we have a risk, (note that term again,) of about 1% on a data set size of 12,000 days. Note that 0.990871291 * 0.51 = 0.505344358, which is a very close approximation, to within about 1%, of the results given by the tsshannoneffective program, (which does things quite formally; it's the same code that is in tsinvest.) So, it checks.

But we know that the rms measurement settles quite quickly, and we can add the risk of the data set size being too small into the risk of the investment. Letting P' be P compensated for data set size:

    P' = P * (1 - erf (1 / sqrt (t)))

which, as above, can be approximated by:

    P' = P * (1 - 1 / sqrt (t))

where P' combines the investment risk AND the risk due to limited data set size. It requires only avg and rms, measured over t many time intervals, where t >> 1, and it can be used directly in the equation, above, for G. You can almost work it out in your head.

    John

BTW, how many days does one have to measure a "typical" stock's performance to have a reasonable idea that it is capable of sustained growth? Setting P' = 0.5:

    0.5 = 0.51 * (1 - (1 / sqrt (t)))

or t > 2601 trading days, or about 10 years of daily data. Using a data set size smaller than that, one will lose as much as one makes, (although you can make a lot before you lose it.)

Now, consider a portfolio of ten of the same stocks, with equal investments maintained in each. Averaging ten like, uncorrelated, stocks leaves avg unchanged but divides rms by sqrt (10), so avg / rms grows by sqrt (10):

                     avg
        (sqrt (10) * ---) + 1
                     rms
    P = --------------------- = 0.531622776
                   2

and:

    0.5 = 0.531622776 * (1 - (1 / sqrt (t)))

or t > 283 trading days, or a little more than a calendar year. You get the drift.

--
John Conover, john@email.johncon.com, http://www.johncon.com/
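
For reference, the arithmetic in this note can be reproduced with a short C program. This is a minimal sketch, not code from the tsinvest sources; it assumes the figures quoted above (avg = 0.0004, rms = 0.02, t = 12000) and uses the 1 / sqrt (t) approximation to erf throughout. Compile with -lm.

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double avg = 0.0004;  /* mean of the marginal increments */
        double rms = 0.02;    /* root mean square of the marginal increments */
        double t   = 12000.0; /* data set size, in trading days */

        /* likelihood of an up movement, and gain per time interval */
        double P = (avg / rms + 1.0) / 2.0;
        double G = pow (1.0 + rms, P) * pow (1.0 - rms, 1.0 - P);

        /* P compensated for data set size, erf (1 / sqrt (t)) ~ 1 / sqrt (t) */
        double Pc = P * (1.0 - 1.0 / sqrt (t));

        /* smallest t with better than even compensated odds, from
           solving 0.5 = P * (1 - 1 / sqrt (t)) for t */
        double tmin = pow (1.0 / (1.0 - 0.5 / P), 2.0);

        printf ("P    = %f\n", P);           /* 0.510000 */
        printf ("G    = %f\n", G);           /* 1.000200, approximately */
        printf ("P'   = %f\n", Pc);          /* 0.505344, the 1%% risk above */
        printf ("tmin = %.0f days\n", tmin); /* 2601, about 10 years */

        /* ten like, uncorrelated stocks: avg / rms grows by sqrt (10) */
        double P10    = (sqrt (10.0) * avg / rms + 1.0) / 2.0;
        double tmin10 = pow (1.0 / (1.0 - 0.5 / P10), 2.0);

        printf ("P (10 stocks)    = %f\n", P10);          /* 0.531623 */
        printf ("tmin (10 stocks) = %.0f days\n", tmin10); /* 283 */

        return (0);
    }

The printed values match the numbers in the note; varying avg, rms, and t shows how quickly the required data set size grows as avg / rms shrinks.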