From: John Conover <john@email.johncon.com>
Subject: Data Set Sizes
Date: Mon, 8 Aug 2005 21:06:15 -0700
I'm getting a lot of questions on the data set sizes required for equity investments. After the US dot-bomb "bubble" burst in 2000, one would think that we would have all learned not to buy into "bubbles," with decisions based on small data set sizes taken inside the "bubble." But the way things are shaping up, 2009 in the Chinese technology markets will be the sequel to the US markets of 2000.

The way tsinvest calculates the uncertainty, or risk, due to inadequate data set sizes is quite complicated, (it's controlled with the -c and -C command line options,) but you can approximate it quite accurately, almost in your head.

Consider a political poll. We want to know how much of the population will vote for candidate X, and we wish to determine that by sampling the population, so there will be a small error due to the sampling. For that we would use statistical estimation to determine the "margin of error," which is twice the standard deviation, and is 1 / sqrt (n), where n is the number of opinions sampled. Two, (double sided,) standard deviations correspond to a probability of 0.954499736103641736, which is, also, called the "95% confidence level." (All of this comes from Bernoulli p-trials, in case you want to pursue the subject further.)

Suppose the sample size, n, is 1000, (a typical number in political polls.) Then the margin of error would be 1 / sqrt (n) = 0.0316, or about 3%. What that means is that if we repeat like polls, (i.e., of a thousand samples, each,) a hundred times, then about 5 of those times the real value would differ from the sampled value by more than +/- 3%. That's where the "Candidate X is favored by 60% over Candidate Y, with a margin of error of 3%" comes from on the six o'clock talking head shows.

The same thing can be used in equity selection, too. The probability of an up movement, P, in an equity's value is:

    P = ((avg / rms) + 1) / 2

(see http://www.johncon.com/john/correspondence/020213233852.26478.html for particulars,) where avg is the average, and rms is the root-mean-square, of the marginal increments of the equity's price. The margin of error in measuring rms is rms / sqrt (n), for n many samples in the time series, and so is the error in avg, i.e., rms converges to a fair accuracy very quickly, and avg does not. Most of the error in measuring P is in the measurement of avg, since avg << rms, and they both have the same magnitude of error.

So, as a fairly accurate approximation, instead of using the measured avg, we reduce avg by an amount rms / sqrt (n) to calculate an "effective P," P', and then use that in the equation:

    G = ((1 + rms)^P') * ((1 - rms)^(1 - P'))

to select which equities have the best pro forma. Note that:

    P' = (((avg / rms) - (1 / sqrt (n))) + 1) / 2

so really, all that happened was to subtract 1 / sqrt (n) from avg / rms.

But what does that have to do with "bubbles" in equity prices? Looking at it from a different angle, the chances of a zero free interval of at least n many time units in a Brownian motion fractal is erf (1 / sqrt (n)), which for n >> 1 is approximately 1 / sqrt (n). In other words, the chances, (or probability,) of a "bubble" lasting through an entire measurement of n many samples, (and giving misleading information about P due to unfortunately making the measurement in a "bubble,") is about 1 / sqrt (n), which, (since it's an uncertainty,) could be subtracted from P to obtain P', as above. (The previous calculation ignored data set size effects in rms, so its error term is 1 / 2 the error term of the latter.)
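As a sanity check on the polling arithmetic, here is a minimal C sketch, (it assumes C99's erf () from math.h, and is illustrative only; it is not part of tsinvest):

    /* margin of error for a poll of n Bernoulli p-trials;
       compile with: cc poll.c -lm */

    #include <stdio.h>
    #include <math.h>

    int main (void)
    {
        double n = 1000.0; /* number of opinions sampled */

        /* probability of falling within two, (double sided,)
           standard deviations of a normal distribution, i.e.,
           the "95% confidence level" */
        printf ("two sigma confidence = %.18f\n",
            erf (2.0 / sqrt (2.0)));

        /* margin of error, twice the standard deviation of the
           sampled fraction, approximately 1 / sqrt (n) */
        printf ("margin of error = %f\n", 1.0 / sqrt (n));

        return (0);
    }

which prints a margin of error of 0.031623, the 3% figure, above.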
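The effective probability and gain can be sketched the same way, (the avg and rms values are the typical US daily-close figures used in the example that follows; again, an illustration of the formulas above, and not the tsinvest implementation):

    /* effective probability, P', and gain, G, for an equity with
       measured avg and rms of the marginal increments, over a time
       series of n many samples; compile with: cc gain.c -lm */

    #include <stdio.h>
    #include <math.h>

    int main (void)
    {
        double avg = 0.0004; /* average of the marginal increments */
        double rms = 0.02;   /* root-mean-square of the increments */
        double n = 2500.0;   /* number of samples in the time series */

        /* measured probability of an up movement */
        double P = ((avg / rms) + 1.0) / 2.0;

        /* effective probability; subtract 1 / sqrt (n) from
           avg / rms, i.e., reduce avg by rms / sqrt (n) */
        double Pp = (((avg / rms) - (1.0 / sqrt (n))) + 1.0) / 2.0;

        /* gain, per time unit, using the effective probability */
        double G = pow (1.0 + rms, Pp) * pow (1.0 - rms, 1.0 - Pp);

        /* minimum data set size such that rms / sqrt (n) < avg */
        double n_min = pow (rms / avg, 2.0);

        printf ("P = %f, P' = %f, G = %f\n", P, Pp, G);
        printf ("minimum n = %.0f days, about %.0f years\n",
            n_min, n_min / 253.0);

        return (0);
    }

At exactly n = 2500, P' falls to 0.5, the break-even point, which is where the minimum data set size in the example, below, comes from.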
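And the approximation erf (1 / sqrt (n)) ~ 1 / sqrt (n), for n >> 1, can be checked numerically, (for small x, erf (x) approaches (2 / sqrt (pi)) x, about 1.13x, so the two agree to within that constant):

    /* chances of a zero free interval of at least n many time
       units in a Brownian motion fractal, erf (1 / sqrt (n)),
       against the approximation 1 / sqrt (n); compile with:
       cc zero.c -lm */

    #include <stdio.h>
    #include <math.h>

    int main (void)
    {
        double n;

        for (n = 100.0; n <= 10000.0; n = n * 10.0)
        {
            printf ("n = %5.0f: erf (1 / sqrt (n)) = %f, "
                "1 / sqrt (n) = %f\n",
                n, erf (1.0 / sqrt (n)), 1.0 / sqrt (n));
        }

        return (0);
    }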
Where does the error term come from? It's the error in the market's assessment of a fair value for an equity. A "bubble" is the assessment process, the process of determining a fair market value.

An example: a typical equity's avg is 0.0004, and rms is 0.02, measured on daily closes in the US equity markets. How many days, minimum, must be included in the time series for an analysis? Obviously, the effective avg, avg - rms / sqrt (n), must be greater than zero for P' > 0.5, so rms / sqrt (n) < avg, or sqrt (n) = 0.02 / 0.0004 = 50, or n = 50^2 = 2500, or about 10 years of 253 trading days per year, (half that if you prefer the first method.) This would ensure that one was not buying into a "bubble." However, note that for better performing equities that have larger avg values, the data set size requirements are much smaller, (the one used in the example, above, only does about a 5-10% gain a year, which is typical for equities in the US markets.)

    John

--

John Conover, john@email.johncon.com, http://www.johncon.com/