Re: portfolio management

From: John Conover <john@email.johncon.com>
Subject: Re: portfolio management
Date: Sat, 19 Sep 1998 18:20:49 -0700

John Conover writes:
>
> BTW, the tsinvest program uses the algorithms out of the
> tsshannoneffective program to avoid making bad investment
> recommendations when searching the ticker for the best set of stocks
> for an optimal growth portfolio. The algorithms used are general, and
> can be included in any program. The sources are in
> http://www.johncon.com/ntropix/archive/tsinvest.tar.gz. Traditional
> statistical estimation is only one method that it uses; it turns out
> that statistical estimation is grossly optimistic with fractal data
> sets. For example, using only statistical estimation, one would
> expect to be able to time the market 51.9635% of the time, which
> would seem to be a very workable agenda, with a significant payoff.
> Not so, however. If one attempts such an agenda with only a 48.2783%
> likelihood of succeeding, one will win sometimes, but in the long
> run, lose the entire portfolio (at a rate of 0.999406857 per day, on
> average). The
> tsinvest program can be programmed to attempt to do market timing, and
> then simulations can be run using the NYSE historical data CDs of
> every stock in the NYSE since 1966. The simulations verify that the
> 48.3% number is, indeed, valid. (That's the main use of the tsinvest
> program: to simulate trading strategies before it's turned loose with
> live data.) The 48.2783% number is also fairly close to empirical
> market metrics from formal studies run in the mutual fund industry.
>

The tsshannoneffective program calculates the effective Shannon
probability, given the average, root mean square, data set size, and
data set duration, of the normalized increments of a time series.

Bottom line, it is for programmed trading (PT) of stocks. The C
sources are freely available as Open Source software on
http://www.johncon.com/ntropix/archive/tsinvest.tar.gz.

A fragment, specific to this discussion, of the manual page is
attached ...

        John

--

John Conover, john@email.johncon.com, http://www.johncon.com/

DESCRIPTION

DATA SET SIZE CONSIDERATIONS
   This  program  addresses the question "is there reasonable
   evidence to justify investment in an equity based on  data
   set size?"

   The Shannon probability of a time series is the likelihood
   that the value of the time series  will  increase  in  the
   next  time  interval.  The Shannon probability is measured
   using the average, avg, and root mean square, rms, of  the
   normalized increments of the time series. Using the rms to
   compute the Shannon probability, P:

           rms + 1
       P = ------- ....................................(1.1)
              2

   However, there is an error associated with the measurement
   of rms due to the size of the data set, N, (i.e., the number
   of records in the time series,) used in the calculation of
   rms.  The confidence level, c, is the likelihood that this
   error is less than some error level, e.

   Over the many time intervals represented in the time
   series, the error will be greater than the error level, e,
   (1 - c) * 100 percent of the time, requiring that the
   Shannon probability, P, be reduced by a factor of c to
   accommodate the measurement error:

            rms - e + 1
       Pc = ----------- ...............................(1.2)
                 2

   where the error level, e, and the confidence level, c, are
   calculated  using statistical estimates, and the product P
   times c is the effective Shannon probability  that  should
   be used in the calculation of optimal wagering strategies.

   The error, e, expressed in terms of the standard deviation
   of the measurement error due to an insufficient data set
   size, esigma, is:

                 e
       esigma = --- sqrt (2N) .........................(1.3)
                rms

   where N is the data set size =  number  of  records.  From
   this,  the  confidence  level  can  be calculated from the
   cumulative sum, (ie., integration) of the normal distribu-
   tion, ie.:

       c (%)  esigma
       -------------
       50     0.67
       68.27  1.00
       80     1.28
       90     1.64
       95     1.96
       95.45  2.00
       99     2.58
       99.73  3.00

   Note that the equation:

            rms - e + 1
       Pc = ----------- ...............................(1.4)
                 2

   will  require  an  iterated  solution since the cumulative
   normal distribution is  transcendental.  For  convenience,
   let  F(esigma)  be the function that given esigma, returns
   c, (ie., performs the table operation, above,) then:

                       rms - e + 1
       P * F(esigma) = -----------
                            2

                             rms * esigma
                       rms - ------------ + 1
                              sqrt (2N)
                     = ---------------------- .........(1.5)
                                 2

   Then:

                                   rms * esigma
                             rms - ------------ + 1
       rms + 1                      sqrt (2N)
       ------- * F(esigma) = ---------------------- ...(1.6)
          2                            2

   or:

                                     rms * esigma
       (rms + 1) * F(esigma) = rms - ------------ + 1 .(1.7)
                                      sqrt (2N)

   Letting a decision variable, decision,  be  the  iteration
   error created by this equation not being balanced:

                        rms * esigma
       decision = rms - ------------ + 1
                          sqrt (2N)

                   - (rms + 1) * F(esigma) ............(1.8)

   which can be iterated to find F(esigma), which is the con-
   fidence level, c.

   Note that from the equation:

            rms - e + 1
       Pc = -----------
                 2

   and solving for rms - e, the effective value of  rms  com-
   pensated  for accuracy of measurement by statistical esti-
   mation:

       rms - e = (2 * P * c) - 1 ......................(1.9)

   and substituting into the equation:

           rms + 1
       P = -------
              2

       rms - e = ((rms + 1) * c) - 1 .................(1.10)

   and defining the effective value of rms as rmseff:

       rmseff = rms - e ..............................(1.11)

   It can be seen that if optimality exists, ie., f = 2P - 1,
   or:

                2
       avg = rms  ....................................(1.12)

   or:
                      2
       avgeff = rmseff  ..............................(1.13)

   As  an example of this algorithm, if the Shannon probabil-
   ity, P, is 0.51, corresponding to an rms of 0.02, then the
   confidence  level,  c,  would  be  0.996298,  or the error
   level, e, would be 0.003776, for a data set  size,  N,  of
   100.

   Likewise, if P is 0.6, corresponding to an rms of 0.2 then
   the confidence level, c, would be 0.941584, or  the  error
   level, e, would be 0.070100, for a data set size of 10.
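
   The iteration on the decision variable of equation (1.8) can
   be sketched in a few lines. What follows is a minimal
   illustration, not the tsshannoneffective source: the names
   normal_cdf and confidence_from_rms are hypothetical, bisection
   is a swapped-in search, and F(esigma) is assumed to be the
   one-sided cumulative normal, computed from the library error
   function instead of a lookup table. Under those assumptions
   the results agree with the two examples above to about four
   decimal places.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal, an assumed stand-in for F(esigma)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def confidence_from_rms(rms, n, iterations=200):
    """Solve equation (1.8), decision = 0, for esigma by bisection.

    Returns (c, e): the confidence level and the error level.
    Hypothetical helper, not taken from tsinvest.
    """
    scale = math.sqrt(2.0 * n)

    def decision(esigma):
        return (rms - rms * esigma / scale + 1.0
                - (rms + 1.0) * normal_cdf(esigma))

    # decision is positive at 0 and decreases as esigma grows;
    # at esigma = sqrt (2N) the error level e equals rms.
    lo, hi = 0.0, scale
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    esigma = 0.5 * (lo + hi)
    return normal_cdf(esigma), rms * esigma / scale


print(confidence_from_rms(0.02, 100))
print(confidence_from_rms(0.2, 10))
```

   Bisection is used here only because the decision variable
   decreases steadily with esigma, so no derivative is needed.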

   Robustness is an issue in algorithms that, potentially,
   operate in real time. The traditional means of implementing
   statistical estimates is to use an integration process
   inside of a loop that calculates the cumulative of the
   normal distribution, controlled by, perhaps, a Newton's
   method approximation using the derivative of the cumulative
   of the normal distribution, i.e., the formula for the normal
   distribution:

                                    2
                    1           - x   / 2
       f(x) = ------------- * e           ............(1.14)
              sqrt (2 * PI)

   Numerical stability and convergence are concerns in such
   processes.

   The Shannon probability of a time series is the likelihood
   that the value of the time series  will  increase  in  the
   next  time  interval.  The Shannon probability is measured
   using the average, avg, and root mean square, rms, of  the
   normalized increments of the time series. Using the avg to
   compute the Shannon probability, P:

           sqrt (avg) + 1
       P = -------------- ............................(1.15)
                 2

   However, there is an error associated with the measurement
   of avg due to the size of the data set, N, (i.e., the number
   of records in the time series,) used in the calculation of
   avg.  The confidence level, c, is the likelihood that this
   error is less than some error level, e.

   Over the many time intervals represented in the time
   series, the error will be greater than the error level, e,
   (1 - c) * 100 percent of the time, requiring that the
   Shannon probability, P, be reduced by a factor of c to
   accommodate the measurement error:

            sqrt (avg - e) + 1
       Pc = ------------------ .......................(1.16)
                    2

   where the error level, e, and the confidence level, c, are
   calculated  using statistical estimates, and the product P
   times c is the effective Shannon probability  that  should
   be used in the calculation of optimal wagering strategies.

   The error, e, expressed in terms of the standard deviation
   of the measurement error due to an insufficient data set
   size, esigma, is:

                 e
       esigma = --- sqrt (N) .........................(1.17)
                rms

   where N is the data set size =  number  of  records.  From
   this,  the  confidence  level  can  be calculated from the
   cumulative sum, (ie., integration) of the normal distribu-
   tion, ie.:

       c (%)  esigma
       -------------
       50     0.67
       68.27  1.00
       80     1.28
       90     1.64
       95     1.96
       95.45  2.00
       99     2.58
       99.73  3.00

   Note that the equation:

            sqrt (avg - e) + 1
       Pc = ------------------ .......................(1.18)
                    2

   will  require  an  iterated  solution since the cumulative
   normal distribution is  transcendental.  For  convenience,
   let  F(esigma)  be the function that given esigma, returns
   c, (ie., performs the table operation, above,) then:

                       sqrt (avg - e) + 1
       P * F(esigma) = ------------------
                               2

                                   rms * esigma
                       sqrt [avg - ------------] + 1
                                     sqrt (N)
                     = ----------------------------- .(1.19)
                                    2

   Then:

       sqrt (avg)  + 1
       --------------- * F(esigma) =
              2

                       rms * esigma
           sqrt [avg - ------------] + 1
                         sqrt (N)
           ----------------------------- .............(1.20)
                        2

   or:

       (sqrt (avg) + 1) * F(esigma) =

                       rms * esigma
           sqrt [avg - ------------] + 1 .............(1.21)
                         sqrt (N)

   Letting a decision variable, decision,  be  the  iteration
   error created by this equation not being balanced:

                               rms * esigma
       decision = sqrt [avg - ------------] + 1
                                 sqrt (N)

                  - (sqrt (avg) + 1) * F(esigma) .....(1.22)

   which can be iterated to find F(esigma), which is the con-
   fidence level, c.

   There are two radicals that have to be protected from
   numerical floating point exceptions. The sqrt (avg) can be
   protected by requiring that avg >= 0, (and returning a
   confidence level of 0.5, or possibly zero, in this
   instance; a negative avg is not an interesting solution for
   the case at hand.)  The other radical:

                   rms * esigma
       sqrt [avg - ------------] .....................(1.23)
                     sqrt (N)

   and substituting:

                 e
       esigma = --- sqrt (N) .........................(1.24)
                rms

   which is:

                          e
                   rms * --- sqrt (N)
                         rms
       sqrt [avg - ------------------] ...............(1.25)
                     sqrt (N)

   and reducing:

       sqrt [avg - e] ................................(1.26)

   requiring that:

       avg >= e ......................................(1.27)

   Note that if e > avg, then Pc < 0.5, which is not an
   interesting solution for the case at hand. Keeping
   avg >= e therefore requires:

                 avg
       esigma <= --- sqrt (N) ........................(1.28)
                 rms

   Obviously, the search algorithm must be prohibited from
   searching for a solution where e > avg (i.e., from testing
   values of esigma above this bound.)

   The  solution  is  to  limit  the search of the confidence
   array to values that are equal to or less than:

       avg
       --- sqrt (N) ..................................(1.29)
       rms

   which can be accomplished  by  setting  integer  variable,
   top, usually set to sigma_limit - 1, to this value.

   Note that from the equation:

            sqrt (avg - e) + 1
       Pc = ------------------
                    2

   and  solving  for avg - e, the effective value of avg com-
   pensated for accuracy of measurement by statistical  esti-
   mation:

                                  2
       avg - e = ((2 * P * c) - 1)  ..................(1.30)

   and substituting into the equation:

           sqrt (avg) + 1
       P = --------------
                 2

                                             2
       avg - e = (((sqrt (avg) + 1) * c) - 1)  .......(1.31)

   and defining the effective value of avg as avgeff:

       avgeff = avg - e ..............................(1.32)

   It can be seen that if optimality exists, ie., f = 2P - 1,
   or:

                2
       avg = rms  ....................................(1.33)

   or:

       rmseff = sqrt (avgeff) ........................(1.34)

   As an example of this algorithm, if the Shannon  probabil-
   ity, P, is 0.52, corresponding to an avg of 0.0016, and an
   rms of 0.04,  then  the  confidence  level,  c,  would  be
   0.987108,  or the error level, e, would be 0.000893, for a
   data set size, N, of 10000.

   Likewise, if P is 0.6, corresponding to an rms of 0.2, and
   an  avg  of  0.04,  then the confidence level, c, would be
   0.922759, or the error level, e, would be 0.028484, for  a
   data set size of 100.
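
   The same search, with the radical protected and the upper
   bound of equation (1.29) applied, can be sketched as follows.
   This is an illustration under assumptions, not the original
   implementation: confidence_from_avg and normal_cdf are
   hypothetical names, F(esigma) is assumed to be the one-sided
   cumulative normal, bisection replaces the table search, and
   top is a floating point bound on esigma instead of an integer
   array index. With these choices the results agree with the
   two examples above to about four decimal places.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal, an assumed stand-in for F(esigma)."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def confidence_from_avg(avg, rms, n, iterations=200):
    """Solve equation (1.22), decision = 0, for esigma by bisection.

    Returns (c, e). Hypothetical helper, not taken from tsinvest.
    """
    if avg <= 0.0 or rms <= 0.0:
        # sqrt (avg) guard; 0.5 follows the text, e = 0 is a placeholder.
        return 0.5, 0.0
    root_n = math.sqrt(n)
    top = avg / rms * root_n  # search limit of equation (1.29)

    def decision(esigma):
        # max() guards sqrt against rounding just past the limit.
        inner = max(avg - rms * esigma / root_n, 0.0)
        return (math.sqrt(inner) + 1.0
                - (math.sqrt(avg) + 1.0) * normal_cdf(esigma))

    lo, hi = 0.0, top  # never search above the limit, so avg >= e
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    esigma = 0.5 * (lo + hi)
    return normal_cdf(esigma), rms * esigma / root_n


print(confidence_from_avg(0.0016, 0.04, 10000))
print(confidence_from_avg(0.04, 0.2, 100))
```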

   The Shannon probability of a time series is the likelihood
   that the value of the time series  will  increase  in  the
   next  time  interval.  The Shannon probability is measured
   using the average, avg, and root mean square, rms, of  the
   normalized  increments  of the time series. Using both the
   avg and the rms to compute the Shannon probability, P:

           avg
           --- + 1
           rms
       P = ------- ...................................(1.35)
              2

   However, there is an error associated with both the
   measurement of avg and rms due to the size of the data set,
   N, (i.e., the number of records in the time series,) used in
   the calculation of avg and rms. The confidence level, c,
   is the likelihood that this error is less than some error
   level, e.

   Over the many time intervals represented in the time
   series, the error will be greater than the error level, e,
   (1 - c) * 100 percent of the time, requiring that the
   Shannon probability, P, be reduced by a factor of c to
   accommodate the measurement error:

                     avg - ea
                     -------- + 1
                     rms + er
       P * ca * cr = ------------ ....................(1.36)
                          2

   where  the  error level, ea, and the confidence level, ca,
   are calculated using statistical estimates, for  avg,  and
   the  error  level,  er,  and the confidence level, cr, are
   calculated using statistical estimates for  rms,  and  the
   product  P  * ca * cr is the effective Shannon probability
   that should be used in the calculation of optimal wagering
   strategies, (which is the product of the Shannon probabil-
   ity, P, times the superposition of the two confidence lev-
   els,  ca,  and cr, ie., P * ca * cr = Pc, eg., the assump-
   tion is made that the error in avg and the  error  in  rms
   are independent.)

   The error, er, expressed in terms of the standard deviation
   of the measurement error due to an insufficient data set
   size, esigmar, is:

                 er
       esigmar = --- sqrt (2N) .......................(1.37)
                 rms

   where  N  is  the  data set size = number of records. From
   this, the confidence level  can  be  calculated  from  the
   cumulative sum, (ie., integration) of the normal distribu-
   tion, ie.:

       cr (%) esigmar
       --------------
       50     0.67
       68.27  1.00
       80     1.28
       90     1.64
       95     1.96
       95.45  2.00
       99     2.58
       99.73  3.00

   Note that the equation:

                  avg
                -------- + 1
                rms + er
       P * cr = ------------ .........................(1.38)
                     2

   will require an iterated  solution  since  the  cumulative
   normal  distribution  is  transcendental. For convenience,
   let F(esigmar) be the function that given esigmar, returns
   cr, (ie., performs the table operation, above,) then:

                          avg
                        -------- + 1
                        rms + er
       P * F(esigmar) = ------------ =
                             2

                                avg
                        ------------------- + 1
                              esigmar * rms
                        rms + -------------
                                sqrt (2N)
                        ----------------------- ......(1.39)
                                   2

   Then:

       avg
       --- + 1
       rms
       ------- * F(esigmar) =
          2

                      avg
              ------------------- + 1
                    esigmar * rms
              rms + -------------
                      sqrt (2N)
              ----------------------- ................(1.40)
                         2

   or:

        avg
       (--- + 1) * F(esigmar) =
        rms

                      avg
              ------------------- + 1 ................(1.41)
                    esigmar * rms
              rms + -------------
                      sqrt (2N)

   Letting  a  decision  variable, decision, be the iteration
   error created by this equation not being balanced:

                          avg
       decision =  ------------------- + 1
                         esigmar * rms
                   rms + -------------
                          sqrt (2N)

                      avg
                   - (--- + 1) * F(esigmar) ..........(1.42)
                      rms

   which can be iterated to find  F(esigmar),  which  is  the
   confidence level, cr.

   The error, ea, expressed in terms of the standard deviation
   of the measurement error due to an insufficient data set
   size, esigmaa, is:

                 ea
       esigmaa = --- sqrt (N) ........................(1.43)
                 rms

   where  N  is  the  data set size = number of records. From
   this, the confidence level  can  be  calculated  from  the
   cumulative sum, (ie., integration) of the normal distribu-
   tion, ie.:

       ca (%) esigmaa
       --------------
       50     0.67
       68.27  1.00
       80     1.28
       90     1.64
       95     1.96
       95.45  2.00
       99     2.58
       99.73  3.00

   Note that the equation:

                avg - ea
                -------- + 1
                  rms
       P * ca = ------------ .........................(1.44)
                     2

   will require an iterated  solution  since  the  cumulative
   normal  distribution  is  transcendental. For convenience,
   let F(esigmaa) be the function that given esigmaa, returns
   ca, (ie., performs the table operation, above,) then:

                        avg - ea
                        -------- + 1
                          rms
       P * F(esigmaa) = ------------ =
                             2

                              esigmaa * rms
                        avg - -------------
                                sqrt (N)
                        ------------------- + 1
                                  rms
                        ----------------------- ......(1.45)
                                   2

   Then:

       avg
       --- + 1
       rms
       ------- * F(esigmaa) =
          2

                    esigmaa * rms
              avg - -------------
                      sqrt (N)
              ------------------- + 1
                        rms
              ----------------------- ................(1.46)
                         2

   or:

        avg
       (--- + 1) * F(esigmaa) =
        rms

                    esigmaa * rms
              avg - -------------
                      sqrt (N)
              ------------------- + 1 ................(1.47)
                        rms

   Letting  a  decision  variable, decision, be the iteration
   error created by this equation not being balanced:

                        esigmaa * rms
                  avg - -------------
                          sqrt (N)
       decision = ------------------- + 1
                            rms

              avg
           - (--- + 1) * F(esigmaa) ..................(1.48)
              rms

   which can be iterated to find  F(esigmaa),  which  is  the
   confidence level, ca.

   Note that from the equation:

                  avg
                -------- + 1
                rms + er
       P * cr = ------------
                     2

   and  solving for rms + er, the effective value of rms com-
   pensated for accuracy of measurement by statistical  esti-
   mation:

                        avg
       rms + er = ---------------- ...................(1.49)
                  (2 * P * cr) - 1

   and substituting into the equation:

           avg
           --- + 1
           rms
       P = -------
              2

                          avg
       rms + er = -------------------- ...............(1.50)
                    avg
                  ((--- + 1) * cr) - 1
                    rms

   and defining the effective value of rms as rmseff:

       rmseff = rms +/- er ...........................(1.51)

   Note that from the equation:

                avg - ea
                -------- + 1
                  rms
       P * ca = ------------
                     2

   and  solving for avg - ea, the effective value of avg com-
   pensated for accuracy of measurement by statistical  esti-
   mation:

       avg - ea = ((2 * P * ca) - 1) * rms ...........(1.52)

   and substituting into the equation:

           avg
           --- + 1
           rms
       P = -------
              2

                     avg
       avg - ea = (((--- + 1) * ca) - 1) * rms .......(1.53)
                     rms

   and defining the effective value of avg as avgeff:

       avgeff = avg - ea .............................(1.54)

   As  an example of this algorithm, if the Shannon probabil-
   ity, P, is 0.51, corresponding to an rms of 0.02, then the
   confidence level, c, would be 0.983847, or the error level
   in avg, ea, would be 0.000306, and the error level in rms,
   er, would be 0.001254, for a data set size, N, of 20000.

   Likewise, if P is 0.6, corresponding to an rms of 0.2 then
   the confidence level, c, would be 0.947154, or  the  error
   level  in  avg, ea, would be 0.010750, and the error level
   in rms, er, would be 0.010644, for a data set size of  10.
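
   The two decision variables of equations (1.42) and (1.48)
   can be iterated independently and the results combined as
   P * ca * cr. The sketch below only solves those two equations
   as written; it is not the tsshannoneffective source, and no
   attempt is made to reproduce the figures quoted above. The
   function names, the bisection search, the search ceiling of
   40, and the use of the one-sided cumulative normal for
   F(esigmar) and F(esigmaa) are all assumptions made for
   illustration, and avg and rms are taken to be positive.

```python
import math


def normal_cdf(x):
    """One-sided cumulative normal, an assumed stand-in for F."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def bisect_decreasing(decision, lo, hi, iterations=200):
    """Root of a decreasing function on [lo, hi] by bisection."""
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if decision(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def confidence_from_avg_and_rms(avg, rms, n, ceiling=40.0):
    """Return (ca, ea, cr, er, p_eff) for positive avg and rms.

    Hypothetical helper; ceiling is an assumed upper search bound.
    """
    p = (avg / rms + 1.0) / 2.0  # equation (1.35)
    root_n = math.sqrt(n)
    root_2n = math.sqrt(2.0 * n)

    def decision_r(esigmar):  # equation (1.42)
        return (avg / (rms + esigmar * rms / root_2n) + 1.0
                - (avg / rms + 1.0) * normal_cdf(esigmar))

    def decision_a(esigmaa):  # equation (1.48)
        return ((avg - esigmaa * rms / root_n) / rms + 1.0
                - (avg / rms + 1.0) * normal_cdf(esigmaa))

    esigmar = bisect_decreasing(decision_r, 0.0, ceiling)
    esigmaa = bisect_decreasing(decision_a, 0.0, ceiling)
    cr, er = normal_cdf(esigmar), esigmar * rms / root_2n
    ca, ea = normal_cdf(esigmaa), esigmaa * rms / root_n
    # Effective Shannon probability, assuming independent errors.
    return ca, ea, cr, er, p * ca * cr


print(confidence_from_avg_and_rms(0.0004, 0.02, 20000))
```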

   As  a  final discussion to this section, consider the time
   series for an equity. Suppose that the data  set  size  is
   finite,  and avg and rms have both been measured, and have
   been found to both be positive. The question that needs to
   be  resolved  concerns  the  confidence, not only in these
   measurements, but the actual  process  that  produced  the
   time  series.  For example, suppose, although there was no
   knowledge of the fact, that the time series  was  actually
   produced  by  a  Brownian motion fractal mechanism, with a
   Shannon probability of exactly 0.5. We would expect a
   "growth" phenomenon for extended time intervals [Sch91, p.
   152], in the time series, (in point of fact, we would
   expect  the  cumulative distribution of the length of such
   intervals to be proportional to erf (1 / sqrt (t)).)  Note
   that,  inadvertently, such a time series would potentially
   justify investment. What the methodology outlined in  this
   section  does is to preclude such scenarios by effectively
   lowering  the  Shannon  probability  to  accommodate  such
   issues. In such scenarios, the lowered Shannon probability
   will cause data sets with larger sizes  to  be  "favored,"
   unless  the  avg  and  rms  of a smaller data set size are
   "strong" enough in relation to the  Shannon  probabilities
   of the other equities in the market. Note that if the data
   set sizes of all equities in the market  are  small,  none
   will  be  favored,  since they would all be lowered by the
   same amount, (if they were all statistically similar.)

   To reiterate, in the equation avg = rms * (2P  -  1),  the
   Shannon  probability, P, can be compensated by the size of
   the data set, ie., Peff, and used in the equation avgeff =
   rms  * (2Peff - 1), where rms is the measured value of the
   root mean square of the normalized increments, and  avgeff
   is  the effective, or compensated value, of the average of
   the normalized increments.

DATA SET DURATION CONSIDERATIONS
   An additional accuracy issue, besides data set size, is
   the time interval over which the data was obtained. There
   is some possibility that the data set was taken during an
   extended run length, either negative or positive, and the
   Shannon probability will have to be compensated to
   accommodate this measurement error. The chance that a run
   length will exceed time, t, is:

       erf (1 / sqrt (t)) ............................(1.55)

   or the Shannon probability, P, will have to be compensated
   by a factor of:

       1 - erf (1 / sqrt (t)) ........................(1.56)

   giving a compensated Shannon probability, Pcomp:

       Pcomp = Peff * (1 - erf (1 / sqrt (t)))........(1.57)
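
   Equation (1.57) is a one-line calculation. A minimal sketch
   follows, assuming t is the data set duration expressed in
   time intervals; the function name is hypothetical, and the
   library error function is used directly in place of a lookup
   table.

```python
import math


def compensate_for_duration(p_eff, t):
    """Equation (1.57): Pcomp = Peff * (1 - erf (1 / sqrt (t)))."""
    return p_eff * (1.0 - math.erf(1.0 / math.sqrt(t)))


# The compensation shrinks as the duration grows.
for t in (100, 10000):
    print(t, compensate_for_duration(0.6, t))
```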

   Fortunately, since confidence levels are calculated from
   the normal probability function, the same lookup table
   used for confidence calculations (i.e., the cumulative of a
   normal distribution,) can be used to calculate the
   associated error function.

   To use the value of the normal probability function to
   calculate the error function, erf (N), proceed as follows,
   since erf (X / sqrt (2)) represents the error function
   associated with the normal curve:

       A) X = N * sqrt (2).

       B) Look up the value of X in the normal probability
   function.

       C) Subtract 0.5 from this value.

       D) And, multiply by 2.

   or:

       erf (N) = 2 * (normal (N * sqrt (2)) - 0.5) ...(1.58)
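
   Steps A) to D) can be checked numerically. In the sketch
   below, which is an illustration only, the lookup table is
   replaced by a Simpson's rule integration of the normal
   density of equation (1.14), and the result is compared with
   the library error function. The names normal_cdf and
   erf_from_normal, and the step count, are assumptions.

```python
import math


def normal_cdf(x, steps=2000):
    """Cumulative normal by Simpson's rule; steps must be even."""
    def density(u):
        return math.exp(-u * u / 2.0) / math.sqrt(2.0 * math.pi)

    h = x / steps
    total = density(0.0) + density(x)
    for i in range(1, steps):
        total += (4.0 if i % 2 else 2.0) * density(i * h)
    # The integral from 0 to x, added to the 0.5 below the mean.
    return 0.5 + total * h / 3.0


def erf_from_normal(n):
    """Error function from the cumulative normal, steps A) to D)."""
    x = n * math.sqrt(2.0)       # step A
    value = normal_cdf(x)        # step B
    return 2.0 * (value - 0.5)   # steps C and D


for n in (0.1, 0.5, 1.0):
    print(n, erf_from_normal(n), math.erf(n))
```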
Last modified: Fri Mar 26 18:54:02 PST 1999 $Id: 980919182130.4311.html,v 1.0 2001/11/17 23:05:50 conover Exp $