## Abstract

The optimal method for determining the number of latent factors in a dataset is an unresolved problem in exploratory factor analysis. This article uses several of the most commonly cited methods to determine the number of relevant factors in developed equity markets, finding that there are typically between 10 and 20. The results of these tests are evaluated against the optimal number of factors for estimating realized correlations. The author concludes that the information criteria and random matrix theory approaches provide the best results. Notwithstanding these results, the author finds that filtered correlation matrixes provide only a marginal advantage over the sample correlation matrix when estimating correlations. The study then examines economic interpretations of the latent factors, which adds context to the evaluation of the number of factors. Finally, the author compares the efficacy of the modeled correlations in an important real-world application: minimum variance portfolio construction.

**Key Findings**

▪ There are 10–20 factors in developed equity markets: Factor 1 is the market factor, factors 2–4 are regional factors, and factors 5–20 are primarily geographic factors with a few industry-oriented factors also present.

▪ Random matrix theory and information criteria provide accurate ex ante estimates of the number of latent factors in a dataset.

▪ Filtered correlation matrixes do not provide a significant advantage over the sample correlation matrix in estimating future correlations.

Estimating the correlation between assets is important in financial applications ranging from asset allocation and portfolio construction to risk management and derivative pricing. The sample correlation matrix is often used to estimate correlations for these purposes and is known to be an unbiased estimator of the true correlations. However, it tends to contain noise that is not relevant out of sample. This problem is exacerbated when the number of assets *N* is greater than the number of observations *T* (Bai and Shi 2011). Various methods have been proposed to address this issue, the most common being dimensionality reduction and shrinkage (Ledoit and Wolf 2003). The focus of this article is the dimensionality reduction, or factor model, approach.

Factors are a common theme in finance and are often used in the context of expected returns. This practice has its origins in Sharpe’s (1963) capital asset pricing model (CAPM), which used the market as its sole factor. The arbitrage pricing theory (APT) of Ross (1976) expanded this model to cover multiple risk premiums or factors. Assuming *k* factors, this model can be generalized to

*r* = Λ*f* + ε (1)

where *r* is an *N* × 1 vector of stock returns, Λ is an *N* × *k* matrix of factor exposures (betas), *f* is a *k* × 1 vector of factor returns, and ε is an *N* × 1 vector of idiosyncratic returns. Much of the subsequent research into factors has focused on finding observable stock characteristics that carry a risk premium, with notable examples being the size and value factors of Fama and French (1993), the momentum factor of Jegadeesh and Titman (1993), and the quality factor of Asness, Frazzini, and Pedersen (2013).

Although the rationale behind the discovery of these factors was to explain the cross-section of stock returns, they are often used to model covariance. By assuming the factors and idiosyncratic returns in Equation 1 are uncorrelated, the stock covariance matrix Σ can be modeled as

Σ = ΛKΛ′ + Ψ (2)

where K is the *k* × *k* factor covariance matrix and Ψ is an *N* × *N* diagonal matrix of stock-specific variances (diagonal because the idiosyncratic returns are assumed to be mutually uncorrelated), resulting in a reduction in the dimension of the correlation estimation problem from *N* to *k*.
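As an illustration of this factor covariance model, the following NumPy sketch assembles Σ from exposures, a factor covariance matrix, and diagonal specific variances. The dimensions and the randomly drawn Λ, K, and Ψ are hypothetical, not taken from the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions (hypothetical, not from the article)
N, k = 50, 3

Lam = rng.normal(size=(N, k))            # N x k factor exposures (Lambda)
K = np.diag(rng.uniform(0.5, 2.0, k))    # k x k factor covariance matrix
Psi = np.diag(rng.uniform(0.1, 0.5, N))  # N x N diagonal specific variances

# Sigma = Lambda K Lambda' + Psi
Sigma = Lam @ K @ Lam.T + Psi
```

Because K is positive on its diagonal and Ψ is positive definite, Σ is symmetric and positive definite by construction, so it is a valid covariance matrix whatever the exposures.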

Observable factor models using this approach often combine the aforementioned style factors with industry, country, and currency variables, resulting in up to a hundred factors for a global equity market model (e.g., MSCI BARRA Global Equity Model).

An alternative approach to factor modeling is to derive statistical or latent factors directly from the stock covariance (or correlation) matrix through principal component analysis (PCA). Chamberlain and Rothschild (1983) showed that if *k* eigenvalues of the stock covariance matrix increase as the number of assets *N* increases, then there are *k* factors, and the corresponding *k* eigenvectors can be used as factor sensitivities. Connor and Korajczyk (1988) showed that the same is true for the stock correlation matrix.

With a latent factor model, both Λ and *f* in Equation 1 are unobservable. The PCA approach sequentially finds the set of orthogonal latent factors that explains the most variance in the data. This is equivalent to solving the following optimization problem:

max_{Λ} tr(Λ′ΩΛ)

subject to the constraint Λ′Λ = *I*_{k}, where the constraint ensures orthogonality. PCA can be used to derive a model in the form of Equation 1 by setting Λ as *N*^{½} times the matrix of eigenvectors corresponding to the *k* largest eigenvalues and *f* as *r* × Λ × *N*^{−1}.
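This derivation can be sketched in NumPy on simulated returns (all dimensions and data below are hypothetical): eigenvectors of the sample correlation matrix, scaled by √*N*, form Λ, and factor returns are recovered as *r*Λ/*N*.

```python
import numpy as np

rng = np.random.default_rng(1)
T, N, k = 500, 40, 3

# Simulated T x N return matrix with a k-factor structure (hypothetical data)
f_true = rng.normal(size=(T, k))
B = rng.normal(size=(N, k))
r = f_true @ B.T + 0.5 * rng.normal(size=(T, N))

# PCA on the sample correlation matrix
Omega = np.corrcoef(r, rowvar=False)  # N x N
eigval, eigvec = np.linalg.eigh(Omega)
order = np.argsort(eigval)[::-1]      # descending order, as in the article
eigval, eigvec = eigval[order], eigvec[:, order]

# Lambda = sqrt(N) * top-k eigenvectors; f = r Lambda / N
Lam = np.sqrt(N) * eigvec[:, :k]      # N x k factor sensitivities
f_hat = r @ Lam / N                   # T x k latent factor returns
```

Under this normalization Λ′Λ/*N* is the identity, so the recovered factors inherit the orthogonality of the eigenvectors.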

An important assumption underlying latent factor models is that all relevant information is contained within stock returns; hence, industry and country mappings and the fundamental information used in the construction of style factors provide no benefit to the modeling of correlations. Therefore, one appeal of a latent factor model is that it reduces model risk because minimal decisions are left to practitioners and only return data are required in its construction. Another advantage is that it is parsimonious—as we demonstrate, a developed equity market latent factor model will typically consist of 10 to 20 factors, which (as noted earlier) is considerably fewer than an equivalent observable factor model. This parsimony is a natural result of the factor orthogonality. In contrast, observable factor models offer greater interpretability to users but may contain factors that are irrelevant or highly correlated with each other.

A key consideration when estimating a latent factor model is how many factors to retain. The eigenvalues of a stock covariance or correlation matrix typically exhibit a spiked distribution, whereby a few significant components explain a large portion of the variance in the data. However, the division between components representing genuine factors and those representing idiosyncratic effects is not obvious.

Exhibit 1 displays the 50 largest eigenvalues of a correlation matrix of developed market daily stock returns from March 2017 to March 2019; the spiked distribution is clear.

The problem of how to determine the number of significant components in a dataset has received considerable attention in the literature, and a wide variety of methods have been proposed. This study compares several commonly cited tests and assesses their ability to estimate equity market correlations. Establishing an accurate ex ante test of the number of latent factors in a dataset is important for correlation modeling but also has broader relevance: Determining the number of truly independent risk factors in equity markets has important implications for portfolio diversification.

This article makes three main contributions: First, we provide evidence that the number of latent factors in developed equity market returns is between 10 and 20; second, we detail which latent factor quantification tests provide accurate ex ante estimates; and third, we provide economic interpretations of the latent factors, giving insight into how they relate to portfolio construction.

The remainder of this article is structured as follows. First, we review the factor quantification tests proposed in the literature. Next, we outline the data and notation used for our testing. We then examine the optimal number of factors for estimating developed equity market correlations. The next section evaluates the accuracy of the factor quantification tests outlined in the literature. After this, we discuss possible economic interpretations of the PCA factors. In the penultimate section, we apply correlation estimates to the construction of minimum variance portfolios. Finally, we present our conclusions.

## FACTOR RETENTION LITERATURE OVERVIEW

The problem of how to determine the number of relevant factors in a dataset is approached from several perspectives in the literature. This section provides a curated summary and is organized into three groups: information criteria, random matrix theory (RMT), and eigenvalue divergence tests.

### Information Criteria

Finding the number of factors to retain may be considered a model selection problem. Consider a set of candidate latent factor models *M*_{1}, …, *M*_{k}, …, *M*_{N}, which are defined by the number of factors they contain; if there are *k* factors in the dataset, then model *M*_{k} will be the preferred model. Two commonly used model selection criteria are the Akaike information criterion (AIC) (Akaike 1973) and the Bayesian information criterion (BIC) (Schwarz 1978), as follows:

AIC = 2*p* − 2ln(*L̂*)

BIC = *p*ln(*T*) − 2ln(*L̂*)

where *L̂* is the maximized value of a likelihood function of the model and *p* is the number of model parameters. Given a set of candidate models, the one with the lowest AIC or BIC is chosen. Fujikoshi, Ulyanov, and Shimizu (2010) and Fujikoshi and Sakurai (2015) showed that when applied to PCA factor identification, these two criteria can be calculated as functions of the number of variables *N*, the number of observations *T*, and the eigenvalues λ. Typically, the AIC and BIC are used where *T* > *N*; however, Bai, Fujikoshi, and Choi (2018) showed that the AIC and BIC can be applied to the higher-dimension case where *N* > *T*, although the number of factors *k* must still be fewer than *T* − 2. Bai, Fujikoshi, and Choi derived modified criteria for this purpose, shown in the Appendix in Equations A1 and A2. These tests may be applied to either the covariance or correlation matrixes.

Another commonly cited set of tests is the panel criteria proposed by Bai and Ng (2002), which are applied to the covariance matrix. The criteria are of the form

IC(*k*) = *V*(*k*) + *k* × *g*(*N*, *T*)

where IC is the criterion to be minimized, *k* is the number of factors being tested, *V*(*k*) is the sum of the squared idiosyncratic returns in Equation 1, and *g*(*N*, *T*) is a penalty function that encourages the use of fewer factors. The inclusion of additional factors will always result in smaller in-sample squared idiosyncratic returns; hence, the penalty function is designed to counteract overfitting. Bai and Ng proposed six penalty functions, three of which (PC_{1}, PC_{2}, and PC_{3}) depend on a prespecified maximum number of factors *k*_{max} that is known with certainty to be larger than the true number. *k*_{max} is used to derive *V*(*k*_{max}), which is the sum of the squared idiosyncratic returns from the model with *k*_{max} factors. The remaining three penalty functions (IC_{1}, IC_{2}, and IC_{3}) are simply functions of *N* and *T*. The penalty functions are shown in the Appendix in Equations A3 to A8. Given a set of candidate models, the one with the lowest PC or IC test statistic is chosen.
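A Bai–Ng-style selection procedure can be sketched as follows. The penalty used here is the familiar IC_{1} form, *g*(*N*, *T*) = ((*N* + *T*)/(*NT*))ln(*NT*/(*N* + *T*)); the article's exact penalties are those in its Appendix, so treat this as an illustrative assumption rather than a reproduction.

```python
import numpy as np

def bai_ng_ic(r, k_max):
    """Estimate the number of factors by minimizing a criterion of the
    Bai-Ng type: ln(V(k)) + k * g(N, T). The penalty below is the IC_1
    form (an assumption; the article's exact penalties are in its Appendix).
    r : T x N matrix of (approximately mean-zero) returns.
    """
    T, N = r.shape
    # Principal components from the N x N second-moment matrix
    cov = r.T @ r / T
    eigval, eigvec = np.linalg.eigh(cov)
    eigvec = eigvec[:, np.argsort(eigval)[::-1]]   # descending order
    g = (N + T) / (N * T) * np.log(N * T / (N + T))
    best_k, best_ic = 0, np.inf
    for k in range(1, k_max + 1):
        Lam = eigvec[:, :k]                  # N x k loadings
        resid = r - r @ Lam @ Lam.T          # idiosyncratic returns
        V = (resid ** 2).sum() / (N * T)     # mean squared residual
        ic = np.log(V) + k * g
        if ic < best_ic:
            best_k, best_ic = k, ic
    return best_k
```

With a strong simulated factor structure, the criterion stops adding components once the residual reduction from an extra factor falls below the penalty.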

### Random Matrix Theory

Another approach to the factor identification problem uses the ideas of RMT. Marchenko and Pastur (1967) showed that the eigenvalues of a correlation matrix of random data will adhere to a Marchenko–Pastur distribution, whereby the eigenvalues remain within set bounds (see Equation A9 in the Appendix). Therefore, eigenvalues outside of the Marchenko–Pastur bounds can be assumed to be associated with factors, whereas those within the bounds adhere to the characteristics of random data and are hence noise. Similarly, the components of the eigenvectors of random data will follow a normal distribution, whereas the eigenvectors corresponding to real factors will diverge. Plerou et al. (2000) applied this theory to stock returns and confirmed that 98% of the eigenvalues conform to the characteristics of RMT and that the 2% that diverge are the market factor and some sector effects.
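Under these assumptions, counting eigenvalues above the upper Marchenko–Pastur edge yields a factor estimate. A minimal sketch, assuming standardized unit-variance data so that the upper bound reduces to (1 + √(*N*/*T*))²:

```python
import numpy as np

def mp_factor_count(r):
    """Count eigenvalues of the sample correlation matrix above the upper
    Marchenko-Pastur bound, lambda_+ = (1 + sqrt(N/T))^2, treating those as
    factors and the rest as noise. A sketch assuming unit-variance data.
    r : T x N matrix of returns.
    """
    T, N = r.shape
    Omega = np.corrcoef(r, rowvar=False)
    eigval = np.linalg.eigvalsh(Omega)
    upper = (1.0 + np.sqrt(N / T)) ** 2
    return int((eigval > upper).sum())
```

On simulated data with a small number of strong factors, only those factors (and occasionally a borderline noise eigenvalue) exceed the bound.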

Horn’s (1965) parallel analysis uses Monte Carlo–simulated random data of the same size (*T* × *N*) as the true dataset to find the random eigenvalue distribution (and its limits) rather than assuming that the Marchenko–Pastur distribution holds. Buja and Eyuboglu (1992) discussed parallel analysis best practices, such as taking a confidence interval (95%) of the random data distribution rather than the mean or median. Buja and Eyuboglu also suggested using random permutations of the real data rather than simulated data because this allows the original distribution to be retained. However, they demonstrated that this makes little difference in practice, even when the real data are far from normally distributed.
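A permutation-based parallel analysis in the spirit of Buja and Eyuboglu can be sketched as follows: each stock's return series is shuffled independently, destroying cross-sectional correlation while preserving each marginal distribution, and the 95th percentile of each eigenvalue rank across simulations serves as the retention threshold. The simulation count and data below are illustrative.

```python
import numpy as np

def parallel_analysis(r, n_sims=200, pct=95, rng=None):
    """Permutation-based parallel analysis (a sketch): retain components
    whose eigenvalues exceed the per-rank `pct` percentile of eigenvalues
    from column-wise permuted copies of the data. r : T x N returns."""
    rng = np.random.default_rng() if rng is None else rng
    T, N = r.shape
    real = np.sort(np.linalg.eigvalsh(np.corrcoef(r, rowvar=False)))[::-1]
    sims = np.empty((n_sims, N))
    for s in range(n_sims):
        # shuffle each column independently to break cross-sectional links
        shuffled = np.column_stack([rng.permutation(r[:, j]) for j in range(N)])
        sims[s] = np.sort(np.linalg.eigvalsh(np.corrcoef(shuffled, rowvar=False)))[::-1]
    threshold = np.percentile(sims, pct, axis=0)  # per-rank percentile
    # retain components until the first eigenvalue that fails the threshold
    k = 0
    while k < N and real[k] > threshold[k]:
        k += 1
    return k
```

Using simulated normal data instead of permutations (Horn's original proposal) only requires replacing the shuffled matrix with `rng.normal(size=(T, N))`.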

### Eigenvalue Divergence Tests

Another avenue of research uses the fact that as *N* increases, the eigenvalues associated with factors will increase, whereas the eigenvalues of the idiosyncratic components will remain bounded. Therefore, factors can be identified by observing the divergence in eigenvalues.

Cattell (1966) proposed a scree plot, whereby eigenvalues are plotted in descending order and the cut-off between factors and idiosyncratic components is determined at the point at which the eigenvalues level off. The scree plot is often criticized for its subjectivity and dependence on the scaling of the axes, which can alter the conclusions if changed.

Assuming the true number of factors is *k*, the difference between any eigenvalue from 1 to *k* and those from *k* + 1 onward should increase toward infinity with *N*, whereas the difference between any pair of eigenvalues beyond *k* + 1 will remain bounded. Kapetanios (2004) proposed the use of subsampling to test this thesis with a finite *N*. The test involves taking the differences between eigenvalues of subsamples of the original data: The difference between subsample eigenvalues 1 to *k*_{max} and *k*_{max} + 1 is observed (using subsamples of various sizes), and *k* is determined where the differences increase in proportion to the subsample size.

Onatski (2005), building on the work of Wachter (1976) and Silverstein and Combettes (1992), used the distribution of the idiosyncratic eigenvalues to quantify the number of factors in the data. Noting that the largest idiosyncratic eigenvalues cluster together around the leveling off point, Onatski proposed a factor estimator that uses the differences between consecutive eigenvalues to find the cutoff. The process is iterative, although Onatski found that typically only one to two iterations are required before convergence. The algorithm is detailed in the Appendix.

Ahn and Horenstein (2013) proposed taking the ratio, rather than the difference, between consecutive eigenvalues to find the cutoff between factors and idiosyncratic components. They proposed two tests, the eigenvalue ratio (ER) and growth ratio (GR), shown in Appendix Equations A10 and A11. They suggested that the number of factors is determined where the ER or GR is highest.
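The ER statistic is simple to sketch: with eigenvalues sorted in descending order, the estimated *k* maximizes λ_{k}/λ_{k+1} over 1 to *k*_{max}. (The GR variant applies the same arg max to ratios of log eigenvalue growth; only ER is shown here.)

```python
import numpy as np

def eigenvalue_ratio_k(eigval, k_max):
    """Ahn-Horenstein eigenvalue-ratio (ER) estimator, sketched: k is the
    arg max over 1..k_max of lambda_k / lambda_{k+1}, with eigenvalues
    taken in descending order."""
    lam = np.sort(np.asarray(eigval, dtype=float))[::-1]
    ratios = lam[:k_max] / lam[1:k_max + 1]
    return int(np.argmax(ratios)) + 1  # +1 converts index to factor count
```

For example, a spiked spectrum with a sharp drop after the third eigenvalue yields *k* = 3.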

## DATA AND NOTATION

Our tests are conducted on the constituents of the FTSE Developed Index, which is an index of the listed companies representing the top 90% (approximately) of developed equity market capitalization, subject to liquidity and investability criteria. Full details of the FTSE Global Equity Index Series (GEIS) country classifications and eligibility criteria are covered in the GEIS ground rules (FTSE 2021).

The analysis period ranges from September 2000 to March 2020; factor tests are conducted at semiannual intervals, using the index constituents of the Monday following the third Friday of March and September (the index rebalance dates) and using returns up to and including the third Friday of the month in question.

Each test is repeated using one-, two-, three-, four-, and five-year data windows of daily total returns in US dollars. FTSE developed constituents with fewer than 180*x* daily returns are removed, where *x* is the number of years in the data window. After this screen, stocks with fewer than 150*x* coincident returns with other stocks are removed.

Exhibit A1 in the Appendix shows which countries are included in the FTSE Developed Index (and hence in our tests) over the analysis period, and Exhibit A2 displays the number of stocks and the number of daily returns used at each test date. The number of assets *N* is greater than the number of observations *T* for all data windows.

In the remainder of this article, the term *factors* is used to mean *latent factors* and is interchangeable with the terms *statistical factors* and *significant components*. Throughout, we assume a system with *N* assets, each of which has *T* observations. Σ is an *N* × *N* asset covariance matrix, and Ω is the corresponding *N* × *N* correlation matrix.

λ_{Σ} is a set of eigenvalues, the subscript denoting whether they are derived from the covariance or correlation matrix. Where no subscript is specified, the eigenvalues of either matrix could be used for the purpose at hand. Eigenvalues discussed in this article are enumerated and selected in descending order. The same ordering applies to the corresponding factors.

Some of the literature on factor retention problems outlined earlier uses a predetermined constant that serves as the upper bound of the number of factors to be retained; we notate this as *k*_{max}. When referring to estimated correlation matrixes, we use the terms *filtered* and *modeled* interchangeably because it is understood that this article is exclusively focused on correlation modeling using latent factor models.

We use the square root of the mean squared estimated-to-realized correlation error (*RMSE*) to compare the accuracy of correlation estimates. This is calculated as

*RMSE* = √( Σ_{*i*<*j*} (Ω_{E,*ij*} − Ω_{R,*ij*})² / (*N*(*N* − 1)/2) )

where Ω_{E} is an estimated correlation matrix, Ω_{R} is the corresponding realized correlation matrix, and the sum runs over the off-diagonal upper triangle. The square root of the mean squared error is in units of correlation and is therefore readily interpreted.
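This error measure over the off-diagonal upper triangle can be sketched directly in NumPy:

```python
import numpy as np

def correlation_rmse(omega_e, omega_r):
    """Root mean squared error between estimated and realized correlation
    matrixes, computed over the off-diagonal upper triangle. The result is
    in units of correlation."""
    iu = np.triu_indices_from(omega_e, k=1)  # strictly above the diagonal
    diff = omega_e[iu] - omega_r[iu]
    return float(np.sqrt(np.mean(diff ** 2)))
```

For a 2 × 2 case where the off-diagonal estimates differ by 0.2, the RMSE is exactly 0.2.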

As noted in the introduction, the sample correlation matrix contains noise that is not relevant out of sample. This means that the RMSE between correlation forecasts and the sample correlations from the following period (the realized correlations) will be biased upward by noise that is only relevant to the realized period. Nevertheless, on average the noise should influence all models similarly and hence not invalidate repeated tests comparing between models—it is the relative levels of RMSE we are concerned with, not the absolute levels. In addition, we are interested in correlation estimation for practical purposes and therefore believe it is fair to test models against future correlations because this is what a user would experience in a real setting.

## DETERMINING THE NUMBER OF FACTORS IN DEVELOPED EQUITY MARKETS

Retaining only the significant components of the sample correlation matrix should remove in-sample noise and provide better ex ante estimations of future stock correlations. As such, the factor combination that produces the lowest errors between estimated and realized correlations can be considered optimal for these purposes.

To construct an estimated correlation matrix, we first conduct a PCA on the sample correlation matrix. This provides an *N* × *N* diagonal matrix of eigenvalues and an *N* × *N* matrix of eigenvectors, in which the *i*th column is the eigenvector corresponding to the *i*th eigenvalue. An *N* × *N* estimated correlation matrix using the largest *k* principal components can then be constructed using the following formula:

Ω_{k} = Λ_{k}λ_{Ω,k}Λ_{k}′ + (*I*_{N} − diag(Λ_{k}λ_{Ω,k}Λ_{k}′))

where Λ_{k} is the *N* × *k* matrix of the eigenvectors associated with the first *k* eigenvalues, λ_{Ω,k} is the diagonal *k* × *k* matrix of the first *k* eigenvalues, *I*_{N} is the *N* × *N* identity matrix, and Ω_{k} is the modeled correlation matrix, with the subscript denoting the number of principal components used. The second term on the right-hand side of the equation ensures the diagonals of the estimated correlation matrix equal 1.
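The construction of Ω_{k} described here, reconstructing from the top *k* components and then resetting the diagonal to 1, can be sketched as:

```python
import numpy as np

def filtered_correlation(omega, k):
    """Build the k-component estimate Omega_k from a sample correlation
    matrix omega: reconstruct Lam_k lam_k Lam_k' from the k largest
    eigenvalues, then reset the diagonal to 1 (a sketch of the article's
    construction)."""
    eigval, eigvec = np.linalg.eigh(omega)
    order = np.argsort(eigval)[::-1]        # descending eigenvalues
    eigval, eigvec = eigval[order], eigvec[:, order]
    Lam_k = eigvec[:, :k]                   # N x k eigenvectors
    lam_k = np.diag(eigval[:k])             # k x k diagonal eigenvalue matrix
    omega_k = Lam_k @ lam_k @ Lam_k.T
    # second term: force unit diagonals while leaving off-diagonals intact
    omega_k += np.diag(1.0 - np.diag(omega_k))
    return omega_k
```

With *k* = *N* the reconstruction is exact and the sample correlation matrix is recovered; smaller *k* discards the low-eigenvalue components treated as noise.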

To find the optimal number of principal components, at each test date we construct a series of *N* – 1 estimated correlation matrixes {Ω_{1}, …, Ω_{N–1}} using between 1 and *N* – 1 principal components. The root mean squared error (RMSE) between each estimated correlation matrix and the realized correlation matrix over the year following the estimation is then measured. If the true number of factors is *k*, then the ex ante correlation matrix calculated with *k* factors should show the lowest RMSE.

As noted earlier, we use the constituents of the FTSE Developed Index to calculate stock level correlation matrixes semiannually in March and September, starting in September 2000 and ending in March 2020. The first set of tests uses a two-year data window of daily returns.

The number of factors resulting in the lowest RMSE varies through time, with the average, minimum, and maximum being 10, 3, and 19, respectively (full details are in Exhibit A3). Exhibit 2 displays the average RMSE for estimated correlation matrixes formed using between 1 and 100 principal components, in which the average is across all the estimates made over the analysis period.

Plotting the RMSE highlights several key observations. First, principal components with small eigenvalues are noise, and removing these components always results in a better estimate of the ex post correlation matrix than the sample correlation matrix provides. Second, the first few factors are essential—most of the error reduction is achieved with the first 5–10 factors. Third, an optimal range exists—there is a strong penalty for undermodeling (fewer than 3 factors), a steadily increasing penalty for overmodeling (using more than the optimal number of components), and an optimal range between these (typically 5–20 factors). Finally, even using the optimal number of factors, there is no significant difference in RMSE between the sample and filtered correlation matrixes. This suggests that although there is an optimal factor model for estimating correlation matrixes, the noise removed by this model does not significantly improve the accuracy of correlation forecasts.

Repeating the tests using one-, three-, four-, and five-year data windows does not produce materially different results; however, the number of principal components resulting in the lowest RMSE does increase monotonically with the data window length because more data allow for more factors to be captured. The average optimal number of factors for one-, two-, three-, four-, and five-year windows is 5, 10, 14, 17, and 20, respectively.

Although the number of factors captured varies fourfold between the one- and five-year data windows, the difference in RMSE is more modest; Exhibit 3 plots the average RMSE of both the sample and optimally estimated correlation matrixes for each data window. The average RMSE decreases as the data window increases, although the absolute difference is small.

Another observation from Exhibit 3 is that the difference in RMSE between the sample and optimally filtered correlation matrixes decreases as the data window increases; this is because the level of noise is naturally reduced by the additional data. We noted earlier that the RMSE improvement of a filtered correlation matrix over the sample matrix is small; even with the shortest data window of one year, this remains the case.

Plotting the RMSE against the number of components used in the estimation typically produces a smooth curve, which suggests that a factor’s predictive power is proportionate to its eigenvalue. However, this is not always the case—occasionally adding a factor with a highly ranked eigenvalue (e.g., factors 4 and 5 in Exhibit 4) will increase the RMSE, despite later factors continuing to reduce it. This likely reflects changes in market fundamentals between the estimated and realized periods. For example, Exhibit 4 shows correlation matrixes constructed using data spanning the Global Financial Crisis (September 2007–September 2009) to estimate correlations during the recovery period (September 2009–September 2010).

This example highlights a trade-off involved in using historic data, between prioritizing recent observations that reflect the current market regime and having sufficient observations to differentiate factors from in-sample noise. To examine this trade-off, we count the number of instances of each of the top five factors being redundant, meaning their inclusion in the estimated correlation matrix resulted in a higher RMSE; the results are displayed in Exhibit 5.

The evidence that longer data windows result in more instances of highly ranked factors having negative predictive power is mixed: For factors 3 and 4, the longer data windows clearly result in more instances of factor redundancy; however, for factors 2 and 5 this is not the case.

In this section, we have presented empirical evidence that the optimal number of factors for estimating developed equity market correlations is, on average, between 10 and 20. The results also show that the number of factors captured increases with the length of the data window, although larger data windows risk capturing factors that are no longer relevant.

The various tests proposed in the literature seek to provide ex ante estimates of the number of principal components in a dataset to retain. In the next section, we compare and evaluate these tests.

## FACTOR RETENTION TESTS IN PRACTICE

This section evaluates the number of equity market factors estimated by the various tests suggested in the literature. Corresponding with the previous section, we use the constituents of the FTSE Developed Index and evaluate the number of factors semiannually in March and September, starting in September 2000 and ending in March 2020. The first set of tests uses a two-year data window of daily returns.

Exhibit 6 shows the average, minimum, and maximum number of factors estimated by

**1.** The modified AIC and BIC tests proposed by Bai, Fujikoshi, and Choi (Equations A1 and A2). These are applied to both the covariance (Σ) and correlation (Ω) matrixes, providing four tests in total.

**2.** The six panel criterion tests proposed by Bai and Ng (Equations A3 to A8): PC_{1}, PC_{2}, PC_{3}, IC_{1}, IC_{2}, and IC_{3}.

**3.** The random matrix theory (RMT) Marchenko–Pastur distribution bounds (Equation A9).

**4.** Parallel analysis, using both normally simulated data and random permutations of the original data (PA normal and PA permutations, respectively). The 95th percentile of the random data distribution is used for the upper bound, and 10,000 simulations were used for each test.

**5.** The δ test proposed by Onatski.

**6.** The ER and GR tests proposed by Ahn and Horenstein (Equations A10 and A11). The ER test is applied to both the covariance (Σ) and correlation (Ω) matrixes.

*k*_{max} is set at 30 for all tests requiring a predetermined maximum possible number of factors. This figure is informed by the observations of the optimal number of factors noted in the previous section but remains arbitrary. The dependence on this arbitrary input highlights one drawback of the panel criteria, Onatski delta, and ER and GR tests.

The BIC, RMT, parallel analysis, and Onatski’s δ tests all select, on average, around 20 factors, whereas Bai and Ng’s information criteria select around 11 factors. The ER, GR, and AIC are clear outliers, selecting, on average, 2, 2, and 200 factors, respectively. Given the observations on the optimal number of factors presented earlier (for the two-year data window case), Bai and Ng’s information criteria appear to be the most accurate tests.

Bai and Ng’s PC_{1}, PC_{2}, and PC_{3} tests prove sensitive to the parameter *k*_{max}. For example, if *k*_{max} is set to 50, the tests result in respective averages of 14, 14, and 18; setting it to 100 gives averages of 22, 21, and 31. This sensitivity to an arbitrarily defined input is a clear drawback, making the IC_{1}, IC_{2}, and IC_{3} tests, which do not require the specification of *k*_{max}, preferable.

Onatski’s delta test is similarly sensitive to *k*_{max}: Setting it to 50 or 100 results in a mean of 26 and 32 factors, respectively. In addition to this sensitivity, our analysis finds that this test does not converge after two iterations, as found by Onatski, but instead oscillates between two points. This results in substantial variability in the identified number of factors between analysis dates (see Exhibit A3).

Ahn and Horenstein’s eigenvalue and growth ratio tests are not sensitive to *k*_{max}, and varying this input does not affect the results. However, as noted earlier, these tests are clear outliers in terms of the number of factors identified.

The parallel analysis tests conducted using randomly simulated normally distributed data and random permutations of the original data provide almost identical results, confirming the observation of Buja and Eyuboglu that the distribution of the data is not significant.

In the previous section, we observed that increasing the length of the data window allows more factors to be captured. Corresponding with this result, BIC, panel criteria, RMT, and parallel analysis all find increasing numbers of factors with longer data windows—this can be observed in Exhibits A4 to A7.

To assess which test provides the best estimate of the number of factors in the underlying dataset, we evaluate them in the same manner as in the previous section: by calculating the square root of the mean squared error between estimated and realized correlation matrixes. To do this, we construct correlation estimates using the number of factors provided by each test at each analysis date. We also calculate the RMSE using the sample correlation matrix as an estimate, which allows the improvement in the RMSE of each test relative to this base case to be assessed. Exhibit 7 shows the average improvement in RMSE, and Exhibit A8 shows the improvement through time. These results use a two-year data window; however, repeating the exercise with different data windows leads to the same conclusions.

We noted earlier that the penalty for underestimating the number of factors is large, whereas the penalty for overestimating (by a few factors) is comparatively small. Consequently, there is little difference in RMSE between the performance of models identifying between 6 and 20 factors during most periods. This is particularly evident in Exhibit A8, as is the outlier status of the ER and GR tests. Exhibit 7 also highlights the observation made earlier, that even if a factor identification test provides an accurate estimate of the true number of factors, the improvement in correlation estimates over the sample correlation matrix is relatively small.

## INTERPRETING LATENT FACTORS

Finding economic interpretations of latent factors is beneficial for several reasons. First, it adds context when quantifying the number of factors; second, it can shed light on the appropriateness of the factors used in observable factor models; and third, by identifying truly independent factors it can aid portfolio allocators in diversifying risks.

As we noted in the introduction, the eigenvector associated with a principal component is equivalent to a vector of factor coefficients or exposures. These coefficients can be used to evaluate the extent to which latent factors are associated with country, industry, and style factors (beta, momentum, quality, size, and value). For example, if a given latent factor’s coefficients are similar to a set of traditional CAPM stock market betas, this would suggest that the latent factor represents the market. The obvious caveat is that correlation does not necessarily imply causation and that the similarity could be due to a common relationship with a different, unidentified factor.

The coefficients of each of the first 15 principal components at each semiannual analysis date are analyzed, and the conclusions presented here reflect observations that were persistent across analysis dates. We examine the relationship between observable characteristics and the eigenvectors of each latent factor from three perspectives:

**1.** To assess the extent to which latent factors are representative of stocks’ geographic or industry characteristics, eigenvectors are aggregated into country and industry groups. If a factor’s coefficients are concentrated in one or more of these groups, the factor is assumed to represent a common geographic or industry risk. Conversely, if the eigenvectors do not show any concentration along geographic or industrial lines, this would suggest these characteristics are inappropriate for explaining stock correlations.

**2.** To measure the similarity between latent factors and the observable factor characteristics that would typically be included in an observable factor model, the correlation between factor sensitivities is calculated. Latent factor sensitivities are calculated by multiplying eigenvectors by realized stock volatility. These latent factor betas are compared with style factor Z-scores and with stock betas to global, regional, and country equity indexes and currencies. The correlation between the unadjusted eigenvectors and style Z-scores is also measured.

**3.** To test how well the combined set of observable factors explains the latent factor coefficients, eigenvectors are regressed onto country and industry dummy variables and style factor Z-scores, as well as separately onto each of the three categories. The *R*^{2} value from these regressions is used to measure goodness of fit.
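Tests 1 and 2 can be sketched in a few lines of code. Everything below (universe size, group counts, random inputs) is an illustrative stand-in for the article’s data, not a reproduction of it:

```python
import numpy as np

rng = np.random.default_rng(0)
n_stocks, n_countries = 500, 20  # illustrative sizes

# Stand-ins for one latent factor's eigenvector, realized stock
# volatilities, a style Z-score, and country group labels.
eigvec = rng.standard_normal(n_stocks)
eigvec /= np.linalg.norm(eigvec)
vols = rng.uniform(0.15, 0.45, n_stocks)
style_z = rng.standard_normal(n_stocks)
countries = rng.integers(0, n_countries, n_stocks)

# Test 1: aggregate absolute loadings by country group; a large
# share in one group suggests a country-oriented factor.
group_weight = np.array([np.abs(eigvec[countries == c]).sum()
                         for c in range(n_countries)])
group_share = group_weight / group_weight.sum()

# Test 2: latent factor sensitivities = eigenvector loadings scaled
# by realized volatility, then correlated with a style Z-score.
latent_beta = eigvec * vols
corr = np.corrcoef(latent_beta, style_z)[0, 1]
```

On real data, a market factor would show `group_share` spread broadly with nearly all loadings positive, whereas a country factor would concentrate in one group.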

We use the FTSE Russell Industry Classification Benchmark (FTSE Russell 2020) industry groups to classify stocks by industry in tests 1 and 3 and the FTSE Russell Global Factor Index Series factor (FTSE 2020) definitions for tests 2 and 3. All three tests employ a data window of two years to determine the set of eigenvectors. Each of the three tests generates the same conclusions: The first latent factor is a market factor, equivalent to the CAPM factor, with subsequent latent factors representing geographic and industry characteristics. There are no identifiable latent factors aligned to the momentum, quality, size, and value style factors. We elaborate on these conclusions below.

The correlation between the first latent factor’s betas and CAPM stock betas with respect to the FTSE Developed Index is typically above 0.95 (see Exhibit 8), and the distribution of this factor’s coefficients is materially different from that of all other factors. It is the only factor for which virtually all stocks have a positive coefficient (see Exhibit A9).

Beyond this market factor, however, the pairwise correlation between latent factor coefficients and style factor Z-scores did not result in any notable discoveries. The highest correlation between value, quality, size, and momentum Z-scores and any of the top 15 latent factors at any semiannual snapshot is 0.36, 0.32, 0.55, and 0.44, respectively. The typical best-in-snapshot correlations are closer to 0.2, and a closer analysis of the anomalous higher correlations shows that even these latent factor exposures are better explained by geographic or industrial groupings. This leads to our second main conclusion: Once the market factor has been removed, the following 10–15 latent factors primarily represent stocks’ geographic and industry characteristics.

Factors 2 to 4 capture broad regional correlations between stocks: The second factor is consistently a long North America/short Japan factor (see Exhibits 8 and 9), the third factor is a developed markets excluding Japan and North America factor (see Exhibits 8 and A10), and the fourth factor is an Asia excluding Japan factor, with coefficients concentrated in the Australia, Hong Kong, South Korea, New Zealand, and Singapore groups (see Exhibit A11).

Following these regional factors, the next 10 or so factors form into country groups, industry groups, or combinations of both. This is the case for 97% of the top 15 principal components from each of the 40 snapshots analyzed. European nations (other than the United Kingdom) do not have identifiable country-oriented factors; however, Australia, Canada, Hong Kong, Israel, New Zealand, Singapore, and South Korea do, and they often have multiple factors at each snapshot. Examples and further details of these factors are given in Exhibits A13–A15 in the Appendix.

This observation suggests that European equity markets are heavily integrated and do not represent distinct country risk factors, whereas Asian markets are less integrated and do represent distinct factors. Exhibit A12 in the Appendix is an equity market country correlation matrix over the 2000–2020 period and confirms this observation: The average correlation among European markets is 0.71, whereas the average correlation among Asian markets is 0.52. Ordering countries by their average off-diagonal correlation (across all countries) illustrates that Japan, South Korea, Israel, Hong Kong, New Zealand, the United States, Singapore, Australia, and Canada are the nine least correlated; all are countries highlighted as having identifiable latent factors.

Identifiable industry-oriented factors are less frequent than country factors and, with the exception of commodity factors, less persistent. As detailed in Exhibits A16–A20, a commodity factor is present at every snapshot analyzed, whereas a cyclicals-versus-defensives factor and technology, financial, and healthcare industry factors are intermittently identifiable in various periods.

These observations are corroborated by the results of test number 3, which regresses eigenvectors onto observable stock characteristics. Exhibit 10 displays the average *R*^{2} of regressions taken over all 40 snapshots. The regressions show that following the market factor, coefficients are best explained by stocks’ geographic and industry characteristics.
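Test 3’s regression can be sketched as follows; the synthetic eigenvector and stock characteristics are illustrative assumptions, with the dummy-variable design mirroring the description above:

```python
import numpy as np

rng = np.random.default_rng(1)
n_stocks, n_countries, n_industries, n_styles = 500, 20, 11, 5

# Illustrative inputs: one eigenvector plus stock characteristics.
eigvec = rng.standard_normal(n_stocks)
country = rng.integers(0, n_countries, n_stocks)
industry = rng.integers(0, n_industries, n_stocks)
style_z = rng.standard_normal((n_stocks, n_styles))

# Design matrix: country dummies, industry dummies, style Z-scores,
# and an intercept column.
X = np.hstack([
    np.eye(n_countries)[country],
    np.eye(n_industries)[industry],
    style_z,
    np.ones((n_stocks, 1)),
])

# OLS fit via least squares; R^2 measures how well the observable
# characteristics explain the latent factor's coefficients.
beta, *_ = np.linalg.lstsq(X, eigvec, rcond=None)
resid = eigvec - X @ beta
r2 = 1.0 - resid.var() / eigvec.var()
```

With random inputs `r2` is near zero; on real eigenvectors the article finds geographic and industry dummies carry most of the explanatory power.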

In addition to adding context, the economic interpretations of latent factors provide further evidence that the number of factors in developed equity markets is between 10 and 20. Factors become increasingly less broad as their eigenvalue decreases, moving from the market to regions to countries and industries, after which only stock-specific effects remain. Eigenvector coefficients typically stop grouping along geographic or industrial lines between factors 15 and 20, although the exact cutoff between principal components representing factors and principal components representing idiosyncratic effects is difficult to define. The *R*^{2} values in Exhibit 10 similarly illustrate that there is no clear cutoff between factors and idiosyncratic effects.

With regard to the factors used in observable factor models, the results here suggest that they could be made more parsimonious. Regional factors are clearly important for modeling correlations, as are North American and Asian country variables. However, there appears to be little justification for including country-specific variables for European nations. For industries, there is clear justification for using a commodity-related industry variable and some more recent evidence of defensive and cyclical variables. Although not persistent throughout the whole study period, there is also evidence for technology and financial variables. Notwithstanding these exceptions, the results suggest that the typical 10–11 industry groups, and certainly the 30–40 sector groupings sometimes used, are superfluous for modeling stock correlations.

Finally, from a portfolio allocation perspective, the results suggest that for a developed market equity portfolio, risks can be diversified by allocating to countries with lower correlations with other countries, particularly those in Asia. We examine this hypothesis in the next section.

## MINIMUM VARIANCE PORTFOLIOS

Under modern portfolio theory (Markowitz 1952), the global minimum variance portfolio (GMVP) lies at the leftmost point of the efficient frontier. This portfolio is not reliant on return estimates and can be constructed using just the covariance matrix. As an additional evaluation of the factor tests outlined in the literature, we employ the estimated covariance matrixes resulting from each test to construct minimum variance portfolios and assess their effectiveness in reducing portfolio variance in a series of backtests.

Our minimum variance portfolios are constructed by minimizing an objective function of the form

$$\min_{w} \; w^{\top} \Sigma\, w$$

subject to the constraints

$$\sum_{i=1}^{N} w_i = 1, \qquad w_i \geq 0,$$

where *w* is the *N* × 1 vector of minimum variance portfolio weights and Σ is the *N* × *N* modeled covariance matrix, which is calculated as

$$\Sigma_{ij} = \sigma_i \sigma_j (\Omega_k)_{ij},$$

where Ω_{k} is the estimated correlation matrix using *k* factors, as defined in Equation 4, and σ_{i} is the realized volatility of stock *i* over the relevant data window. The portfolios are constructed using the constituents of the FTSE Developed Index and rebalanced semiannually in March and September. The simulation period starts in September 2000 and ends in June 2020. In practice, additional constraints, such as limits on active country and industry weights, would be applied. Nonetheless, for our purposes a bare-bones approach is more appropriate because it allows differences in the various models to come to the fore.
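As a sketch of this construction, the following computes the classic closed-form GMVP under the full-investment constraint alone, on synthetic stand-ins for Ω_{k} and the σ_{i}; a long-only version would instead require a quadratic-programming solver. All names and sizes are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50  # illustrative universe size

# Illustrative stand-ins for the inputs: a factor-model correlation
# matrix (Equation 4 in the article) and realized volatilities.
A = rng.standard_normal((n, n + 10))
omega = A @ A.T
d = np.sqrt(np.diag(omega))
omega = omega / np.outer(d, d)          # valid correlation matrix
vols = rng.uniform(0.15, 0.45, n)
sigma = np.outer(vols, vols) * omega    # Sigma_ij = s_i * s_j * rho_ij

# Closed-form GMVP under the budget constraint only:
# w = Sigma^{-1} 1 / (1' Sigma^{-1} 1).
ones = np.ones(n)
x = np.linalg.solve(sigma, ones)
w = x / (ones @ x)

port_var = w @ sigma @ w                # minimized portfolio variance
```

By construction, `port_var` is no greater than the variance of any other fully invested portfolio, such as the equal-weighted one.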

It is worth noting that the focus of this article is on determining suitable parameters for estimating correlations, while the performance of the minimum variance portfolios will depend on ex ante predictions of both correlations and volatility. Volatility modeling is an expansive field with numerous sophisticated methods (e.g., EWMA, ARCH, GARCH) used in practice. We do not expand on these in this article so as not to dilute the focus of the results.

Exhibit 11 displays the portfolios’ annualized return, standard deviation, Sharpe ratio, and beta over the simulation period. The realized volatility of the minimum variance portfolios is more than 48% lower than that of the FTSE Developed Index. Exhibit A21 shows their rolling one-year volatility, which demonstrates that this volatility reduction is consistent. The realized volatility of the minimum variance portfolios constructed using the sample correlation matrix and the various modeled matrixes is essentially the same. This is unsurprising given the earlier observation that filtered correlation matrixes do not provide significantly better correlation forecasts than the sample matrix. It is an open question whether an observable factor model would provide a greater improvement in correlation forecasts, one we hope to answer in future research.

The bottom five rows of Exhibit 11 display the same statistics for portfolios constructed using the random matrix theory tests but using one-, two-, three-, four-, and five-year data windows. There is more variability in the results here, with the realized volatility increasing monotonically with the length of the data window. This result is interesting because our analysis of data windows suggested that longer data windows provided (marginally) better correlation estimates. However, as noted previously, the performance of the minimum variance portfolios depends on both correlation and volatility estimates, and the discrepancy here is likely due to the shorter observation windows providing better volatility forecasts.
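The random matrix theory factor count behind these portfolios can be sketched as follows: eigenvalues of the sample correlation matrix above the upper Marchenko-Pastur bound are attributed to genuine factors. The pure-noise inputs below are illustrative; under the null, essentially no eigenvalue should clear the bound:

```python
import numpy as np

rng = np.random.default_rng(0)
T, N = 520, 200  # observations x assets (illustrative sizes)

# Pure-noise returns: under the null, eigenvalues of the sample
# correlation matrix fall inside the Marchenko-Pastur bounds.
returns = rng.standard_normal((T, N))
corr = np.corrcoef(returns, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)

# RMT factor count: eigenvalues above the upper bound
# (1 + sqrt(N/T))^2 are treated as genuine factors.
lambda_plus = (1 + np.sqrt(N / T)) ** 2
k = int((eigvals > lambda_plus).sum())
```

On real equity returns the market eigenvalue sits far above `lambda_plus`, followed by the regional and country factors discussed earlier.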

In the previous section, we demonstrated that the primary developed equity market factors are the market, regions, and countries and suggested that allocating to countries with lower correlations would diversify risk. The active GMVP country weights confirm this. Exhibit 12 displays the average active country weights of the GMVP constructed using RMT (the other models produce similar results). The portfolio overweights Hong Kong, Singapore, Japan, Israel, Austria, New Zealand, and Canada, all of which (with the exception of Austria) were highlighted earlier as being less correlated markets.

## CONCLUSION

In this article, we examined methods for determining the number of latent factors in a dataset and found that information criteria and random matrix theory provide the best results, despite approaching the problem differently. By observing the effects on estimated-versus-realized correlation errors, we found compelling evidence that there are between 10 and 20 factors in developed equity markets. We also found evidence that longer data windows allow more factors to be captured.

We added context to these results by providing economic interpretations of the latent factors. Specifically, we found that the first factor is the market factor, factors 2–4 are regional factors, and factors 5–20 are primarily geographic factors with a few industry-oriented factors also present. Building on these observations, we demonstrated that a developed equity market minimum variance portfolio can achieve significant volatility reduction by allocating to less correlated countries.

Notwithstanding these results, we found that filtered correlation matrixes provide only a marginal advantage over the sample correlation matrix when estimating correlations. It is interesting to speculate whether alternative methods of dimension reduction, using observable factors or even nonlinear techniques such as neural networks (Gu, Kelly, and Xiu 2019), would provide better results, and we hope to contribute future research in this area.

## APPENDIX

### FORMULAS

Adjusted AIC and BIC criteria (Bai, Fujikoshi, and Choi 2018): Equations A1 and A2.

Information criteria (Bai and Ng 2002):

$$PC_{p1}(k) = V(k) + k\hat{\sigma}^2\,\frac{N+T}{NT}\,\ln\!\left(\frac{NT}{N+T}\right) \quad \text{(A3)}$$

$$PC_{p2}(k) = V(k) + k\hat{\sigma}^2\,\frac{N+T}{NT}\,\ln C_{NT}^2 \quad \text{(A4)}$$

$$PC_{p3}(k) = V(k) + k\hat{\sigma}^2\,\frac{\ln C_{NT}^2}{C_{NT}^2} \quad \text{(A5)}$$

$$IC_{p1}(k) = \ln V(k) + k\,\frac{N+T}{NT}\,\ln\!\left(\frac{NT}{N+T}\right) \quad \text{(A6)}$$

$$IC_{p2}(k) = \ln V(k) + k\,\frac{N+T}{NT}\,\ln C_{NT}^2 \quad \text{(A7)}$$

$$IC_{p3}(k) = \ln V(k) + k\,\frac{\ln C_{NT}^2}{C_{NT}^2} \quad \text{(A8)}$$

where $C_{NT} = \min(\sqrt{N}, \sqrt{T})$, $\hat{\sigma}^2$ is a consistent estimate of the average idiosyncratic variance, and *V*(*k*) is as defined in Equation 1.

Marchenko-Pastur distribution bounds:

$$\lambda_{\pm} = \left(1 \pm \sqrt{N/T}\right)^{2} \quad \text{(A9)}$$

Onatski (2008) algorithm:

Starting at *j* = *k*_{max} + 1, the Onatski algorithm repeats the following steps to convergence:

**1.** Compute β̂, the slope coefficient in the OLS regression of λ_{Ω,*j*}, …, λ_{Ω,*j*+4} on the constants (*j* − 1)^{2/3}, …, (*j* + 3)^{2/3}, and set δ = 2|β̂|.

**2.** Compute *k*(δ) = max{*i* ≤ *k*_{max}: λ_{Ω,*i*} − λ_{Ω,*i*+1} ≥ δ}, or, if λ_{Ω,*i*} − λ_{Ω,*i*+1} < δ for all *i* ≤ *k*_{max}, set *k*(δ) = 0.

**3.** Set *j* = *k*(δ) + 1 and go back to step 1.
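A sketch of this iteration in code (eigenvalues assumed sorted in descending order; the function name and iteration cap are illustrative):

```python
import numpy as np

def onatski_k(eigvals, k_max, max_iter=100):
    """Estimate the number of factors via the Onatski (2008)
    eigenvalue-gap algorithm (sketch of the steps above)."""
    j = k_max + 1
    for _ in range(max_iter):
        # Step 1: OLS slope of eigenvalues lambda_j..lambda_{j+4}
        # on (j-1)^(2/3)..(j+3)^(2/3); threshold delta = 2|beta|.
        y = eigvals[j - 1:j + 4]
        x = np.array([(j - 1 + i) ** (2.0 / 3.0) for i in range(5)])
        beta = np.polyfit(x, y, 1)[0]
        delta = 2.0 * abs(beta)
        # Step 2: largest i <= k_max with eigengap at least delta.
        gaps = eigvals[:k_max] - eigvals[1:k_max + 1]
        above = np.nonzero(gaps >= delta)[0]
        k = int(above.max()) + 1 if above.size else 0
        # Step 3: update j; stop once j reaches a fixed point.
        if k + 1 == j:
            return k
        j = k + 1
    return k
```

On a spectrum with three dominant eigenvalues followed by a slowly decaying noise bulk, the iteration settles on *k* = 3 after a couple of passes.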

ER and GR (Ahn and Horenstein 2013):

$$ER(k) = \frac{\lambda_{k}}{\lambda_{k+1}} \quad \text{(A10)}$$

$$GR(k) = \frac{\ln\left[V(k-1)/V(k)\right]}{\ln\left[V(k)/V(k+1)\right]} \quad \text{(A11)}$$

where λ_{k} is the *k*th largest eigenvalue and *V*(*k*) is the sum of the squared idiosyncratic returns in Equation 1.
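A sketch of the two estimators; here *V*(*k*) is proxied by the sum of eigenvalues beyond the *k*th, which is proportional to the residual variance of a *k*-factor fit, and all inputs are illustrative:

```python
import numpy as np

def er_gr(eigvals, k_max):
    """Eigenvalue Ratio and Growth Ratio estimators (Ahn and
    Horenstein 2013), sketch; eigvals sorted in descending order."""
    lam = np.asarray(eigvals, dtype=float)
    # V(k): sum of eigenvalues beyond the k-th (residual variance proxy).
    V = np.array([lam[k:].sum() for k in range(k_max + 2)])
    # ER(k) = lambda_k / lambda_{k+1} for k = 1..k_max.
    er = lam[:k_max] / lam[1:k_max + 1]
    # GR(k) = ln[V(k-1)/V(k)] / ln[V(k)/V(k+1)].
    gr = (np.log(V[:k_max] / V[1:k_max + 1])
          / np.log(V[1:k_max + 1] / V[2:k_max + 2]))
    # Both estimators pick the k that maximizes the ratio.
    return int(np.argmax(er)) + 1, int(np.argmax(gr)) + 1
```

With three dominant eigenvalues ahead of a noise bulk, both ratios peak at *k* = 3.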

© 2021 Pageant Media Ltd