## Abstract

In this article, the authors run experiments to test if and how human subjects can differentiate time series of actual asset returns from time series that are generated synthetically via various processes, including AR(1). In contrast with previous anecdotal evidence, they find that subjects can distinguish between the two. These results show that temporal charts of asset prices convey to investors information that cannot be reproduced by summary statistics. They also provide a first refutation based on human perception of a strong form of the efficient-market hypothesis. Their experiments are implemented via an online video game (http://arora.ccs.neu.edu). The authors also link the subjects’ performance to statistical properties of the data and investigate whether subjects improve performance while playing.

One of the most important and complex decisions individuals face is how to save and invest. Choices they make not only affect their own quality of life but may have an impact on the economy by creating dependencies on government-sponsored benefits; however, it is well documented that when it comes to investing, individuals are not well positioned to make sound decisions. Several explanations have been proposed in the literature, including overload of information about investment products from which to choose, marketing strategies designed to mislead, behavioral biases, and financial illiteracy (see, for example, Bazerman 2001, Bodie 2007, Choi, Laibson, and Madrian 2011, and the references therein). The problem of inadequate individual investment decisions is especially acute in the case of retirement savings; the recent shift from defined benefit pension plans to privatized 401(k) plans has forced individuals to, in effect, manage their own money. As a result, much debate among policymakers and academics has taken place about improving the quality and presentation of data available to investors. For example, Bazerman (2001) and Kozup, Howlett, and Pagano (2008) called for research on investors’ perceptions of investment products and ways of making the information about those products easy to access and comprehend.

An example of work in this direction is by Hung, Heinberg, and Yoong (2010). The authors evaluated versions of the Department of Labor’s proposed Model Comparative Chart, which provides a standard simplified disclosure format for investment information. They conducted an online experiment in which subjects are asked to allocate $10,000 among different funds based on the funds’ performance disclosure. In one version of the disclosure, past returns are presented as a numerical table. In another version, in addition to the numerical table, the disclosure shows a graphical representation of returns over a 10-year period as a bar chart. Perhaps surprisingly, the authors found that the two disclosures have a statistically significant effect on the retirement investment allocation, although the effect may not be practically significant in terms of investment outcomes.

Together with the prevalence of temporal charts of asset returns in financial media such as Yahoo! Finance and their widespread use by both casual and professional investors, the aforementioned points bring to the forefront a fundamental question: Just what information can human beings extract from charts of financial returns? This question has several ramifications. For example: Are there any patterns in financial asset returns that humans can actually extract by looking at such charts? Is seeing a chart more informative than just having a few parameters like, say, average and variance? Could Yahoo! and numerous other websites that display charts save space by getting rid of them altogether, with no harm to investors? In Hung, Heinberg, and Yoong’s (2010) experiment, is the mere *presence* of some chart biasing the subjects, or are subjects actually gathering information from the contents of the chart?

In this article we report the results of several experiments designed to test if and how human subjects can differentiate time series of actual asset returns from time series that are generated synthetically via various processes. Specifically, we consider time series obtained by permuting at random the samples of actual returns, as well as those arising from first-order autoregressive AR(1) models. Our experiments are implemented via an online video game (http://arora.ccs.neu.edu).

The main finding of this article is that humans can distinguish actual time series from synthetic ones. The results related to random permutations indicate that subjects perceive the temporal order of financial data. The results related to AR(1) indicate that subjects are employing more than just first-order autocovariance to differentiate the two time series. We also link the subjects’ performance to other statistical properties of the data and investigate whether subjects improve their performance while playing. For some contests, our results indicate that subjects do improve. Our findings are in contrast with previous anecdotal evidence. Specifically, it was argued that humans cannot tell price charts from “random,” such as charts generated by a random walk. For example, in an experiment (Malkiel 1973, p. 143), students were asked to generate returns (i.e., price differences) by tossing fair coins, and it was argued that doing so yielded observations that were indistinguishable from market returns to human subjects observing corresponding price charts. For similar arguments in the finance literature, see, for example, Roberts (1959), Kroll, Levy, and Rapoport (1988), De Bondt (1993), Wärneryd (2001), and Swedroe (2005). Such anecdotal evidence has also been collected in the computer science literature. For example, Keogh and Kasetty (2003) asked 12 professors at UCR’s Anderson Graduate School of Management to look at six time series and determine which three series are random walks and which three are real S&P 500 stocks. They found that “the accuracy of the humans was 55.6%, which does not differ significantly from random guessing.”

Our results are also of interest in light of the efficient market hypothesis, according to which “prices fully reflect all available information” and hence must be unforecastable (see, for example, Samuelson 1965 and Fama 1965a, 1965b, 1970). A strong form of this hypothesis presumes asset returns to be independent and identically distributed (see, for example, Fama 1970). In this case, it would be impossible to distinguish actual asset returns from a random permutation of them. But, again, we show that humans can do that.

Note that Lo and MacKinlay (1988, 1999) and Lo, Mamaysky, and Wang (2000) provided compelling evidence that markets are not efficient—that is, price data do possess statistical properties that noticeably deviate from random models. In fact, they showed that autocorrelation is such a property. However, we point out that the data analysis in all of these works is *computer* based, not *human* based. Consequently, the works leave open the question of whether markets look efficient *to human beings*. Our work appears to be the first to provide such an answer.

We note that the idea of testing the ability of human subjects to distinguish random versus real data using graphical representations is not new. Indeed, this has been studied in depth in the information visualization literature (see, for example, Heer, Kong, and Agrawala 2009 and Wickham et al. 2010, and the references therein). However, we are unaware of any previous work in which this idea has been used in a financial setting.

Similarly, we do not view the video game we developed as a main contribution of this article. This game displays data in a fashion similar to commonly used trading platforms, and similar tools are reviewed in the information visualization papers cited above. Instead, implementing the experiment as a video game is intended to make the process fun and engaging for the subjects so that they do not get tired, bored, or frustrated in a way that might affect their behavior. Moreover, the game allows the subjects to make their choices quickly, allowing us to get a large amount of data efficiently, with as little cost to subjects as possible.

## EXPERIMENTAL DESIGN

We developed a simple web-based video game in which subjects are shown two dynamic price series (i.e., moving charts) side by side—both of which display price graphs evolving in real time (a new price is realized roughly each second)—but only one of which is a replay of an actual historical price series. The other series is constructed via a synthetic process (see Exhibits 1 and 2, which are snapshots from our game). Subjects are asked to press a button indicating their selection of the actual price series and are informed immediately whether they were correct or incorrect (Exhibits 3 and 4), after which the next pair of price series is displayed. Because the charts are moving, at any point in time only a certain number of observations are present on the screen for each time series, a subset of the total number of observations subjects see on a moving chart before having to make a guess (both parameters are reported for each dataset later in the article). Subjects do not have to wait until the entire moving chart has been displayed before making their choice; they can guess at any time prior to its completion (an omnipresent counter informs them of the time left). The game is fast-paced: Subjects can observe the charts for 10 to 25 seconds (depending on the dataset) before having to make a guess.

For the actual time series we used eight datasets consisting of returns of commonly traded financial assets. These datasets were arbitrarily named after animals so that users had no knowledge of the specific financial assets used in the experiment. Exhibit 5 summarizes the data used and reports how many charts were shown to each subject, how many data points constitute a chart, and, because charts are moving, how many points of the chart fit onto the screen at any given time. Subjects were given 11 seconds to guess in the Bull contest; 15 in Bear, Elk, and Reindeer; 20 in Lynx and Mandrill; 22 in Seal; and 24 in Beaver. The Dow Jones Corporate Bond Price Index was obtained from the Global Financial Database, and all other data series were obtained from Bloomberg. Several statistics of the datasets are presented later. We use them to shed light on the difference in performance between the random permutation and AR(1) experiments.

Subjects were recruited via Amazon Mechanical Turk.^{1} After registration,^{2} a subject can participate in eight different contests, each consisting of the same game applied to different datasets. Participating in a contest consists of the following task. The subject is shown two dynamic price charts on a computer screen, one above the other (Exhibits 1 and 2). Each graph evolves through time—similar to those appearing in computer trading platforms—plotting the price at that point in time, as well as the trailing prices over a fixed time window of the most recent past. Of the two moving charts, only one corresponds to the sequence of market prices from the actual dataset; we call this graph the *real chart*. The other corresponds to a synthetic sequence of prices. We call this graph the *synthetic chart*. The computer chooses at random which of the two graphs is placed at the top and which at the bottom.

The subject is asked to decide which of the two moving charts is the real one by clicking on it. The game registers the subject’s choice and informs the subject immediately whether the guess is correct or incorrect (Exhibits 3 and 4). For each dataset, the user is shown approximately 35 pairs of moving charts and asked to make as many choices. The subject is also free to refrain from choosing. This happened rarely, and to err on the conservative side, we recorded the absence of a guess as an incorrect choice for that trial. To provide the participants with some incentive for making correct choices, we paid each participant as follows. We counted the number of correct guesses made by the participant minus the number of wrong guesses. Call this difference N. If N was larger than 0, then we paid N dimes.

To evaluate the robustness of our experimental design, we varied parameters of the experiment across datasets, as indicated in Exhibit 5. In addition, we presented subjects with data charts in two different ways. For half of the datasets corresponding to transaction-by-transaction (or tick) data, each subject was shown a fresh set of charts based on a sequence of returns disjoint from the sequences shown to any other subjects. For the other half of the data, corresponding to daily data, the charts shown to each subject were based on the same sequence of returns.^{3}

Finally, for each dataset, subjects were required to train on a disjoint set of data before entering the contest.

## SYNTHETIC PROCESSES AND RESULTS

In this section we describe the various synthetic processes we considered and the corresponding results. In each case we begin with a time series of actual historical prices {*p*_{0}, *p*_{1}, *p*_{2}, *…*, *p*_{T}} and generate from it a synthetic series {*p̃*_{0}, *p̃*_{1}, *…*, *p̃*_{T}}. When displayed during the game, each series is scaled so that its maximum and minimum lie on the borders of the window on the computer screen.

### Random Permutation

Here we want to test the null hypothesis, H, that human subjects cannot distinguish between an actual time series and a time series that is obtained by permuting at random the entries of the actual one. Details follow.

We begin with a time series of actual historical prices {*p*_{0}, *p*_{1}, *p*_{2}, *…*, *p*_{T}} and compute the logarithmic returns

*r*_{t} = log(*p*_{t}/*p*_{t−1}), *t* = 1, *…*, *T*

From this, we construct a randomly generated price series {*p̃*_{0}, *p̃*_{1}, *…*, *p̃*_{T}}, with *p̃*_{0} = *p*_{0}, by cumulating randomly permuted returns:

*p̃*_{k} = *p*_{0} exp(*r*_{π(1)} + *…* + *r*_{π(k)}), *k* = 1, *…*, *T*

where π is a permutation of the set of time indexes {1, *…*, *T*} chosen uniformly at random. A random permutation of the actual returns does not alter the marginal distribution of the returns, but it does destroy the time-series structure of the original series, including any temporal patterns contained in the data. Therefore, the randomly permuted returns will have the same mean, standard deviation, and moments of higher order as the actual return series but will not contain any time-series patterns that may be used for prediction. This construction allows us to test specifically for the ability of human subjects to detect temporal dependencies in financial data.
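For concreteness, the construction can be sketched in a few lines of Python. The price series below is a toy stand-in of our own; in the experiment, an actual historical series takes its place.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy price series standing in for the actual historical prices p_0, ..., p_T.
prices = np.array([100.0, 101.0, 99.5, 100.5, 102.0, 101.5])

# Logarithmic returns r_t = log(p_t / p_{t-1}).
returns = np.diff(np.log(prices))

# Permute the returns uniformly at random.
shuffled = rng.permutation(returns)

# Cumulate the permuted returns from the same starting price p_0.
synthetic = prices[0] * np.exp(np.cumsum(shuffled))

# The marginal distribution is unchanged: same returns, different order.
assert np.allclose(np.sort(shuffled), np.sort(returns))
```

Note that because the permuted returns sum to the same total as the originals, the synthetic path always ends at the same final price as the actual one; only the route between the endpoints differs.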

The results are reported in Exhibit 6. In particular, for each contest we report the *p*-value of the two-sided *t*-test of the null hypothesis, according to which the average number of correct guesses across subjects equals the total number of guesses in the contest divided by 2.^{4} We also report the correct guesses per subject as a percentage of the total guesses. The exhibit shows that the null hypothesis is refuted for all eight datasets: The *p*-value is always less than 1%.
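The test itself is a standard one-sample *t*-test. A hypothetical illustration, assuming a contest with 35 charts and entirely made-up per-subject counts of correct guesses (the real counts underlie Exhibit 6):

```python
import numpy as np
from scipy import stats

# Hypothetical contest: G = 35 charts per subject, and made-up counts of
# correct guesses for ten subjects.
G = 35
correct = np.array([22, 19, 25, 21, 18, 24, 20, 23, 19, 26])

# Two-sided t-test of H0: the mean number of correct guesses equals G / 2,
# i.e., subjects do no better than coin flipping.
t_stat, p_value = stats.ttest_1samp(correct, popmean=G / 2)
```

With these invented scores, the sample mean (21.7) is well above 17.5 and the null is rejected; the actual datasets behave analogously.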

### A Variant

To evaluate the robustness of the results, we also considered the following variant of the process, where returns are simply obtained via price differences:

*r*_{t} = *p*_{t} − *p*_{t−1}, *t* = 1, *…*, *T*

From this, we construct a randomly generated price series {*p̃*_{0}, *p̃*_{1}, *…*, *p̃*_{T}} by cumulating randomly permuted returns:

*p̃*_{k} = *p*_{0} + *r*_{π(1)} + *…* + *r*_{π(k)}, *k* = 1, *…*, *T*

where π is again a permutation of {1, *…*, *T*} chosen uniformly at random.

For this variant, we also changed the recruitment and incentive mechanisms. To recruit subjects, an announcement was emailed to Northeastern computer science students, MIT Sloan MBA students in the Fall section of 15.970, members of the American Association of Individual Investors mailing list, the Market Technicians Association (MTA) mailing list, the MTA Educational Foundation mailing list, and the staff and Twitter followers of Trader Psyches. As an incentive, we offered a $100 Amazon gift certificate to the top scorer in each contest.

Results for this variant are reported in Exhibit 7. The *p*-value is less than 6% for all but one dataset. We attribute the slightly less decisive outcome for this variant to the smaller number of subjects.

### AR(1)

Here we want to test the null hypothesis, H, that human subjects cannot distinguish between an actual time series, *S*, and a time series that is generated by an AR(1) process calibrated to match the mean, variance, and (first-order) autocovariance of *S*. The details follow. We refer the reader to Section 3.4 of Hamilton (1994) for background on AR(1) processes.

Again, we begin with a time series of actual historical prices {*p*_{0}, *p*_{1}, *p*_{2}, *…*, *p*_{T}} and compute the logarithmic returns

*r*_{t} = log(*p*_{t}/*p*_{t−1}), *t* = 1, *…*, *T*

Then we compute the sample mean, μ, variance, ν, and (first-order) autocovariance, α, of the series *r*. This defines an AR(1) process

*y*_{t} = *c* + φ*y*_{t−1} + ε_{t}, *t* = 1, *…*, *T*

where the ε_{t} are independent and identically distributed normal random variables with mean 0 and variance σ^{2}, and where the coefficients are calibrated as follows:

φ = α/ν,  *c* = μ(1 − φ),  σ^{2} = ν(1 − φ^{2})

These choices make the stationary mean, variance, and first-order autocovariance of the process equal to μ, ν, and α, respectively. The starting point, *y*_{0}, of the AR(1) process is taken to be *r*_{h} for an index *h* chosen uniformly at random.

Finally, we set

*p̃*_{k} = *p*_{0} exp(*y*_{1} + *…* + *y*_{k}), *k* = 1, *…*, *T*

The results are reported in Exhibit 8. We obtain a *p*-value less than 0.505% for five of our eight datasets and higher than 0.505% for the other three.
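The calibration and simulation can be sketched as follows. The input series is simulated noise standing in for an actual return series, and all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(2)

# Stand-in for an actual log-return series (simulated here for illustration).
r = rng.normal(0.0005, 0.01, size=5000)

mu = r.mean()                                   # sample mean
nu = r.var()                                    # sample variance
alpha = np.mean((r[1:] - mu) * (r[:-1] - mu))   # first-order autocovariance

# Calibrate y_t = c + phi * y_{t-1} + eps_t so that the stationary mean,
# variance, and lag-1 autocovariance match mu, nu, and alpha.
phi = alpha / nu
c = mu * (1.0 - phi)
sigma = np.sqrt(nu * (1.0 - phi ** 2))

# Simulate the process, starting from a return drawn uniformly at random.
y = np.empty(len(r))
y[0] = rng.choice(r)
for t in range(1, len(r)):
    y[t] = c + phi * y[t - 1] + rng.normal(0.0, sigma)

# Cumulate the simulated returns into a synthetic price path.
p0 = 100.0
synthetic = p0 * np.exp(np.cumsum(y))
```

The calibration works because a stationary AR(1) has mean *c*/(1 − φ), variance σ^{2}/(1 − φ^{2}), and lag-1 autocovariance φ times the variance; substituting the formulas above recovers μ, ν, and α.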

### Comparison of Random Permutation and AR(1) Results

In this section we investigate whether subjects do better when presented with the permutation process than with an AR(1) process. As a first step, in Exhibit 9 we present the results of one-sided, independent-sample *t*-tests for a success rate decline between the random permutation and AR(1) experiments, reported in Exhibits 6 and 8, respectively. For each contest, the null hypothesis is that the average of the success rates, across subjects, is equal for the two experiments. The alternative hypothesis is that the average success rate in the AR(1) experiment is lower than the average success rate in the random permutation experiment. The success rate of a particular subject is defined as the number of correct guesses divided by the number of charts in the contest. For six of eight datasets, the null hypothesis is rejected at least at the 10% level.

To gain insight into differences in performance between the two experiments, in Exhibit 10 we present the first five autocorrelations and the Ljung-Box Q statistic, computed using 20 lagged terms of the actual and synthetic data. The Q statistic tests the null hypothesis that autocorrelations up to lag 20 equal zero—that is, it tests for overall randomness in returns. The first thing to notice is that for the actual data (Panel A of Exhibit 10), we reject the null hypothesis of overall randomness in each of the eight datasets with high confidence (the *p*-values of the Q statistic are 0.000). Under random permutation, we fail to reject the null (Panel B of Exhibit 10). The fact that the overall randomness of the shuffled data differs greatly from that of the actual data helps explain why subjects performed so well under the permutation experiment.

On the other hand, for the AR(1) process, we do reject the null hypothesis of overall randomness (Panel C of Exhibit 10). In fact, for six of the eight contests, the *p*-values of the Q statistic are the same as those of the actual data, up to three decimal places. For the remaining two contests, we reject the null at the 1% level in one case and at the 10% level in the other case. This similarity in overall randomness between the actual and AR(1) data helps explain why the subjects had a somewhat harder time in the AR(1) test. Interestingly, this similarity appears especially pronounced precisely in the three contests in which the subjects had the hardest time distinguishing the AR(1) process from the actual data: Mandrill, Lynx, and Beaver.

### Learning

Because our game provides feedback, we investigate whether subjects improve their performance while playing. We do so by comparing performance in the first and last parts of each contest. Specifically, for each contest, we consider the subsets consisting of the first one-fifth of guesses and that consisting of the last one-fifth.^{5} For each subset, we add up the number of correct guesses across subjects and divide that sum by the total number of guesses in the subset times the number of subjects. We call this the fraction of correct guesses made by the combined pool of subjects. We refer to this fraction in the first (last) part of each contest as *Correct First* (*Correct Last*).

Exhibit 11 reports the results for the permutation and AR(1) processes. For the permutation process (presented in Panel A of the exhibit), the average number of correct guesses increases in all but one contest. To check whether this increase is statistically significant across contests, we conduct a one-sided Wilcoxon signed rank test on the results presented in the exhibit. In particular, we test the null hypothesis that Correct First minus Correct Last comes from a distribution with zero median against the alternative that the median of the Correct First column is less than the median of the Correct Last column.^{6} Exhibit 11 shows that the increase in correct guesses is significant at the 10% level.
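The across-contest test can be sketched as follows, with made-up Correct First / Correct Last fractions standing in for the entries of Exhibit 11 (SciPy's `wilcoxon` with the `alternative` keyword is assumed).

```python
from scipy import stats

# Made-up Correct First / Correct Last fractions for eight contests
# (stand-ins for the values in Exhibit 11, Panel A).
correct_first = [0.55, 0.60, 0.58, 0.62, 0.57, 0.59, 0.61, 0.56]
correct_last = [0.60, 0.63, 0.57, 0.66, 0.62, 0.64, 0.65, 0.61]

# One-sided Wilcoxon signed rank test: H0 is that the paired differences
# have zero median; H1 is that Correct First tends to be smaller.
res = stats.wilcoxon(correct_first, correct_last, alternative="less")
```

The signed rank test is used instead of a *t*-test because, with only eight paired observations, normality of the differences cannot be relied upon.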

Panel B of Exhibit 11 reports the results for the AR(1) process. Across all contests we cannot reject the null hypothesis that the median success rate is the same in the first and last fraction of guesses; however, in some cases, such as Reindeer or Bull, the difference in the average success rates seems significant. Indeed, Exhibit 12 shows that if we conduct a significance test contest by contest (rather than across contests, as in Exhibit 11), we find evidence of learning in three out of eight contests under either random permutation or AR(1). Here, for each subject in a given contest, we take the fraction of correct guesses in the first one-fifth of guesses and the fraction of correct guesses in the last one-fifth of guesses. For each contest, we then report the *p*-value of the one-sided *t*-test of the null hypothesis that, subject by subject, Correct First minus Correct Last comes from a distribution with zero mean, against the alternative that the mean of the subject-by-subject Correct First data is less than that of the corresponding Correct Last data. The exhibit shows that for the permutation process, learning is statistically significant for Mandrill, Bear, and Elk at least at the 10% level, whereas for the AR(1) process, it is significant for Bear, Reindeer, and Bull at least at the 5% level.
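The contest-by-contest version is a one-sided paired *t*-test over subjects. A hypothetical sketch (assuming SciPy ≥ 1.6 for the `alternative` keyword; the subject fractions are invented):

```python
import numpy as np
from scipy import stats

# Made-up per-subject fractions of correct guesses in the first and last
# one-fifth of a single hypothetical contest.
first = np.array([0.43, 0.57, 0.43, 0.57, 0.43, 0.57, 0.43, 0.57, 0.43, 0.43])
last = np.array([0.57, 0.71, 0.57, 0.57, 0.71, 0.57, 0.57, 0.71, 0.57, 0.57])

# One-sided paired t-test: H0 mean(first - last) = 0,
# H1 mean(first) < mean(last), i.e., performance improved.
res = stats.ttest_rel(first, last, alternative="less")
```

With these invented fractions the improvement is significant; in the actual data the analogous test is significant for three of eight contests under each process.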

Recall also that in our experiment subjects are required to practice before entering a contest. This makes the results in this section less likely to be influenced by extraneous factors such as becoming comfortable with the interface.

## CONCLUSION

A natural question that arises is how the subjects were able to perform so well in seven of eight datasets. Casual inspection of Exhibits 1 to 4 shows that distinguishing real data from synthetic data is challenging; for some datasets the real chart tends to be smoother, as in Exhibit 2, whereas for other datasets the opposite is true—the real chart tends to be spikier, as in Exhibit 4. What complicates the matter further is that, as is evident from the data, the smoothness of actual data varies with time. Still, feedback from just a few trials seems sufficient for the user to extract characteristics of the data to be used in classifying charts in the near future. The importance of feedback is supported by the information about winning strategies that some of the subjects volunteered to share with us (anonymously). For example, a subject wrote:

Admittedly, when first viewing the two data sets in the practice mode, it is impossible to tell which one is real, and which one is random, however, there is a pattern that quickly emerges and then the game becomes simple and the human eye can easily pick out the real array (often in under 1 second of time).

For some contests, our results suggest that, indeed, subjects improve while playing.

An interesting future research direction would be to compare humans’ performance against the performance of computers, following a vast literature (cf. Lawrence et al. 2006). In our experiment, the human eye—as opposed to a computer algorithm—may have an advantage. It is well known that computers still struggle with many image-recognition and classification tasks that are trivial for humans. The same may be the case for distinguishing asset returns from synthetic processes.

Given the recent regulatory push toward ensuring that “consumers have the information they need to choose the consumer financial products and services that are best for them,”^{7} the study of optimal ways to present financial data to investors is of current interest. Our article is a contribution to the growing body of literature on the usefulness of temporal charts in evaluation of financial asset performance.

## ACKNOWLEDGMENTS

We would like to thank Zvi Bodie for pointing out to us the work by Hung, Heinberg, and Yoong [2010]; Michael Coen for helpful discussions on a preliminary version of this work, especially on *p*-values; and Jason Chen, François Gourio, and Jawwad Noor for helpful feedback and discussions. We are also grateful to Hamid Jahanjou for his help with Amazon Mechanical Turk. Emanuele Viola was partially supported by a grant from MIT and NSF grants CCF-0845003 and CCF-1319206.

## ENDNOTES

^{1}The advertisement read: “The game you are about to play is part of a research project which studies how humans see random data. The game has been designed to be fun to play: we are going to show you pairs of charts; in each pair, one chart is based on real data (such as price fluctuations) and one is randomly generated. You are required to indicate which one you think is real by clicking on it.”

^{2}In the first iterations, we had subjects answer a short demographic questionnaire that included a question about financial literacy. We found no correlation between their answers and their performance in our experiment, so we discontinued the questionnaire.

^{3}The data, however, were shifted by a random amount for security reasons—that is, to avoid the possibility that two subjects could coordinate their guesses, for example, by simultaneously playing the same charts on two nearby machines.

^{4}We use the same test for other synthetic processes considered later.

^{5}Non-integer numbers are rounded down to the nearest integer.

^{6}We use the signed rank test rather than the *t*-test because of the small sample size.

^{7}The Consumer Financial Protection Bureau: http://www.consumerfinance.gov/protecting-you/.

- © 2019 Pageant Media Ltd