Reference Equity

AI & Stock Selection

  • Writer: Ryan Bunn
  • Feb 19
  • 4 min read

Updated: Feb 24

  • AI improvements, to date, have been driven by larger datasets.
  • High-quality, fundamental investing data is limited.
  • Due to data limitations, AI is unlikely to “solve the markets” anytime soon.

 




While optimism abounds regarding the future potential of AI, progress also helps us understand the limits of the technology, particularly in asset management. To date, improvements in AI performance have largely come from scaling: adding more “parameters” to the models and training them on ever-larger datasets. That training data is finite, and we are running out.


Parameterization & The Data Limit


OpenAI’s GPT-4 reportedly has approximately 1.8 trillion parameters. Those parameters were trained on roughly 13 trillion tokens (snippets of text), which are, in effect, compressed into the model.


The internet is estimated to have 60-160 trillion tokens of data. The value of these tokens is debatable, as much of this data comes from Wikipedia, Reddit, and other (arguably) low-quality sources. High-quality data from Common Crawl, one of the largest sources of public web data, is estimated at 10-15 trillion tokens.[1]


Every book ever written, if digitized, would generate only about 14 trillion tokens, roughly the amount GPT-4 was trained on. Instead of training GPT-4 on 13 trillion internet tokens, it could be trained on books alone. However, the value of the knowledge encoded in romance and self-help books may also be debatable. High-quality book tokens are estimated at 1-2 trillion.[1]


In total, there are roughly 16-22 trillion tokens of high-quality data in existence. So today, at the very beginning of the age of AI, we are already nearly out of data for training general AI models. We should expect continued improvements in AI performance as we refine our training processes and learn to use the models better, but the strategy of making an AI smarter simply by dumping in more data is likely coming to an end.
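A quick back-of-envelope sketch shows why. The figures below are simply the rough estimates quoted above, not independent measurements:

```python
# Back-of-envelope sketch of the data limit, using the rough estimates
# quoted in this post (all values in trillions of tokens).

TOTAL_HIGH_QUALITY = (16, 22)   # estimated high-quality tokens in existence
GPT4_TRAINING = 13              # tokens GPT-4 was reportedly trained on

for label, total in zip(("low estimate", "high estimate"), TOTAL_HIGH_QUALITY):
    share_used = GPT4_TRAINING / total
    print(f"{label}: one GPT-4-scale training run uses ~{share_used:.0%} "
          f"of all high-quality data in existence")
```

On these numbers, a single GPT-4-scale training run already consumes somewhere between roughly 60% and 80% of the estimated high-quality corpus.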


A Focused Approach


Agentic AI is the current approach to putting AI to work in the real world. General models provide a helpful search function, but they often lack access to the data we need to make decisions. Training models on specific data subsets offers a clear solution to this problem.


AI agents trained on specific data (such as company invoices or HR manuals) will provide better answers to context-dependent questions. This is the clearest path for AI to replace human labor, particularly for routine, repetitive tasks such as processing payroll, answering benefits questions, or following up on late customer payments.


Simple tasks often need only a modest amount of data for an AI to step in. For example, combining an insider trading policy, SEC guidelines, and precedent legal rulings could be enough to create an AI agent that conducts training sessions or answers one-off questions for employees, as sketched below.
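As a loose illustration of what such a narrowly scoped agent could look like, the sketch below indexes a tiny placeholder policy corpus and retrieves the passage most relevant to an employee question. The documents, the question, and the retrieval approach (simple TF-IDF similarity) are illustrative assumptions; a production agent would hand the retrieved text to a language model to draft the actual answer.

```python
# Minimal sketch of a narrowly scoped "policy agent" (illustrative only):
# index a small corpus and retrieve the passage most relevant to a question.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Placeholder documents standing in for the policy, SEC guidance, and precedents.
corpus = [
    "Company policy: employees may not trade while in possession of material "
    "non-public information and must observe quarterly blackout windows.",
    "SEC guidance: information is material if a reasonable investor would "
    "consider it important to an investment decision.",
    "Precedent summary: tipping friends or family in exchange for a personal "
    "benefit can create liability even if you never trade yourself.",
]

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(corpus)

def retrieve(question: str) -> str:
    """Return the corpus passage most similar to the question."""
    q_vec = vectorizer.transform([question])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    return corpus[scores.argmax()]

# A fuller agent would pass this passage to a language model for a complete answer.
print(retrieve("Can I buy shares the day before our earnings announcement?"))
```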


The Limits of AI


More data is required, however, to train an AI to evaluate complex scenarios. Asking our theoretical trading-policy agent to address a gray area in insider trading, such as evaluating the materiality of a piece of non-public information, would require much more training. Even with such an agent, building trust in its answers would require substantial testing and usage.


Given the dual hurdles of training-data requirements and evaluation and trust-building, what does the future look like for AI stock picking?

 

The Lack of Data in Financial Markets


Financial markets are awash in data. Billions of trades are executed daily, and securities are priced every second of every trading day. SEC filings for every public company are available every quarter. Detailed data going back 50-100 years is digitized and ready for AI training.


However, for the purposes of long-term investing, this data is low quality. Like other quantitative approaches, AI may be able to analyze it to identify inefficiencies between trades and ticks, but it provides limited insight for long-term, fundamental investing.


Rich Data


Only 43% of businesses have outperformed bonds over their lifetimes.[2] This figure comes from the CRSP database of roughly 25,000 securities, so a dataset of outperforming businesses includes only about 11,000 companies to analyze. While we could create billions or trillions of “tokens” of data on this set of businesses, is that enough to give an AI differentiated stock-picking ability?


Importantly, the marginal value of data declines as a dataset grows. Only 4% of businesses are responsible for all net wealth creation, which shrinks our high-quality dataset to roughly 1,000 businesses. While AI will certainly be able to produce compelling investment pitches, an AI stock picker is more likely to produce the average returns of its human counterparts than achieve an “AGI” level of super-human stock picking. The arithmetic of this shrinking dataset is written out below.
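For a sense of scale, here is the funnel written out directly; the universe size and percentages are simply the figures cited above.

```python
# The shrinking dataset: start from the ~25,000 CRSP securities and apply
# the percentages cited in this post.

UNIVERSE = 25_000              # securities in the CRSP database
BEAT_BONDS = 0.43              # share that outperformed bonds over their lifetimes
NET_WEALTH_CREATORS = 0.04     # share responsible for all net wealth creation

outperformers = round(UNIVERSE * BEAT_BONDS)             # ~10,750 companies
wealth_creators = round(UNIVERSE * NET_WEALTH_CREATORS)  # ~1,000 companies

print(f"Universe: {UNIVERSE:,} securities")
print(f"Outperformed bonds: {outperformers:,} companies")
print(f"Drove all net wealth creation: {wealth_creators:,} companies")
```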


Fundamental Investing


A general model, with trillions of parameters trained on trillions of tokens, will have more data “in its head” than any active manager. This general model will be able to converse about human biases, market reflexivity, investment crises, and so on. But when drilling down to long-term, fundamental stock picking, the well runs dry.


The most successful fundamental investors move the investment world forward using heuristics and empirical evidence that falls short of statistical significance. Buffett’s lessons are largely qualitative. The empirical case for active value investing, although well articulated in Buffett’s “Superinvestors of Graham-and-Doddsville” essay, does not rest on a statistically significant sample (is Buffett just the luckiest monkey?).


Ultimately, there is not enough data to conclusively prove the validity of active management, value investing, or any specific investment philosophy. In the face of limited data, AI hallucinates, and a single hallucination destroys the foundation of trust on which active management must be built.


Computational Irreducibility


I expect AI to exceed my wildest expectations. I certainly did not envision the internet and the changes it has brought to society. But I can say that the internet did not change fundamental investing or invalidate the tenets of value investing. In fact, the tech bubble is now a key point of support for value investing.


Fundamental investing, one of the most competitive endeavors in the world, may ultimately be computationally irreducible: unable to be solved by plugging variables into a computer, no matter how powerful the computer or the AI model running on it.

  

[1] Sourced from Grok 2... so who knows if it’s right!

