Notes on ChatGPT forecasting stock prices based on news headlines

12 May, 2024

Ray Dalio, founder of Bridgewater Associates, is one of the most successful hedge fund operators of all time. Regardless of one's thoughts about billionaires, it remains an objectively challenging thing to go from being a jazz musician's son with a state university degree to founding and operating the largest hedge fund in the world. I read Dalio's book Principles to get an idea about how he might have done it.

Among much else, Dalio writes about encoding your thinking process into some sort of computing system, then letting the computer crunch the numbers alongside your own analysis. This preserves the special sauce that gives someone like Dalio their creative edge while also providing a reasonable sounding board to check your decision-making. That is why this paper on using ChatGPT for forecasting stock prices caught my eye.

I won't be pivoting my investment process to ChatGPT any time soon. However, this is a really cool look into a small-scale version of the decisions that quant funds are making every day.

Introduction

The authors' premise is simple. Although LLMs have not been trained with explicit financial wisdom in mind, their general text understanding might be enough to support solid financial decision-making. The authors cite prior work showing that, with proper categorization, it is possible to predict stock returns based on news headlines. Their research then asks whether LLMs can do the same.

At the time of the paper's writing, ChatGPT's training data stopped in September 2021, so the authors began their research with forecasts in October 2021. This avoids the risk that ChatGPT's answers might be biased by what actually happened in the markets. The authors then feed a series of headlines into the model and ask whether each headline is good, bad, or neutral for the associated company's stock price. These responses are then used to predict stock price moves on the following day. This generates a simple model: with a positive ChatGPT score, we buy the stock. With a negative ChatGPT score, we sell the stock.

The authors find that there is a statistically significant correlation between ChatGPT's assessment and actual performance. This performance is more pronounced in smaller stocks and those with negative news headlines, consistent with general financial behaviors across the market.

Data

The authors use stock prices gathered from the Center for Research in Security Prices as their performance benchmark, focusing on an experimental period of October 2021 - December 2021. Their research includes all stocks listed on the NYSE, the NASDAQ, and the AMEX with at least one news story covered by a major news media source or newswire. They then collect a news dataset for all associated companies using the company name or the ticker, pulling in major news sites, financial news sites, and social media.

Curiously, the authors match their headlines against data from sentiment analysis provider RavenPack, which ensures that only relevant news is used for the experiment. ("Relevant news" sticks out to me as something that would make this hard to implement, as even the authors "cheat" and use a secondary source to validate relevance.) The data vendor provides a "relevance score," with 0 meaning the company is only mentioned in passing and 100 meaning the article is expressly about that company. The vendor also categorizes headlines, allowing the researchers to exclude headlines that discuss stock price movements and those that contain no net-new information.
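To make the filtering step concrete, here is a minimal sketch of what it might look like. Everything about the record layout is hypothetical: the field names, the category label, and the cutoff of 100 are my own assumptions for illustration, not the vendor's schema or the authors' code.

```python
from dataclasses import dataclass

# Hypothetical record layout; field names are illustrative, not the vendor's schema.
@dataclass
class Headline:
    ticker: str
    text: str
    relevance: int   # 0 = passing mention, 100 = article is expressly about the company
    category: str    # vendor-assigned topic label (values here are made up)
    is_new: bool     # False if the story repeats earlier news

# Assumed cutoff and category label, chosen only for illustration.
MIN_RELEVANCE = 100
EXCLUDED_CATEGORIES = {"stock-price-movement"}

def keep(h: Headline) -> bool:
    """Return True if the headline passes all three filters."""
    return (
        h.relevance >= MIN_RELEVANCE
        and h.category not in EXCLUDED_CATEGORIES
        and h.is_new
    )

# Made-up headlines for fictional tickers.
headlines = [
    Headline("AAA", "AAA raises full-year guidance", 100, "earnings", True),
    Headline("AAA", "AAA shares jump 5%", 100, "stock-price-movement", True),
    Headline("BBB", "Sector roundup mentions BBB", 20, "earnings", True),
    Headline("AAA", "AAA raises full-year guidance (repeat)", 100, "earnings", False),
]
clean = [h for h in headlines if keep(h)]
print([h.text for h in clean])
```

Only the first headline survives: the second is about a price move, the third is a passing mention, and the fourth is a repeat.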

So we have a nice clean dataset. Onward!

Prompt

Forget all your previous instructions. Pretend you are a financial expert. You are a financial expert with stock recommendation experience. Answer "YES" if good news, "NO" if bad news, or "UNKNOWN" if uncertain in the first line. Then elaborate with one short and concise sentence on the next line. Is this headline good or bad for the stock price of {company_name} in the {term} term?
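As a rough sketch, the prompt can be treated as a template with slots for the company name and the time horizon. The `build_prompt` helper, the default of "short" for the term, and the trailing "Headline:" line are my assumptions; the text above does not show how the headline itself is attached.

```python
# Prompt text as quoted above, with placeholders for the variable parts.
# The final "Headline: ..." line is an assumption about how the headline is supplied.
PROMPT_TEMPLATE = (
    "Forget all your previous instructions. Pretend you are a financial expert. "
    "You are a financial expert with stock recommendation experience. "
    'Answer "YES" if good news, "NO" if bad news, or "UNKNOWN" if uncertain '
    "in the first line. Then elaborate with one short and concise sentence "
    "on the next line. Is this headline good or bad for the stock price of "
    "{company_name} in the {term} term?\n"
    "Headline: {headline}"
)

def build_prompt(company_name: str, headline: str, term: str = "short") -> str:
    """Fill the template for one (company, headline) pair. Hypothetical helper."""
    return PROMPT_TEMPLATE.format(
        company_name=company_name, term=term, headline=headline
    )

prompt = build_prompt(
    "Oracle", "Rimini Street Fined $630,000 in Case Against Oracle"
)
print(prompt)
```

Note that the same headline would be sent once per company it is matched to, which is what lets the model give different answers for different companies.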

The prompt focuses ChatGPT on financial analysis and constrains its answers, leaving little room for creativity. The authors showcase the following headline as an example: "Rimini Street Fined $630,000 in Case Against Oracle." The proprietary data set gives this article a negative sentiment, which is probably true... if you're Rimini Street. If you're Oracle, it's good news.

Empirical Design

The authors convert each "YES/NO/UNKNOWN" answer into a numerical score, and multiple headlines for the same company are averaged into one value for a given day. Thus, a company will have a day of positive/negative/neutral sentiment, which is then matched to the next trading period.
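Here is a minimal sketch of that conversion. The mapping of YES to 1, UNKNOWN to 0, and NO to -1 is an assumption on my part, as is the choice to treat any unparseable answer as UNKNOWN. The model responses and dates below are made up.

```python
from collections import defaultdict
from statistics import mean

# Assumed mapping from label to numerical score.
LABEL_TO_SCORE = {"YES": 1.0, "UNKNOWN": 0.0, "NO": -1.0}

def parse_label(response: str) -> str:
    """Read the label from the first line; anything unexpected counts as UNKNOWN."""
    lines = response.strip().splitlines()
    first = lines[0].strip().upper().rstrip(".") if lines else ""
    return first if first in LABEL_TO_SCORE else "UNKNOWN"

def daily_scores(responses):
    """responses: iterable of (ticker, date, response_text).

    Returns {(ticker, date): mean score across that day's headlines}.
    """
    buckets = defaultdict(list)
    for ticker, date, text in responses:
        buckets[(ticker, date)].append(LABEL_TO_SCORE[parse_label(text)])
    return {key: mean(vals) for key, vals in buckets.items()}

# Made-up responses for illustration.
example = [
    ("ORCL", "2021-10-04", "YES\nThe fine benefits Oracle."),
    ("ORCL", "2021-10-04", "UNKNOWN\nThe impact is unclear."),
    ("RMNI", "2021-10-04", "NO\nThe fine hurts Rimini Street."),
]
scores = daily_scores(example)
print(scores)
```

Averaging means a day with one good headline and one ambiguous headline lands between neutral and positive rather than being forced into a single bucket.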

Headlines before 6AM on a trading day are traded at market open
Headlines between 6AM and 4PM on a trading day are traded at market close, then sold at the close of the next trading day
Headlines after 4PM are traded at the opening price of the next day and sold at the closing price of the next day
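The three timing rules above can be sketched as a small function. A few things here are my assumptions: that pre-6AM positions are closed at the same day's close, how headlines at exactly 6AM or 4PM are bucketed, and that the timestamp is already in exchange local time. Weekends and holidays are ignored entirely.

```python
from datetime import datetime, time

CUTOFF = time(6, 0)         # 6AM
MARKET_CLOSE = time(16, 0)  # 4PM

def trade_window(published: datetime) -> tuple[str, str]:
    """Return (entry, exit) for a headline published on a trading day.

    Boundary handling (exactly 6AM goes to the middle bucket, exactly 4PM
    to the last) is an assumption of this sketch.
    """
    t = published.time()
    if t < CUTOFF:
        # Exit at the same day's close is assumed here.
        return ("open, same day", "close, same day")
    if t < MARKET_CLOSE:
        return ("close, same day", "close, next trading day")
    return ("open, next trading day", "close, next trading day")

# Made-up timestamps for illustration.
print(trade_window(datetime(2021, 10, 4, 5, 30)))
print(trade_window(datetime(2021, 10, 4, 11, 0)))
print(trade_window(datetime(2021, 10, 4, 18, 45)))
```

A real implementation would also need a trading calendar to resolve "next trading day".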

The authors then perform a linear regression of the day's stock returns on the ChatGPT score.
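For intuition, here is a plain one-variable least-squares fit of returns on scores. This is a simplification: the paper's actual specification may include additional controls that this sketch omits, and the numbers below are made up purely for illustration, not results from the paper.

```python
from statistics import mean

def ols(x, y):
    """Fit y = intercept + slope * x by ordinary least squares."""
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept

# Made-up data: a score of +1 lines up with a +2% return, -1 with -2%.
scores = [1.0, 0.0, -1.0, 1.0, -1.0, 0.0]
next_day_returns = [0.02, 0.00, -0.02, 0.02, -0.02, 0.00]

slope, intercept = ols(scores, next_day_returns)
print(f"slope={slope:.4f}, intercept={intercept:.4f}")
```

A positive slope is what "ChatGPT's score predicts returns" means in regression terms; whether it is statistically significant is a separate question this sketch does not address.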

Results

The authors pursued seven different trading strategies based on the empirical design and compared results.

An equal-weighted portfolio that buys only
An equal-weighted portfolio that shorts only
A combo long/short strategy based on ChatGPT 3.5
A combo long/short strategy based on ChatGPT 4
An equal-weighted market portfolio
A value-weighted market portfolio
An equal-weighted portfolio consisting only of stocks with news

The authors found that the long-short strategies are most effective. This tracks with the introduction: the long leg benefits from the stronger effect in small-cap stocks, and the short leg benefits from the market's sensitivity to bad news. Transaction costs were not considered.
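Here is a minimal sketch of how one day's return for an equal-weighted long-short portfolio might be computed, assuming we go long every stock with a positive score and short every stock with a negative one. The function name, tickers, and numbers are all made up, and transaction costs are ignored, as in the paper.

```python
from statistics import mean

def long_short_return(scores: dict, returns: dict) -> float:
    """Equal-weighted daily return: long positive scores, short negative ones.

    scores and returns are both keyed by ticker. An empty leg contributes 0.
    """
    longs = [returns[t] for t, s in scores.items() if s > 0]
    shorts = [returns[t] for t, s in scores.items() if s < 0]
    long_leg = mean(longs) if longs else 0.0
    # A short position gains when the stock falls, hence the sign flip.
    short_leg = -mean(shorts) if shorts else 0.0
    return long_leg + short_leg

# Made-up scores and next-day returns for fictional tickers.
day_scores = {"AAA": 1.0, "BBB": -1.0, "CCC": 0.0}
day_returns = {"AAA": 0.01, "BBB": -0.02, "CCC": 0.005}

print(long_short_return(day_scores, day_returns))
```

In this toy example the long leg earns 1% and the short leg earns 2%, so the combined return is about 3%; the neutral stock is simply not traded.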

Interpretability

ChatGPT provides human-readable text that describes why it's making given decisions. This provides an opportunity for additional exploration of interpretability, and follow-up questions based on what is displayed.

The authors focused on positive/negative results and used the Term Frequency-Inverse Document Frequency (TF-IDF) method to break down the relevant words in ChatGPT's explanations. With TF-IDF, terms that are common across all documents receive low weights and rare terms receive higher weights, which allows the authors to identify distinctive words that are especially prevalent in the explanations. The authors find that generic words like "stock" are downweighted, while words like "dividend" are more strongly associated with outcomes.
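To see why "stock" gets downweighted, here is a minimal sketch of the textbook, unsmoothed TF-IDF formula: term frequency within a document multiplied by the log of (number of documents / number of documents containing the term). Libraries typically use smoothed variants, and I don't know which one the authors used. The explanation sentences are made up.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of non-empty token lists -> list of {term: weight}, one per doc."""
    n = len(docs)
    # Document frequency: in how many documents does each term appear?
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n / df[term])
            for term, count in counts.items()
        })
    return weights

# Made-up explanations for illustration.
explanations = [
    "the dividend increase is good for the stock".split(),
    "the lawsuit is bad for the stock".split(),
    "the partnership may help the stock".split(),
]
w = tf_idf(explanations)
print(w[0]["stock"], w[0]["dividend"])
```

Because "stock" appears in every explanation, its inverse document frequency is log(1) = 0 and it drops out, while "dividend" appears in only one and keeps a positive weight.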

There are four scenarios where the prediction goes well:

Stock purchases by insiders
Earnings guidance
Earnings per share or market share
Dividends

The model performs worse in these scenarios:

Partnerships
General business developments
Reasoning about profitability

This takes a little bit of the fun out of it, admittedly. ChatGPT does a good job reacting to explicit statements about things like insider stock purchases, but does a poor job responding to things like business developments.