- The ML problem
- Supervised Regression Learning
- How it works with stock data
- Backtesting
The ML problem
In most cases, we use machine learning algorithms to build a model. A model is essentially a function: it receives an input X and produces an output Y.
Typically, X is a measurement or set of measurements that we have taken or observed. The model processes X and produces the output Y, which is usually a prediction about the world. For example, X might be the current price of a stock, and Y might be the future price of that stock.
The machine learning process involves running historical data through a machine-learning algorithm to derive a model based on that data. Later, when we need to use the model, we pass in new examples and receive fresh predictions.
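This train-then-predict workflow can be sketched in a few lines. The data here is invented, and a straight-line fit stands in for whatever learning algorithm we actually choose:

```python
import numpy as np

# Hypothetical historical data: each X is a measurement (e.g. a price
# feature), each Y is the value we want to predict (e.g. a future price).
X_hist = np.array([10.0, 11.0, 12.0, 13.0, 14.0])
Y_hist = np.array([10.5, 11.4, 12.6, 13.5, 14.4])

# "Run the historical data through an algorithm to derive a model."
# A simple line fit stands in for the learning algorithm here.
slope, intercept = np.polyfit(X_hist, Y_hist, 1)

def model(x):
    """The derived model: takes an input X, returns a prediction Y."""
    return slope * x + intercept

# Later: pass in a new example, receive a fresh prediction.
fresh_prediction = model(15.0)
```

The key point is the shape of the process, not the particular algorithm: training produces a function, and using the model is just calling that function on new inputs.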
What’s X and Y quiz
Let’s think about building a model to use in trading. Which of the following factors might be input values (X) to the model, and which might be output values (Y)?
Since we often use models to predict values in the future, both future price and future return make sense as output values.
Supervised Regression Learning
When we talk about regression, we are talking about making a numerical approximation or prediction. For example, predicting stock prices is a regression problem. Regression learning sits in contrast to classification learning, which involves classifying an object into one of several types.
In supervised learning, we present the algorithm with the correct answers during the training period. In other words, we show the machine an X value and its corresponding Y value. After seeing a sufficient number of input/output pairs, the algorithm is ready to predict Y values for new, previously unseen X values.
Finally, when we talk about learning, we are talking about training with data. In this class, for example, we are often taking historical stock data and training the system to predict the future price of the stock. We “teach” the algorithm to make new predictions by showing it relevant data from the past.
Many different algorithms perform supervised regression learning; we discuss four briefly here: linear regression, K-nearest neighbors, decision trees, and decision forests.
Linear regression involves finding the coefficients of a line that best fits the data. The coefficients are the parameters of the model, and linear regression is known as a type of parametric learning.
K-nearest neighbors (KNN) is another popular approach that predicts Y for an incoming X by examining the K stored (X, Y) pairs whose X values are closest, typically averaging their Y values. Since this algorithm compares incoming data to instances of already-seen data points, this approach is known as instance-based learning.
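A minimal KNN regression sketch, on invented one-dimensional data, might look like this. Note that the "model" is just the stored training data plus a query procedure:

```python
import numpy as np

def knn_predict(x_query, X_train, Y_train, k=3):
    """Predict Y for x_query as the mean Y of the k nearest training Xs."""
    distances = np.abs(X_train - x_query)   # 1-D feature for simplicity
    nearest = np.argsort(distances)[:k]     # indices of the k closest points
    return Y_train[nearest].mean()

# Instance-based learning: training is simply storing the data.
X_train = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y_train = np.array([2.0, 4.0, 6.0, 8.0, 10.0])

knn_predict(3.1, X_train, Y_train, k=3)  # averages Y at X = 2, 3, 4
```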
Decision trees route incoming X values down individual branches based on evaluations of factors of X that occur at each node. The regression values are stored in the leaves of the tree, and the output of a decision tree for a given X is the value stored in the leaf where that X ends up.
Decision forests are composed of many decision trees queried in turn and combined in some way to provide an overall result.
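To make the tree and forest structure concrete, here is a toy sketch with two hand-built trees (the split thresholds and leaf values are invented, not learned) and a forest that averages their outputs:

```python
# A hand-built tree: each node tests a factor of X; leaves store values.
def tree_a(x):
    if x[0] < 0.5:                        # node: test feature 0
        return 1.0                        # leaf
    return 3.0 if x[1] < 2.0 else 5.0    # node on feature 1, two leaves

def tree_b(x):
    return 2.0 if x[0] < 1.0 else 4.0

def forest_predict(x, trees):
    """Query each tree in turn and average the results."""
    return sum(t(x) for t in trees) / len(trees)

forest_predict([0.7, 1.0], [tree_a, tree_b])  # (3.0 + 2.0) / 2 = 2.5
```

A learning algorithm would construct the split tests and leaf values from training data; the querying and combining shown here stay the same.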
How it works with stock data
Let’s consider how we can generate the type of data we can feed into a machine learning algorithm in order to build a model.
Assume we have a pandas DataFrame containing historical features of a particular stock, arranged in the usual way: one row for each date with columns representing each metric.
We might have many features for each stock, such as Bollinger Bands, momentum, and P/E ratio. We can represent each feature in a DataFrame, and then “stack” the DataFrames one behind the other.
These features make up the input values - the X - for the model that the machine learning algorithm synthesizes. In most cases, we want to use our model to produce a future price - the Y - given this historical feature data as input.
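Stacking the feature DataFrames "one behind the other" amounts to aligning them on dates so that each row becomes one feature vector. A small sketch with invented values for a hypothetical ticker XYZ:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=5)

# Hypothetical per-feature DataFrames, one row per date (values invented).
momentum = pd.DataFrame({"XYZ": [0.01, 0.02, -0.01, 0.03, 0.00]}, index=dates)
bbp = pd.DataFrame({"XYZ": [0.4, 0.6, 0.5, 0.9, 0.7]}, index=dates)

# Combine the features side by side to form the input matrix X:
# one row per date, one column per feature.
X = pd.concat([momentum["XYZ"], bbp["XYZ"]], axis=1, keys=["momentum", "bbp"])
```

Each row of `X` is then the feature vector for one date, ready to pair with a label Y.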
Let’s see how we can generate these examples starting from the first date d for which we have data. We can look at the stock features for d, and the price at some future date, such as d + 5. Note that even though the price is historical, it’s in the future relative to d.
This pairing between features X at d and future price Y at d + 5 comprises our first training example. We can step forward day-by-day to generate subsequent examples mapping X at d + n to Y at d + 5 + n. Once this process is complete, we will have a large set of examples that we can use to build our model.
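In pandas, this pairing falls out of a negative `shift`: the label for date d is the price 5 rows later. The single feature and the price series below are invented for illustration:

```python
import pandas as pd

dates = pd.date_range("2024-01-01", periods=10)
prices = pd.Series(range(100, 110), index=dates, dtype=float)

# Hypothetical single feature; a real X would have many columns.
features = pd.DataFrame({"momentum": prices.pct_change(2)}, index=dates)

# Y at date d is the price 5 days after d: shift the prices backward.
horizon = 5
Y = prices.shift(-horizon)

# Keep only dates where both X and Y exist to form (X, Y) examples.
examples = features.join(Y.rename("future_price")).dropna()
```

The `dropna` trims the edges: early dates lack enough history to compute the feature, and the last `horizon` dates have no future price yet.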
Backtesting
Once we have our model, we might start to get curious about the accuracy of the forecasts it provides and whether we can act on them.
We can evaluate the performance of our model through backtesting. Backtesting is the process of applying a trading strategy or analytical method to historical data to see how accurately the strategy or method would have predicted actual results.
First, we limit our data to some subset and train a model on just that data. Next, we ask the model for a forecast of some time in the simulated future. Third, we place orders on the basis that the forecast is accurate, going short or long as appropriate.
We can then roll time forward and see how the performance of our portfolio - measured by Sharpe ratio, daily return, or something similar - changes as the forecast does or does not come to fruition.
Finally, we repeat this process. We train a model based on a more recent subset of the data, make a prediction, and plot performance based on orders made in line with the prediction.
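The repeated train/forecast/trade/roll-forward loop can be sketched as follows. Everything here is invented for illustration: the price series is synthetic, a line fit stands in for the learning algorithm, and "performance" is reduced to raw profit per trade rather than a Sharpe ratio:

```python
import numpy as np

prices = np.linspace(100, 120, 40)  # hypothetical price history
horizon, train_len = 5, 20
pnl = []

for t in range(train_len, len(prices) - horizon):
    # 1. Train on the window of data up to "today".
    X = np.arange(t - train_len, t)
    slope, intercept = np.polyfit(X, prices[t - train_len:t], 1)

    # 2. Forecast the price `horizon` days into the simulated future.
    forecast = slope * (t + horizon) + intercept

    # 3. Trade as if the forecast is accurate: long if a rise is
    #    predicted, short if a fall is predicted.
    position = 1 if forecast > prices[t] else -1

    # 4. Roll time forward and record how the bet actually worked out.
    pnl.append(position * (prices[t + horizon] - prices[t]))

total_profit = sum(pnl)
```

Crucially, each forecast uses only data available before the simulated trade date, which is what keeps the backtest honest.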
We can backtest our strategy against historical data to simulate the performance of our portfolio according to the strategy. Ultimately, the performance informs us whether the strategy is worth adopting.