Neo Zhao & Timothy Yang
Wealth is a status that many pursue but few attain. It comes in many forms and has even more uses. Some choose to save their lifetime earnings and slowly accumulate wealth, while braver souls choose to risk their earnings in the stock market.
Among the latter, some invest for long-term profit and some for short-term gain. The goal of long-term investors is often to grow their wealth at a rate exceeding inflation, or at the very least to outperform their next best alternative (such as the interest rate banks provide). They invest in companies they believe will continue to grow, because they believe in the profitability of those companies' business models.
The more daring, however, believe in 'beating the market', which means investing in a way that outperforms the S&P 500 index. They believe that at any given point some stocks are overrated and should be sold, while others are underrated and should be bought. By acting on this, they hope to outmaneuver their adversaries, other traders, and profit. Many try, but most fail. More information about 'beating the market' can be found here: https://www.investopedia.com/ask/answers/12/beating-the-market.asp#:~:text=The%20phrase%20%22beating%20the%20market,beat%20it%2C%20but%20few%20succeed
The S&P 500 (Standard & Poor's 500) tracks approximately the 500 largest publicly traded companies in the U.S. It is often regarded as the best gauge of how well major businesses are performing over a given time period. To learn more about the S&P 500, visit this site: https://www.investopedia.com/terms/s/sp500.asp
In the past, humans analyzed market trends to make their decisions. With the rapid improvement of computer processing speed and machine learning algorithms, however, trading is now largely done by and between computers. Computers analyze market trends and try to outperform their adversaries, other computers; the one with the more accurate algorithm, or better luck, profits.
Today, we will walk you through a simple tutorial on a rudimentary algorithm for modeling the stock market, with the goal of 'beating the market'. The model we train will predict whether a company's stock will go up or down from day to day. Based on this prediction, we can decide whether to buy or sell stocks.
For example, if we expect a stock to lose value, we can sell shares today and buy them back tomorrow; if our prediction is correct, we earn the difference in price times the number of shares sold. If we think a stock will gain value, we can buy shares today and sell them tomorrow; again, we earn the difference in price times the number of shares bought.
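As a toy illustration of this arithmetic (the prices and share counts below are made up purely for the example):
# hypothetical short sale: sell 10 shares at $100 today, buy them back at $95 tomorrow
shares = 10
profit_from_short = (100.00 - 95.00) * shares
# hypothetical long position: buy 10 shares at $100 today, sell them at $105 tomorrow
profit_from_long = (105.00 - 100.00) * shares
print(profit_from_short, profit_from_long)  # 50.0 dollars earned in each case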
However, we face some constraints: we do not have access to the most recent data, we cannot model how our machine's decisions influence the market, and we have limited processing power. The goal of this tutorial is not to build a model that will truly outperform the market, but to show the audience how such a model can be developed.
# imports
import requests
from bs4 import BeautifulSoup as bs
import pandas as pd
import datetime as dt
from datetime import datetime
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import tensorflow as tf
from tensorflow import keras
import math
from sklearn.model_selection import cross_validate
In this project, we'll focus mainly on the S&P 100, because we have limited computational resources. That said, Wikipedia has a nice table about the S&P 500 (Source: https://en.wikipedia.org/wiki/List_of_S%26P_500_companies), which contains information about each company and its industry classification. We scrape that table below:
r = requests.get('https://en.wikipedia.org/wiki/List_of_S%26P_500_companies')
root = bs(r.content, "html")
sp500 = pd.read_html(root.find('table').prettify())[0]
sp500.head()
To isolate just the S&P 100, we need some way to identify those companies, so here we also scrape another table from Wikipedia which lists the S&P 100 (Source: https://en.wikipedia.org/wiki/S%26P_100):
r = requests.get('https://en.wikipedia.org/wiki/S%26P_100')
root = bs(r.content, "html")
sp100 = pd.read_html(root.find_all('table')[2].prettify())[0]
sp100.head()
For this section, all of our stock data comes from a dataset from Kaggle:
https://www.kaggle.com/borismarjanovic/price-volume-data-for-all-us-stocks-etfs
This dataset is massive, containing information on over 8,500 funds and stocks, each stored in its own file. I've (Neo) downloaded the dataset and put it in a publicly available repository to facilitate its use in this project.
Since it would be a waste of memory to load all 8,500+ files into dataframes, we'll only pull the data corresponding to the S&P 500 companies in the table we scraped. The raw GitHub links are mostly identical, with the individual files named after the stock ticker. So, below we use a common URL head and tail to pull datasets for just the S&P 500 companies found on the Wikipedia list.
url_common_head = 'https://raw.githubusercontent.com/neo-zhao/CMSC320_Final_Tutorial_Huge_Stock_Market_Dataset/main/Stocks/'
url_common_tail = '.us.txt'
sp500stocks = pd.DataFrame()
unavailable_stocks = []
for i in range(sp500['Symbol'].size):
try:
data = pd.read_csv(url_common_head + sp500['Symbol'][i].lower() + url_common_tail)
data['Symbol'] = [sp500['Symbol'][i]]*data.shape[0]
sp500stocks = pd.concat([sp500stocks, data])
except:
print(sp500['Symbol'][i].lower() + url_common_tail + ' does not exist in the Huge Stock Market Dataset.')
unavailable_stocks += [sp500['Symbol'][i]]
print(str(len(unavailable_stocks)) + ' companies\' data could not be found in the Huge Stock Market Dataset')
sp500stocks.head()
Now we have everything we need; what remains is a lot of cleanup: cutting out irrelevant data and refining what is left. So, it's on to data processing from here.
In this section, we make use of Pandas and NumPy to manipulate our dataframes, which are Pandas-based objects. You can explore more of their functionality in the docs: https://pandas.pydata.org/docs/ and https://numpy.org/doc/
As we saw when compiling data from the Huge Stock Market Dataset, some of the companies in our table have no corresponding stock data, so we remove them:
# print initial size
print(sp500.shape[0])
# remove the unavailable stocks
sp500_1 = sp500[~sp500['Symbol'].isin(unavailable_stocks)]
# print new size
print(sp500_1.shape[0])
Now that that's done, let's take a look at the features of this dataframe:
sp500_1.head()
Most of this looks fine as is, but there are several improvements to be made: the 'SEC filings' and 'CIK' columns aren't useful for our purposes and can be dropped, and the 'Date first added' and 'Founded' columns should be converted to datetime objects. With these tasks listed, let's go through them one by one.
sp500_2 = sp500_1.drop(columns=['SEC filings'])
sp500_3 = sp500_2.drop(columns=['CIK'])
sp500_3.head()
Apparently, there is some inconsistent formatting in this column as well. Some strings have extra parenthesized dates trailing the initial date, and some are just NaN. Since we already have an abundance of data, we will simply drop these data points.
datetime_date_first_added = []
dropped_companies = []
for r in sp500_3.iterrows():
try:
datetime_date_first_added += [datetime.strptime(r[1]['Date first added'], '%Y-%m-%d')]
except:
datetime_date_first_added += [np.NaN]
dropped_companies += [r[1]['Symbol']]
sp500_4 = sp500_3.copy()
sp500_4['Date first added'] = datetime_date_first_added
sp500_4.head()
And now to remove the rows with NaN values:
# print initial size
print(sp500_4.shape[0])
# remove dropped companies
sp500_5 = sp500_4[~sp500_4['Symbol'].isin(dropped_companies)]
# print new size
print(sp500_5.shape[0])
Now, just to make sure the type of the object in the 'Date first added' column is a datetime, we print out the dtypes of the dataframe:
sp500_5.dtypes
Similar to how we handled the 'Date first added' column, we will just drop any row which has inconsistent formatting and convert the others into datetime objects. So, first we build the new 'Founded' column:
founded = []
for r in sp500_5.iterrows():
try:
founded += [datetime.strptime(r[1]['Founded'], '%Y')]
except:
founded += [np.NaN]
dropped_companies += [r[1]['Symbol']]
sp500_6 = sp500_5.copy()
sp500_6['Founded'] = founded
sp500_6.head()
And now to remove the rows with NaN values:
# print initial size
print(sp500_6.shape[0])
# remove dropped companies
sp500_7 = sp500_6[~sp500_6['Symbol'].isin(dropped_companies)]
# print new size
print(sp500_7.shape[0])
Just like before, let's take a look at the dtypes to make sure the 'Founded' column has been successfully converted into datetime objects:
sp500_7.dtypes
First, remove unavailable_stocks from sp100
# print initial size
print(sp100.shape[0])
# remove the unavailable stocks
sp100_1 = sp100[~sp100['Symbol'].isin(unavailable_stocks)]
# print new size
print(sp100_1.shape[0])
Now, remove all data points from sp500 and sp500stocks which do not share a symbol with sp100:
# print initial size
print(sp500_7.shape[0])
# remove the unavailable stocks
sp500_7_1 = sp500_7[sp500_7['Symbol'].isin(sp100_1['Symbol'])]
# print new size
print(sp500_7_1.shape[0])
# print initial size
print(sp500stocks.shape[0])
# remove dropped companies
sp500stocks_0_1 = sp500stocks[sp500stocks['Symbol'].isin(sp100_1['Symbol'])]
# print new size
print(sp500stocks_0_1.shape[0])
With this, only data for companies in the S&P 100 remains in our dataframes.
Before anything else, there's a list of dropped companies which need to be removed from this dataframe:
# print initial size
print(sp500stocks_0_1.shape[0]) # remove _0_1 if 500
# remove dropped companies
sp500stocks_1 = sp500stocks_0_1[~sp500stocks_0_1['Symbol'].isin(dropped_companies)] # remove _0_1 if 500
# print new size
print(sp500stocks_1.shape[0])
Just to be sure, lets compare the unique values of the 'Symbol' column for sp500 and sp500stocks. Since they should have the same unique values, they should be equal after sorting. Since we're working with numpy arrays, a comparison of the sorted arrays should result in an array filled with Trues. Let's verify this:
sp500su = sp500stocks_1['Symbol'].unique()
sp500su.sort()
sp500u = sp500_7_1['Symbol'].unique() # remove _1 if 500
sp500u.sort()
sp500u == sp500su
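If you'd prefer a single True/False instead of scanning the array by eye, the elementwise comparison can be collapsed (a small convenience check, equivalent to the cell above):
# collapse the elementwise comparison into one boolean
(sp500u == sp500su).all()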
Now, let's take a look at the features of this dataframe and come up with a checklist for cleaning:
sp500stocks_1.head()
There are several issues to address: the 'OpenInt' column doesn't appear to carry any information, the 'Date' column needs to be converted to datetime objects, and the date ranges covered by the different companies need to be reconciled.
The 'OpenInt' column doesn't look too useful based on the head that we displayed earlier. Let's take a more complete look at it to see what unique values it has:
sp500stocks_1['OpenInt'].unique()
It appears that the 'OpenInt' column is just a column full of zeroes. Since it has no variation at all between any of the stocks, it is completely useless as a feature for distinguishing between them. So, we can drop it:
sp500stocks_2 = sp500stocks_1.drop(columns=['OpenInt'])
sp500stocks_2.head()
Ideally, the 'Date' column should hold datetime objects, and the price and volume columns should already be numeric.
Before anything else, let's see what the current types are for this dataframe:
sp500stocks_2.dtypes
So, it looks like all we need to do is to convert the objects in the 'Date' column into datetime objects.
In case anything has an inconsistent format, we will keep a dataframe of inconsistent rows and deal with them as needed. Unlike the previous dataset, where we could simply drop a row, here we also have to worry about dates, since we don't want any gaps in time in the data for any given company.
date = []
inconsistent_rows = pd.DataFrame(columns=sp500stocks_2.columns)
#
for r in sp500stocks_2.iterrows():
    try:
        date += [datetime.strptime(r[1]['Date'], '%Y-%m-%d')]
    except:
        date += [np.NaN]
        # DataFrame.append returns a new dataframe rather than modifying in place,
        # so collect inconsistent rows with pd.concat instead
        inconsistent_rows = pd.concat([inconsistent_rows, pd.DataFrame([r[1]])])
sp500stocks_3 = sp500stocks_2.copy()
sp500stocks_3['Date'] = date
sp500stocks_3.head()
Just to be sure we've converted everything successfully, let's examine the new types for the dataframe:
sp500stocks_3.dtypes
Next, let's take a look at the inconsistent_rows. Hopefully, it's empty. Otherwise, we'll have some more cleaning to do.
inconsistent_rows.head()
Thankfully, it's empty, which means we're done with the data conversion section of the cleaning.
Next, we need to look at the range of dates each company covers in this dataframe. The Huge Stock Market Dataset does not provide the same date range of stock data for every company. For this project, we need a period of time during which all companies have stock data, preferably a window of about one year. Depending on how the individual companies' timeframes overlap, this may mean cutting more companies from the two datasets.
First, let's compile a new dataframe from the data in sp500stocks to find the date range for any given stock:
date_ranges = pd.DataFrame()
symbols = []
min_dates = []
max_dates = []
for stock_name in sp500stocks_3['Symbol'].unique():
filtered_rows = sp500stocks_3[sp500stocks_3['Symbol'] == stock_name]
min_date = filtered_rows.iloc[0]['Date']
max_date = filtered_rows.iloc[0]['Date']
for r in filtered_rows.iterrows():
if min_date > r[1]['Date']:
min_date = r[1]['Date']
if max_date < r[1]['Date']:
max_date = r[1]['Date']
symbols += [stock_name]
min_dates += [min_date]
max_dates += [max_date]
date_ranges['Symbol'] = symbols
date_ranges['Earliest Date'] = min_dates
date_ranges['Latest Date'] = max_dates
date_ranges.head()
Now that we have the start and end dates for all these companies, we can take the max of the 'Earliest Date' column and the min of the 'Latest Date' column to find the timeframe in which all companies have stock data.
# max of earliest dates
print(date_ranges['Earliest Date'].max())
# min of latest dates
print(date_ranges['Latest Date'].min())
# print length of timeframe
print(date_ranges['Latest Date'].min() - date_ranges['Earliest Date'].max())
Unfortunately, this timeframe is too short. This means we'll need to start dropping companies until the overlapping timeframe is an appropriate length. Here, we drop companies until the overlapping timeframe is greater than 365 days:
dropped_companies = []
date_ranges_cp = date_ranges.copy()
target_delta_time = dt.timedelta(days=365)
# greedily drop companies until the overlapping window is at least 365 days long
while target_delta_time > date_ranges_cp['Latest Date'].min() - date_ranges_cp['Earliest Date'].max():
    # the company whose data starts the latest and the company whose data ends the earliest
    max_ed_symbol = date_ranges_cp[date_ranges_cp['Earliest Date'] == date_ranges_cp['Earliest Date'].max()]['Symbol'].iloc[0]
    min_ld_symbol = date_ranges_cp[date_ranges_cp['Latest Date'] == date_ranges_cp['Latest Date'].min()]['Symbol'].iloc[0]
    if max_ed_symbol == min_ld_symbol:
        # the same company limits both ends of the window, so drop it
        dropped_companies += [max_ed_symbol]
        date_ranges_cp = date_ranges_cp[date_ranges_cp['Symbol'] != max_ed_symbol]
    else:
        # how much the window would lengthen from dropping each candidate
        gain_from_ed = date_ranges_cp['Earliest Date'].max() - date_ranges_cp[date_ranges_cp['Symbol'] != max_ed_symbol]['Earliest Date'].max()
        gain_from_ld = date_ranges_cp[date_ranges_cp['Symbol'] != min_ld_symbol]['Latest Date'].min() - date_ranges_cp['Latest Date'].min()
        if gain_from_ed <= gain_from_ld:
            dropped_companies += [max_ed_symbol]
            date_ranges_cp = date_ranges_cp[date_ranges_cp['Symbol'] != max_ed_symbol]
        else:
            dropped_companies += [min_ld_symbol]
            date_ranges_cp = date_ranges_cp[date_ranges_cp['Symbol'] != min_ld_symbol]
print(dropped_companies)
len(dropped_companies)
After removing the one company, 'DD', the length of our overlapping timeframe is now:
print(date_ranges_cp['Latest Date'].min() - date_ranges_cp['Earliest Date'].max())
Next, we have to remove all those companies and their stocks from our two main dataframes (sp500 and sp500stocks):
# remove from sp500
# print initial size
print(sp500_7_1.shape[0]) # remove _1 if 500
# remove dropped companies
sp500_8 = sp500_7_1[~sp500_7_1['Symbol'].isin(dropped_companies)] # remove _1 if 500
# print new size
print(sp500_8.shape[0])
# remove from sp500stocks
# print initial size
print(sp500stocks_3.shape[0])
# remove dropped companies
sp500stocks_4 = sp500stocks_3[~sp500stocks_3['Symbol'].isin(dropped_companies)]
# print new size
print(sp500stocks_4.shape[0])
Lastly, we also need to shave off the hanging dates so that the stock data for every company starts and ends on the same day.
max_ed = date_ranges_cp['Earliest Date'].max()
min_ld = date_ranges_cp['Latest Date'].min()
# print initial size
print(sp500stocks_4.shape[0])
# shave data
sp500stocks_5 = sp500stocks_4[(sp500stocks_4['Date'] >= max_ed) & (sp500stocks_4['Date'] <= min_ld)]
# print new size
print(sp500stocks_5.shape[0])
Now that both dataframes have been cleaned and refined, we'll merge them together to form the master dataframe for this project. This single dataframe will contain all the data we need, in a tidy format.
First, a review of the shape and features of our two main dataframes which contain all the data we'll use:
print(sp500_8.shape)
sp500_8.head()
print(sp500stocks_5.shape)
sp500stocks_5.head()
For each data point in sp500stocks, we want to append the relevant information from the sp500 dataframe using the Symbol column as the common key.
MASTER_DF = sp500stocks_5.merge(sp500_8, left_on='Symbol', right_on='Symbol')
MASTER_DF.head()
Some code to verify its shape and key characteristics:
# print shape and compare to sp500stocks (should have same number of rows)
print(MASTER_DF.shape)
print(sp500stocks_5.shape)
# get number of unique companies and compare to sp500 (They should be the same)
print(len(MASTER_DF['Symbol'].unique()))
print(len(sp500_8['Symbol'].unique()))
With this, the master dataframe is complete. In the following sections, all data used will come from this one dataframe. In order to preserve it in its tidy form, no modification operations should be performed directly on this dataframe (it would be too much trouble to recreate if damaged); copies should be used instead. The dataframe variable is in all caps as a reminder of this rule.
Now that we have our data, we can start considering how to best use this data to build our stock trading robot. In order to best do that, we need to get a better feel for the contents of our dataframe and how they relate with each other.
We'll start by just taking a look at the general stock trends for all of our data points over the time period that we've isolated. Then, we can go on from there to investigate any trends that appear interesting or relevant to consider.
In this section we use Matplotlib to generate plots. The general documentation can be explored here: https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.plot.html
Let's first create a copy of our Master Dataframe and review the various features it contains:
m_cp_1 = MASTER_DF.copy()
m_cp_1.head()
The features that seem most relevant are the Open, High, Low, and Close columns for each date and stock. To get a feel for how these values change for each stock throughout our selected timeframe, let's first create a new column called Average, which is simply the mean of the Open and Close prices. We can then use this Average as a metric for examining the change in stock prices over time.
m_cp_1['Average'] = (m_cp_1['Open'] + m_cp_1['Close'])/2
m_cp_1.head()
Now that we have an Average for each datapoint, we can pivot this dataframe by Date and Symbol to focus on the values of Average for each Date and Symbol, and graph it accordingly.
# Pivot dataframe
m_cp_1_p = m_cp_1.pivot(index='Date', columns='Symbol', values='Average')
m_cp_1_p.head()
# graph pivoted dataframe
m_cp_1_p.plot(title="Stock Average per Company as a function of time", legend=False, ylabel="Stock Price", figsize=(20,15))
From this graph, it looks like there are several outliers which seem to mirror each other in their growth trends. Most of the stock prices are below 200. There appears to be a general upward trend in stock prices over time, but it's hard to say for sure from this graph alone due to the heavy concentration of lines below the 200 mark.
m_cp = MASTER_DF.copy()
m_cp.head()
Additional features that seem relevant are the Sector and the Growth, since some sectors may be more profitable overall than others, and Growth represents the percentage increase in stock value from day to day. Companies with high Growth are the profitable ones, and this is what our machine is trying to predict. We calculate Growth as (Close / Open - 1) * 100%.
m_cp['Growth'] = (m_cp['Close'] / m_cp['Open'] - 1) * 100
m_cp.head()
Additionally, we would like to calculate the Compounded Growth, which for each day and company we compute as: $$\mbox{Compounded Growth} = \frac{\mbox{Close Price on that Day} - \mbox{Start Date Open Price}}{\mbox{Start Date Open Price}}$$
Basically, we are modeling how the stock price on any given day compares to its initial stock price value, standardized by that initial value.
# groupby will group the dataframe based on unique symbol values
# apply will modify the value of each grouping based on formula provided
compounded_growth = m_cp.groupby('Symbol').apply(lambda x: (x['Close'] - x['Open'].iloc[0]) / x['Open'].iloc[0])
# reset_index undoes the grouping, returning to original dataframe (keeping modifications)
compounded_growth = compounded_growth.reset_index()
m_cp['Compounded Growth'] = compounded_growth['Close']
m_cp.head()
Now we will average (mean) the Compounded Growth across all companies within the same Sector. To do so, we will calculate:
$$\frac{1}{\mbox{\# of Companies in the Sector}}\sum_{\mbox{Companies in the Sector}}[\mbox{Compounded Growth of Company}]$$ for each sector.
sectorlst = m_cp['GICS Sector'].drop_duplicates()
daterange = m_cp['Date'].drop_duplicates()
growthplt = pd.DataFrame({'Date': daterange})
# Calculating unique average sector growth for each sector
for sector in sectorlst:
m_sector = m_cp.loc[m_cp['GICS Sector'] == sector]
sector_growth = []
# Calculating the average sector growth for each day
# by summing up compounded growth for every company
# of a given sector for that given day
for date in daterange:
m_sector_day = m_sector.loc[m_cp['Date'] == date]
average_growth = m_sector_day['Compounded Growth'].sum() / len(m_sector_day)
sector_growth.append(average_growth)
growthplt[sector] = sector_growth
growthplt.head()
growthplt = growthplt.drop(columns = ['Date'])
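As an aside, the same per-sector averages can be computed more concisely with a pandas groupby. This sketch reproduces the loop above (up to column order), with the Date as the index rather than a separate column:
# mean Compounded Growth per (Date, Sector), with each sector unstacked into its own column
growthplt_alt = m_cp.groupby(['Date', 'GICS Sector'])['Compounded Growth'].mean().unstack('GICS Sector')
growthplt_alt.head()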
Now we shall plot the sector growth vs Date
growthplt.plot(figsize = (30, 12))
plt.xlabel('Days')
plt.ylabel('Average Sector Growth')
plt.title('Average Sector Growth over Time per Sector')
plt.show()
We notice that the growth curves of the sectors are quite similar to each other; they all seem to follow a common trend (excluding energy). This should not be alarming, since it falls within our expectations. Generally speaking, the stock market reflects the economic situation of a nation, so fluctuations within one sector will be closely mirrored by fluctuations within another. The exception is when a sector experiences extra growth due to external factors, such as technological advancements or scientific breakthroughs. Since the energy sector diverges from the trend, this might indicate that some breakthrough in energy production occurred during this period, or that some national policy (such as deregulation of the sector) improved its performance.
Now let's look at how companies within the same sector compare to each other. We will create a separate plot for each of the 10 sectors, including the sector average in black for comparison.
def make_sector_plot(sector):
pplt = m_cp.loc[(m_cp['GICS Sector'] == sector)].pivot(index = 'Date', columns = 'Symbol', values = 'Compounded Growth')
pplt.plot(figsize = (30, 12))
# plotting the average industry growth for comparison (in black so it stands out)
plt.plot(daterange, growthplt[sector], color = 'black')
plt.title(sector)
plt.xlabel('Date')
plt.ylabel('Compounded Growth')
plt.show()
# reindexing so that ith index corresponds to ith sector
sectorlst.index = range(len(sectorlst))
make_sector_plot(sectorlst[0])
make_sector_plot(sectorlst[1])
make_sector_plot(sectorlst[2])
make_sector_plot(sectorlst[3])
make_sector_plot(sectorlst[4])
make_sector_plot(sectorlst[5])
make_sector_plot(sectorlst[6])
make_sector_plot(sectorlst[7])
make_sector_plot(sectorlst[8])
make_sector_plot(sectorlst[9])
We notice immediately that the massive average growth of the energy sector is driven by the company OXY. This might indicate that this company is highly profitable, or that it is overrated. Furthermore, we can see that some companies underperform within their sector; this may indicate that they are losing profitability or that they are underrated.
Additionally, we learn that the profitability of companies varies significantly across sectors. We find that utilities and real estate experience the least growth, whereas energy experiences significant growth. From this, we are led to believe that the sector a company belongs to influences its profitability.
The Volume of stocks traded has also been noted to indicate shifts in trader interest in a company. For example, a spike in Volume could be due to panic that a company has lost profitability, or excitement that it has gained profitability, so everyone scrambles to sell or buy its stock. Therefore, we will create a distribution plot of Growth vs Volume. First, we define interval ranges of 10 million shares for the Volume of stocks traded.
interval_range = 10000000
m_cp['Volume Range'] = m_cp['Volume'].apply(lambda x: int(x / interval_range))
m_cp.head()
Now we build the distribution plots; to do this, we will use Python's Seaborn library. For more information on building violin plots, see https://seaborn.pydata.org/generated/seaborn.violinplot.html
sns.set(rc = {'figure.figsize':(30,8)})
ax = sns.violinplot(x = 'Volume Range', y = 'Growth', data = m_cp)
ax.set(xlabel = 'Volume of Stocks Traded (in 10 millions)', ylabel = 'Growth', title = 'Distribution of Growth Based on Volume of Stocks Traded')
plt.show()
The variance and range of the distribution seems to increase as the volume of stocks traded increases. However, this may be due to fewer data points for large volume trading.
Therefore, we will make a scatter plot of the individual data points using the Matplotlib library. To learn more, visit this site: https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html
plt.scatter('Volume', 'Growth', data = m_cp)
plt.xlabel('Volume of Stocks Traded')
plt.ylabel('Growth')
plt.title('Scatter Plot of Growth vs Volume of Stocks Traded')
plt.show()
As we feared, the variation can largely be explained by the scarcity of data points at high volumes. However, we might find a more meaningful trend if we instead look at the change in volume of stocks traded. To calculate this, we shift the volumes by one day and subtract the original column.
# After grouping, volume[i] - volume[i + 1] is assigned to change_volume[i + 1] for all i
# (the previous day's volume minus the current day's volume)
change_volume = m_cp.groupby('Symbol').apply(lambda x: x['Volume'].shift(1) - x['Volume'])
change_volume = change_volume.reset_index()
m_cp['Volume Change'] = change_volume['Volume']
m_cp.head()
Time to build the scatter plot of Growth vs Volume Change
plt.scatter('Volume Change', 'Growth', data = m_cp)
plt.xlabel('Change in Volume of Stocks Traded')
plt.ylabel('Growth')
plt.title('Scatter Plot of Growth vs Change in Volume of Stocks Traded')
plt.show()
The data seems to indicate the opposite of what our intuition predicted. Large changes in volume are typically met with low growth, with a few exceptions, while the largest growth or loss is observed where there is little to no change in the volume of stocks traded. From this, we conclude that there seems to be no correlation between growth and change in volume.
Now, to set up a benchmark for later analysis, we will average the growth across all companies, giving a metric for how well we would like our model to perform over this period of time.
In order to do this we will calculate: $$\frac{1}{\mbox{# of Companies}}\sum_{\mbox{All Companies}}^{}[\mbox{Compounded Growth of Company}]$$
This formula should look familiar, since it's the same one used to calculate the average per sector, except that now we want the average over the whole S&P 100.
all_growth = []
# average the compounded growth of all companies for each day to get
# the overall S&P 100 growth
for date in daterange:
    m_all_day = m_cp.loc[m_cp['Date'] == date]
    average_growth = m_all_day['Compounded Growth'].sum() / len(m_all_day)
    all_growth.append(average_growth)
growthplt['All'] = all_growth
growthplt.head()
Now we plot this against time.
plt.plot(daterange, growthplt['All'])
plt.xlabel('Date')
plt.ylabel('Compounded Growth')
plt.title('Average Growth of S&P100 over Time Period')
plt.show()
The exact measure of how much the S&P 100 grew over our designated time period is just the average Compounded Growth value at the end of the period:
growthplt['All'][-1:]
This comes out to around 9%. We will also calculate the average magnitude of Growth each day, to approximate how much profit a correct prediction (or loss an incorrect one) would yield.
In order to calculate this, we compute: $$\frac{1}{\mbox{\# of Entries}}\sum_{\mbox{All Entries}}\left|\mbox{Growth of Entry}\right|$$
sum_magnitude = 0
for i in m_cp['Growth']:
sum_magnitude += abs(i)
print('Sum magnitude of growth: ', sum_magnitude)
print('Average magnitude of growth: ', sum_magnitude / len(m_cp))
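The same number can be cross-checked in one line with pandas (equivalent to the loop above, assuming no missing Growth values):
# mean absolute daily Growth, computed directly
print('Average magnitude of growth: ', m_cp['Growth'].abs().mean())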
From this we find that for each correct prediction we expect to see an increase in profit of around 0.8%.
In this section, we will apply a neural network to predict whether the opening price of a given stock will increase, given the open, high, low, close, and volume of that stock on previous days. The model used here is based on the TensorFlow Keras Sequential model guide, which is linked here: https://www.tensorflow.org/guide/keras/sequential_model
Since the neural net (NN) can only take a set number of inputs, we will have to choose a limit for the backlog of stock data that we feed into the model. For now, we'll set this limit to be 20 days.
backlog = dt.timedelta(days=20)
backlog.days
Now that we've set the backlog, let's consider how the data will be passed in and what sort of output we should expect. This will help us determine the shape of our model.
For input, for each company, there will be 20 days worth of data with each day containing the five metrics: open, high, low, close, and volume. This means that for each company, we'll need 100 nodes in the input layer. Taking a look at the number of companies we have:
len(m_cp['Symbol'].unique())
We have 78 companies, so the input layer will need 7810 nodes: 7800 for the stock data, plus an extra 10 nodes to encode the industry of the company whose stock we're interested in.
For output, we'll have just 2 nodes: one holding the probability that the target company's opening price will increase, and one holding the probability that it will not.
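As a quick sanity check of the node counts described above (a small sketch; n_companies is just a convenience name introduced here):
# 20 backlog days * 5 metrics per day * 78 companies, plus 10 industry indicator nodes
n_companies = len(m_cp['Symbol'].unique())
print(backlog.days * 5 * n_companies + 10)  # expected: 7810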
So, based on all of that, we can build our model:
model = keras.Sequential([
keras.layers.Dense(backlog.days*5*len(m_cp['Symbol'].unique()) + 10), # input layer (1)
keras.layers.Dense(len(m_cp['Symbol'].unique())*backlog.days, activation='relu'), # hidden layer (2)
keras.layers.Dense(2, activation='softmax') # output layer (3)
])
model.compile(optimizer='adam',
loss='sparse_categorical_crossentropy',
metrics=['accuracy'])
Notice that we also have a hidden layer between our input and output layer. In this case, we just arbitrarily decided that this hidden layer should have 1560 nodes.
All the layers are densely connected, which means that every node from the previous layer is connected to every node in the next layer.
With this, our model is built and ready to be trained.
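If you'd like to inspect the layer sizes before training, you can optionally build the model with an explicit input shape and print a summary. This is just for inspection, assumes the 7810-node input described above, and doesn't change the training that follows:
# optional inspection: build with the expected input width, then print layer shapes and parameter counts
model.build(input_shape=(None, backlog.days * 5 * len(m_cp['Symbol'].unique()) + 10))
model.summary()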
Now that we have our model, we need to prepare our data to feed into it. More specifically, we need to take the data from our master dataframe and split it into a set of x and y values corresponding to the input and output of the model. Let's start by making a fresh copy of the master dataframe to work with:
m_cp = MASTER_DF.copy()
m_cp.head()
The NN's outputs are probabilities in the range [0, 1], and it trains more easily when its inputs are on a comparable, bounded scale; our raw data, however, potentially spans the entire real number line. To bring everything closer to the [0, 1] range, we'll create normalized versions of the columns we'll use:
m_cp['N_Open'] = (m_cp['Open'].mean() - m_cp['Open']) / m_cp['Open'].std()
m_cp['N_High'] = (m_cp['High'].mean() - m_cp['High']) / m_cp['High'].std()
m_cp['N_Low'] = (m_cp['Low'].mean() - m_cp['Low']) / m_cp['Low'].std()
m_cp['N_Close'] = (m_cp['Close'].mean() - m_cp['Close']) / m_cp['Close'].std()
m_cp['N_Volume'] = (m_cp['Volume'].mean() - m_cp['Volume']) / m_cp['Volume'].std()
m_cp.head()
Now, everything is closer to [0,1], but not completely within the acceptable range. So, we'll use a sigmoid encoding to force everything to fit within the appropriate range.
m_cp['SE_N_Open'] = 1 / (1 + np.exp(-m_cp['N_Open']))
m_cp['SE_N_High'] = 1 / (1 + np.exp(-m_cp['N_High']))
m_cp['SE_N_Low'] = 1 / (1 + np.exp(-m_cp['N_Low']))
m_cp['SE_N_Close'] = 1 / (1 + np.exp(-m_cp['N_Close']))
m_cp['SE_N_Volume'] = 1 / (1 + np.exp(-m_cp['N_Volume']))
m_cp.head()
Now that all the considered data is within the appropriate range, we will pivot this data by date and symbol to focus on the values we care about:
m_cp_p = m_cp.pivot(index='Date', columns='Symbol', values=['SE_N_Open', 'SE_N_High', 'SE_N_Low', 'SE_N_Close', 'SE_N_Volume'])
m_cp_p.head()
Now, let's build the corresponding x's and y's. Each y will contain either a 0 or a 1, representing whether the opening stock price increased. For each element in y, there will be a corresponding array in x of length 7810, containing the industry encoding and the 20 previous days of market data.
But, first, we need to pick a company from our 78 options:
options = m_cp['Symbol'].unique()
options.sort()
options
We decided to just go with 'AAPL' for our tutorial.
target_company = 'AAPL'
Here, we'll also do some prep work for encoding industries:
def encode_target_industry():
    # one-hot encode the GICS sector of the target company
    target_industry = m_cp[m_cp['Symbol'] == target_company]['GICS Sector'].unique()[0]
    industries = m_cp['GICS Sector'].unique()
    # find the index of the target's sector among the unique sectors
    for i in range(len(industries)):
        if industries[i] == target_industry:
            break
    # set a 1 at that index in an otherwise all-zero vector of length 10
    encoding = np.zeros(10)
    encoding[i] = 1
    return encoding
Now, time to prepare the x and y data from the dataframe:
xs = []
ys = []
for i in range(m_cp_p.shape[0] - 20):
### prepare xs
x = np.array([])
# handle industry stuff
x = np.append(x, encode_target_industry())
# handle prev 20 days data
for j in range(20):
params = np.append(
[m_cp_p.iloc[i + j]['SE_N_Open']],
np.append(
[m_cp_p.iloc[i + j]['SE_N_High']],
np.append(
[m_cp_p.iloc[i + j]['SE_N_Low']],
np.append(
[m_cp_p.iloc[i + j]['SE_N_Close']],
[m_cp_p.iloc[i + j]['SE_N_Volume']],
)
)
)
)
x = np.append(x, params)
xs += [x]
# prepare ys
if m_cp_p.iloc[i + 20]['SE_N_Open']['AAPL'] > m_cp_p.iloc[i + 19]['SE_N_Close']['AAPL']:
ys += [1]
else:
ys += [0]
np_xs = np.array(xs)
np_ys = np.array(ys)
Now, let's verify that the shapes of both np_xs and np_ys are correct:
np_xs.shape
np_ys.shape
The shapes are as expected, and we can see that we have 576 x/y pairs. From these, we'll take a portion as training data and a portion as testing data; preferably, the training data should be about 10 times larger than the testing data.
split_index = math.floor(np_ys.shape[0]/11*10)
train_xs = np_xs[:split_index]
train_ys = np_ys[:split_index]
test_xs = np_xs[split_index:]
test_ys = np_ys[split_index:]
Now that we have our model, training data, and testing data, we can move on to fitting our model.
The TensorFlow library handles all the fitting for us, and will follow the shape and optimization configurations that we established in the designing model section. The only new thing we need to input is the number of epochs, which is just the number of times the model goes through our training data in the learning process.
But first, let's verify the shape of the training sets:
train_xs.shape
train_ys.shape
And now, to fit the model:
model.fit(train_xs, train_ys, epochs=80)
We will run a rudimentary test on our testing data to measure how well our model predicts stocks.
test_loss, test_acc = model.evaluate(test_xs, test_ys, verbose=1)
print('Test accuracy:', test_acc)
Our accuracy of around 0.566 looks decent, but this may be due to chance. Note that an accuracy of about 0.566 means roughly 13 more correct predictions than incorrect ones for every 100 predictions, so we should expect around 95 additional correct predictions for AAPL over this roughly 2-year period. If we assume correct and incorrect predictions yield approximately equal gains and losses, then with an average gain of 0.8% per prediction we get (1.008)^95, approximately 2.132. This far exceeds the market's growth of only about 1.09. If we take into account a high tax rate of around 20% on our profit, the average yield drops to 0.64%, for a net profit of about (1.0064)^95, approximately 1.833. In theory we would have beaten the market, but it is very likely that we just got lucky in this instance. We used the following formula to approximate profitability from accuracy:
$$\left(1 + \mbox{Approximate Growth} \times (1 - \mbox{Tax Rate})\right)^{(2 \times \mbox{Accuracy} - 1) \times \mbox{\# of Days}}$$
For now, we will also run 10-fold cross-validation. For more information about k-fold cross-validation, visit https://machinelearningmastery.com/k-fold-cross-validation/
number_folds = 10
# # of iterations we train our model over the provided training set
number_epochs = 20
kfolds = np.array_split(np_xs, number_folds)
# slices represents the index for how the k folds will be partitioned
slices = [0]
index = 0
for i in range(number_folds):
index += len(kfolds[i])
slices.append(index)
scorelst = np.empty(number_folds)
# runs the k iterations
for i in range(number_folds):
    # training data: everything except the i-th held-out fold
    training_data = np.append(np_xs[:slices[i]], np_xs[slices[i + 1]:], axis = 0)
    training_target = np.append(np_ys[:slices[i]], np_ys[slices[i + 1]:], axis = 0)
    # the i-th fold is held out for testing
    testing_data = np_xs[slices[i]:slices[i + 1]]
    testing_target = np_ys[slices[i]:slices[i + 1]]
    model.fit(training_data, training_target, epochs = number_epochs)
loss, score = model.evaluate(testing_data, testing_target, verbose = 1)
scorelst[i] = score
for i in range(len(kfolds)):
print(f'Score of {i}th fold' , scorelst[i])
print('Average accuracy of Model: ', scorelst.mean())
From our 10-fold validation, we find an average accuracy of around 0.545. Using the method above, this corresponds to an expected profitability of around 1.52. This may seem to indicate that our model is quite good, but the reality is that our data is outdated and a lot of the results were approximated. The actual calculations are far more complicated and are beyond the scope of this tutorial. If you want to learn more about common trading practices, visit https://www.investopedia.com/options-basics-tutorial-4583012#:~:text=A%20call%20option%20gives%20the,right%20to%20sell%20a%20stock.
We hope that after this tutorial you have developed a better understanding of how to train a machine to make predictions on the stock market. In all likelihood, machines will make better predictions than humans, at least with regard to short-term investments. As computational speeds increase, machine learning algorithms can be trained on larger datasets with far more features, increasing the accuracy of their predictions. It is very likely that the presence of computer traders will only grow with technological advancement. When that happens, it may very well become a race for computational power, separating those who 'beat the market' from the rest who fail.