Analysing Stock Data Collected from Polygon.io API

Fang Wu

Required Packages
Define User-friendly Functions to Query Data
- Reference Data Endpoints
- Market Data Endpoints
EDA

This document is a vignette to explain how I collect and perform basic EDA on stock data from Polygon.io API. In the first session, I built a few user-friendly functions to interact with the Polygon API to query, parse, and return well-formatted data. In the second session, I performed a basic exploratory data analysis (EDA) on historical stock data for companies – Apple, Google, Tesla, and Zoom.

Required Packages

I used the following packages to interact with API and do EDA:

httr: importing files from Web APIs.
dplyr: manipulating data.
jsonlite: converting JSON data to R objects.
ggplot2: visualizing the data.
lubridate: converting string ‘date’ to R date object.
zoo: using function rollmean() to calculate the moving price.
tidyr: changing the shape of the tibble.
GGally: plotting correlation between variables.
purrr: combining data frame using reduce() function

library(httr)
library(dplyr)
library(jsonlite)
library(ggplot2)
library(lubridate)
library(zoo)
library(tidyr)
library(corrplot)
library(GGally)
library(purrr)

Define User-friendly Functions to Query Data

The Polygon.io API provides REST endpoints that let companies and developers query the latest market data for stocks, options, forex, and crypto. For each part, there are market data endpoints and reference data endpoints. In this vignette, I will focus on stock market.

In order to interact with Polygon API and return well-formatted data, I am going to define some user-friendly functions, which means that users can customize their query to return specific data. I am going to query adjusted price as default in the following functions, since adjusted price incorporates events like splits and dividends distribution, which can affect the series.

Reference Data Endpoints

Ticker Types

All ticker types that Polygon.io has are listed in this endpoints. I define a function called type_info to connect with this endpoints to return a tibble with type description, code used in Polygon.io, asset_class, and locale.

type_info <- function(class=NULL, locale=NULL){
    mainURL <- "https://api.polygon.io/v3/reference/tickers/types?"
    apikey <- "&apiKey=dJT0WQZ7GwH45bAZ8TBZT3KusMgjNJM2"
    if (! is.null(class)) {class <- paste0("asset_class=", class) }
    if (! is.null(locale)) {locale <- paste0("&locale=", locale) }
    ticker_types <- GET(paste0(mainURL, class, locale,  apikey))
    types_data <- fromJSON(rawToChar(ticker_types$content))
    types <- types_data$results %>% as_tibble() %>% rename(type=description )
    return(types)
}

#check the function to query all types in US stock market 
type_data <- type_info(class="stocks")
type_data

## # A tibble: 21 x 4
##    code    type     asset_class locale
##    <chr>   <chr>    <chr>       <chr> 
##  1 CS      Common ~ stocks      us    
##  2 PFD     Preferr~ stocks      us    
##  3 WARRANT Warrant  stocks      us    
##  4 RIGHT   Rights   stocks      us    
##  5 BOND    Corpora~ stocks      us    
##  6 ETF     Exchang~ stocks      us    
##  7 ETN     Exchang~ stocks      us    
##  8 SP      Structu~ stocks      us    
##  9 ADRC    America~ stocks      us    
## 10 ADRW    America~ stocks      us    
## # ... with 11 more rows

In this tibble, code is what Polygon.io uses to refer to type. User can check up the abbreviation for each ticker type through this function.

Then I define another helper function convert_to_code to help user convert type name to code used in polygon.io. So users can use either abbreviation or full type name in other data queries.

convert_to_code <- function(text, class=NULL, locale=NULL){
    types_info <- type_info(class, locale)
    if (text %in% types_info$code){
        return(text)
    } else if  (tolower(text) %in% tolower(types_info$type)){
        return(types_info$code[tolower(types_info$type)==tolower(text)] )
    } else {print("No such type supported by polygon.io")}
}
#check this helper function 
ADRC <- convert_to_code(text="American Depository Receipt Common")
ADRC

## [1] "ADRC"

Tickers

I am going to define a function called tickers_supported to query ticker symbols which are supported by Polygon.io. This function allow user specify the type of the tickers, market type, and if the tickers returned should be actively traded on the queried date. With the helper function convert_to_code, user can type the abbreviation for the ticker type or description.

tickers_supported <- function(type=NULL, market=NULL, active=TRUE){
    mainURL <- "https://api.polygon.io/v3/reference/tickers?"
    apikey <- ""
    limit <- "limit=1000"
    if (! is.null(type)) {type <- paste0("type=",convert_to_code(text=type),"&")}
    if (! is.null(market)) {market <- paste0("market=",market,"&")}
    active <- paste0("active=",active,"&")
    option <- paste0(type,market,active,limit)
    next_url<- TRUE
    tickers_info <- NULL
    while (! is.null(next_url)) {
        tikers_raw <- GET(paste0(mainURL, option, apikey))
        tickers_data <- fromJSON(rawToChar(tikers_raw$content))
        tickers <- tickers_data$results %>% as_tibble()
        tickers_info <- bind_rows(tickers_info, tickers)
        next_url <- tickers_data$next_url
        option <- next_url
    }
    return(tickers_info)
}
#require all ADRC ticker symbols
ADRC_tickers <- tickers_supported(type="American Depository Receipt Common")
ADRC_tickers

## # A tibble: 1,100 x 12
##    ticker name     market locale type 
##    <chr>  <chr>    <chr>  <chr>  <chr>
##  1 AAALY  AAREAL ~ otc    us     ADRC 
##  2 AACAY  AAC TEC~ otc    us     ADRC 
##  3 AACG   ATA Cre~ stocks us     ADRC 
##  4 AAGIY  AIA GRO~ otc    us     ADRC 
##  5 AAGRY  ASTRA A~ otc    us     ADRC 
##  6 AARTY  AIRTEL ~ otc    us     ADRC 
##  7 AAVMY  ABN AMR~ otc    us     ADRC 
##  8 AAYYY  AUSTRAL~ otc    us     ADRC 
##  9 ABB    ABB Ltd. stocks us     ADRC 
## 10 ABCM   Abcam p~ stocks us     ADRC 
## # ... with 1,090 more rows, and 7
## #   more variables: active <lgl>,
## #   currency_name <chr>,
## #   composite_figi <chr>,
## #   share_class_figi <chr>,
## #   last_updated_utc <chr>,
## #   primary_exchange <chr>, ...

Market Data Endpoints

Aggregates (Bars)

This endpoint provides aggregate bars for a stock over a given date range in custom time window sizes.

First I define a function called stock_price to query data for one specific ticker and return a tibble with more meaningful column names. User can customize the date range and time window in this function. Then I use function lapply and stock_price to generate a tibble for multiple specified tickers. A wrapper function is provided at the end.

stock_price <- function(ticker, multiplier, timespan, from, to){
    mainURL <- "https://api.polygon.io/v2/aggs"
    apikey <- "?limit=1000"
    ticker_sym <- paste0("/ticker/", ticker)
    range <- paste0("/range/", multiplier,"/", timespan)
    from <- paste0("/", from)
    to <- paste0("/", to)
    stock_raw <- GET(paste0(mainURL, ticker_sym, range, from, to, apikey))
    stock_data <- fromJSON(rawToChar(stock_raw$content))
    stock <- stock_data$results %>% as_tibble() %>% 
        rename(close=c, highest=h, lowest=l, open=o, volume=v, 
               volume_weighted_average_price=vw) %>% 
        mutate(ticker=ticker, date=as.Date(as.POSIXct(t/1000, origin = "1970-01-01"))) %>% 
        select(ticker, date, close, highest, lowest, open, volume)
    return(stock)
}

# check to query one ticker data
AAPL_price <- stock_price("AAPL", "1", "day", "2021-06-01", "2022-06-01")
AAPL_price

## # A tibble: 254 x 7
##    ticker date       close highest
##    <chr>  <date>     <dbl>   <dbl>
##  1 AAPL   2021-06-01  124.    125.
##  2 AAPL   2021-06-02  125.    125.
##  3 AAPL   2021-06-03  124.    125.
##  4 AAPL   2021-06-04  126.    126.
##  5 AAPL   2021-06-07  126.    126.
##  6 AAPL   2021-06-08  127.    128.
##  7 AAPL   2021-06-09  127.    128.
##  8 AAPL   2021-06-10  126.    128.
##  9 AAPL   2021-06-11  127.    127.
## 10 AAPL   2021-06-14  130.    131.
## # ... with 244 more rows, and 3 more
## #   variables: lowest <dbl>,
## #   open <dbl>, volume <dbl>

# generate tibble about multiple companies
tickers_interest=c("AAPL","ZM","GOOGL","TSLA")
stock_data <- lapply(X=tickers_interest, FUN=stock_price, multiplier="1",
                         timespan="day",from="2021-06-01", to="2022-06-01") %>%
    reduce(bind_rows)
stock_data

## # A tibble: 1,016 x 7
##    ticker date       close highest
##    <chr>  <date>     <dbl>   <dbl>
##  1 AAPL   2021-06-01  124.    125.
##  2 AAPL   2021-06-02  125.    125.
##  3 AAPL   2021-06-03  124.    125.
##  4 AAPL   2021-06-04  126.    126.
##  5 AAPL   2021-06-07  126.    126.
##  6 AAPL   2021-06-08  127.    128.
##  7 AAPL   2021-06-09  127.    128.
##  8 AAPL   2021-06-10  126.    128.
##  9 AAPL   2021-06-11  127.    127.
## 10 AAPL   2021-06-14  130.    131.
## # ... with 1,006 more rows, and 3
## #   more variables: lowest <dbl>,
## #   open <dbl>, volume <dbl>

Wrapper function to get the same data.

# wrapper function
get_price <- function(tickers, num, span, start, end){
    stock_data <- lapply(X=tickers, FUN=stock_price, multiplier=num,
                         timespan=span, from=start, to=end) %>% reduce(bind_rows)
    return(stock_data)
}

#get_price(tickers=c("AAPL"), num="2",span="month",start="2021-01-01", end="2022-06-01" )

Grouped Daily (Bars)

This endpoint provides the daily open, high, low, and close price information for the entire stocks/equities markets.

I am going to define a function named one_day to connect the API and return a tibble. Users can specify the date through this function.

one_day <- function(date){
    mainURL <- "https://api.polygon.io/v2/aggs/grouped/locale/us/market/stocks/"
    apikey <- "?adjusted=true"
    date_raw <- GET(paste0(mainURL, date, apikey))
    date_data <- fromJSON(rawToChar(date_raw$content))
    date_info <- date_data$results %>% as_tibble() %>% rename( close=c,
                      highest=h, lowest=l, ransactions=n, open=o, volume=v,) %>%
                     select(T, open, close, highest, lowest, volume)
    return(date_info)
}
#check entire market on 2022-06-21
Jun21 <- one_day(date="2022-06-21")

## Error in `chr_as_locations()`:
## ! Can't rename columns that don't exist.
## x Column `c` doesn't exist.

Jun21

## # A tibble: 11,159 x 6
##    T      open  close highest lowest
##    <chr> <dbl>  <dbl>   <dbl>  <dbl>
##  1 HEGD  17.0  17.1    17.1    17.0 
##  2 BAMI  16.8  17.0    17.0    16.8 
##  3 SYFpA 18.6  18.5    18.8    18.4 
##  4 FAZ   29.6  29.8    30.2    29.3 
##  5 BWG    8.5   8.53    8.57    8.43
##  6 DFSV  22.7  22.8    22.9    22.4 
##  7 MSN    0.63  0.639   0.642   0.62
##  8 HAL   32.7  33.6    34.0    32.4 
##  9 TWLO  85.1  85.2    87.9    84.3 
## 10 PDOT   9.78  9.8     9.8     9.78
## # ... with 11,149 more rows, and 1
## #   more variable: volume <dbl>

EDA

I want to explore daily stock price changes for companies Apply, Google, Tesla, and Zoom from 2001-06-01 to 2022-06-01. So I will use tibble stock_data gained in the last session.

I am going to perform basic EDA on close price, returns, and moving price to explore some interesting trends and relationships.

Analysis on Close Price

Let’s find some trends of the daily close price.

g <- ggplot(stock_data, aes(x=date, y=close, color=ticker))
g + geom_line() +
    labs(x="Date", y="Close Price", title="Series of Daily Close Price") +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_date(date_labels = "%b %y", date_breaks="2 months") 

We can find: Google’s stock is much more expensive than the others’; ZM’s close price continued to drop down; GOOGL and TSLA’s close price waved a lot in the past year; AAPL’s price appears to not deviate much in the plot;

Because of the large price difference, I am not sure whether the stability of AAPL is ture.

In addition, I am going to get some descriptive statistics for the close price movement. Compare their standard deviation and IQR to check their volatility.

summary_table <- stock_data %>% group_by(ticker) %>% summarise(q1=quantile(close, 0.25), mean=mean(close), median=median(close), q3=quantile(close, 0.75), sd=sd(close), IQR=IQR(close))
summary_table

## # A tibble: 4 x 7
##   ticker    q1  mean median    q3
##   <chr>  <dbl> <dbl>  <dbl> <dbl>
## 1 AAPL    146.  155.   152.  167.
## 2 GOOGL  2539. 2682.  2725. 2843.
## 3 TSLA    711.  859.   842. 1009.
## 4 ZM      123.  225.   207.  315.
## # ... with 2 more variables:
## #   sd <dbl>, IQR <dbl>

In this table we can find that the standard deviation and IQR of AAPL is indeed much lower than the others’. The mean and median of GOOGL is indeed much higher than the others.

We can also check the spread by the following box plot.

g <- ggplot(stock_data, aes(x=ticker, y=close))
g + geom_boxplot(aes(color=ticker)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(y="close price", title="Box Plot for Close Price Across Tickers") 

The boxes of GOOGL and TSLA are similar, while ZM’s box is smaller. Meanwhile, the box of AAPL is like a line in the plot, which indicates that close price of AAPL is very stable.

Analysis on Simple Returns

To solve the problem caused by the high price difference, we will analyze the simple returns instead of close prices in this part.

First, I am going to define a function to calculate returns and create a new tibble named return_data.

get_returns <- function(data_set){
    returns <- function(x){
        n=length(x)
        returns <- c(NA, (x[2:n]-x[1:(n-1)])/(x[1:(n-1)]))
        return(returns)
    }
    new_data <- data_set %>% group_by(ticker) %>% mutate(return=returns(close)) %>% 
        select(ticker, date, close, return, volume) %>% ungroup()
    return(new_data)
}
#get new tibble
return_data <- get_returns(stock_data)
return_data

## # A tibble: 1,016 x 5
##    ticker date       close     return
##    <chr>  <date>     <dbl>      <dbl>
##  1 AAPL   2021-06-01  124. NA        
##  2 AAPL   2021-06-02  125.  0.00628  
##  3 AAPL   2021-06-03  124. -0.0122   
##  4 AAPL   2021-06-04  126.  0.0190   
##  5 AAPL   2021-06-07  126.  0.0000794
##  6 AAPL   2021-06-08  127.  0.00667  
##  7 AAPL   2021-06-09  127.  0.00308  
##  8 AAPL   2021-06-10  126. -0.00802  
##  9 AAPL   2021-06-11  127.  0.00983  
## 10 AAPL   2021-06-14  130.  0.0246   
## # ... with 1,006 more rows, and 1
## #   more variable: volume <dbl>

Now let’s calculate the descriptive statistics of returns.

return_table <- return_data %>% group_by(ticker) %>% summarise(q1=quantile(return, 0.25, na.rm=TRUE), mean=mean(return, na.rm=TRUE), median=median(return, na.rm=TRUE), q3=quantile(return, 0.75, na.rm=TRUE), sd=sd(return, na.rm=TRUE), IQR=IQR(return, na.rm=TRUE))
return_table

## # A tibble: 4 x 7
##   ticker       q1       mean    median
##   <chr>     <dbl>      <dbl>     <dbl>
## 1 AAPL   -0.00910  0.000866   0.00114 
## 2 GOOGL  -0.00922 -0.0000121  0.000963
## 3 TSLA   -0.0174   0.00136    0.00201 
## 4 ZM     -0.0244  -0.00375   -0.00476 
## # ... with 3 more variables:
## #   q3 <dbl>, sd <dbl>, IQR <dbl>

I want visulize this table in the followin gplot.

g <- ggplot(return_table, aes(x=ticker))
g + geom_bar(aes(y=mean), stat="identity", fill="dark blue")+
    labs(y="mean return", title="Mean of the Daily Returns")

The mean return of GOOGL over the past year is so close to 0, while it is negative for ZM and positive for AAPL and TSLA.

g <- ggplot(return_table, aes(x=ticker))
g + geom_bar(aes(y=sd), stat="identity", fill="dark blue")+
    labs(y="return's sd", title="Standard Deviation of the Daily Returns")

Which is very surprising to me is that ZM has a relatively high deviation and GOOGL has a relative low deviation in terms of returns.

We also check the spreadin the box plot.

g <- ggplot(return_data, aes(x=ticker, y=return))
g + geom_boxplot(aes(color=ticker)) +
    theme(plot.title = element_text(hjust = 0.5)) +
    labs(y="return", title="Box Plot for Returns Across Tickers") 

From this plot we can see clearly that the medians are all pretty close to zero, and the variations are similar between AAPL and GOOGL, TSLA and ZM separately. There seems like be more far away outliers in TSLA and ZM than in AAPL and GOOGL. Two much lower values in the ZM may account for some part of its negative mean return.

Now I want to cut returns into different levels and check how many records fallen down within each level for each ticker.

new <- return_data %>% mutate(rlev=cut(return, breaks=5, labels=c("very bad", "bad", "close 0", "good", "very good")))
table(new$ticker, new$rlev)

##        
##         very bad bad close 0 good
##   AAPL         0   3     198   52
##   GOOGL        0   2     209   41
##   TSLA         3  24     141   78
##   ZM           2  21     169   55
##        
##         very good
##   AAPL          0
##   GOOGL         1
##   TSLA          7
##   ZM            6

From this contingency table, we can find that there are both some very bad and very good values in TSLA and ZM, which indicates the large daily changes exist.

Let’s check their histogram.

g <- ggplot(return_data, aes(x=return))
g + geom_histogram( fill="darkblue") +
    labs(y="Daily Return", title="Histogram of Daily Return") +
    theme(plot.title = element_text(hjust = 0.5)) +
    facet_wrap(~ticker)

As expected, lots of activity appear in the middle, however the plots spread out to the left and right far for TSLA and ZM. These two plots have fat tails as indicated in the contingency table.

Now let’s check the trend of the return.

g <- ggplot(return_data, aes(x=date, y=return))
g + geom_line(aes(color=ticker)) +
    labs(y="Daily Return", title="Series of Daily Return") +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_date(date_labels = "%b %y", date_breaks="2 months")

We can find some large events, though this is not a very neat plot. I am going to make subplot by ticker to give more clear comparison.

g <- ggplot(return_data, aes(x=date, y=return))
g + geom_line(color="dark blue") +
    labs(y="Daily Return", title="Series of Daily Return 2") +
    facet_wrap(~ticker) +
    theme(plot.title = element_text(hjust = 0.5)) +
    scale_x_date(date_labels = "%b %y", date_breaks="2 months")

From this plot, we can see that the daily returns jumps around a lot as we might expect. They basically cluster around zero. There are some unpredictable variation, especially some really large events in TSLA and ZM.

To my surprise, GOOGL is much more stable here than in series of daily closing price plot.

Analysis on Simple Moving Average

A q-day moving average is, for a series $x_t$ and a point in time t, the average of the past q days: that is, if $MA^q_t$ denotes a moving average process, then: $MA_t^q=\frac{1}{q}\sum{x_{t-1}}$

This indicator is interesting because it helps us to identify trends and smooths noises from prices. That is, the bigger the days window for the moving average calculation, smaller is the MA responsiveness to price variation. Now let’s calculate two moving averages for the stock prices series, one with 10 days window and the other with 30 days:

get_moving_mean <- function(data_set,window){
    moving_mean <- function(x,window){
        moving_mean <- rollmean(x, k=window, fill=list(NA, NULL, NA), align="right")
        return(moving_mean)
    }
    new_data <- data_set %>% group_by(ticker) %>%
        mutate(mean10=moving_mean(close,10),mean30=moving_mean(close,30))
    return(new_data)
}
#test new function
moving_mean_data <- stock_data %>% get_moving_mean(10) %>%  get_moving_mean(30) %>% select(ticker, date, close, mean10, mean30)
moving_mean_data

## # A tibble: 1,016 x 5
## # Groups:   ticker [4]
##    ticker date       close mean10
##    <chr>  <date>     <dbl>  <dbl>
##  1 AAPL   2021-06-01  124.    NA 
##  2 AAPL   2021-06-02  125.    NA 
##  3 AAPL   2021-06-03  124.    NA 
##  4 AAPL   2021-06-04  126.    NA 
##  5 AAPL   2021-06-07  126.    NA 
##  6 AAPL   2021-06-08  127.    NA 
##  7 AAPL   2021-06-09  127.    NA 
##  8 AAPL   2021-06-10  126.    NA 
##  9 AAPL   2021-06-11  127.    NA 
## 10 AAPL   2021-06-14  130.   126.
## # ... with 1,006 more rows, and 1
## #   more variable: mean30 <dbl>

We calculated the two moving average using 10 and 30 days of windows, filling the values with NA and using the periods in the left. Afterwards, we can plot both series in the same graphic of prices to identify trends.

g <- ggplot(moving_mean_data, aes(x=date))
g + geom_line(aes(y=close, color="close")) +
    geom_line(aes(y=mean10, color="mean10")) +
    geom_line(aes(y=mean30, color="mean30")) +
    labs(x="Date", y="Price", title="Series of Close Price And Moving Mean") +
    theme(plot.title = element_text(hjust = 0.5)) +
    facet_wrap(~ticker, scales="free")

    scale_x_date(date_labels = "%b %y", date_breaks="2 months")

## <ScaleContinuousDate>
##  Range:  
##  Limits:    0 --    1

As what we expected, the line for mean30 is the most smooth one.

From this plot we can see:

The moving means and close price of ZM continued to fall.
AAPL had a big increase from Oct/2021 to Apr/2022.
In 2021, the moving means of TSLA went up, then it started to wave in a large range.
From April 2022, all these four tickers started to fall down.

Relationships

relationship between volume and returns

I guess there would be some relationship between volume and returns.

g <- ggplot(return_data, aes(x=return, y=volume))
g + geom_point(aes(color=ticker)) 

We can see that days with larger absolute return values are more likely to have larger volume. But larger volumes don’t mean larger absolute return values.

correlation between tickers’ returns

#change tibble to wider tibble with ticker code as column names
new_wider <- return_data %>% select(ticker, date, return) %>% pivot_wider(names_from=ticker, values_from=return)
#calculate correlation between tickers' returns
GGally::ggpairs(new_wider%>%select(-date))

What is very interesting is that these four tickers’ returns have pair-wise positive correlation. Maybe this is because they are all top tech company ? Or this is a normal phenomenon in the past year?

The correlation between AAPL and GOOGL is as high as 0.7.

g <- ggplot(new_wider, aes(x=AAPL, y=GOOGL))
g + geom_point(position="jitter") +
    geom_smooth(method=lm, colors="blue") +
    geom_text(x=0.03, y=-0.025, size=5, label="Correlation = 0.704")

The linear regression line seems fit the data pretty well.