Commit 0339ef92 authored by Markus Mößler

added solution with some notes

parent e2cd3b71
---
output: html_document
editor_options: 
  chunk_output_type: console
---

# Solution SQL Exercises

## 0. Prepare script

```{r}
rm(list=ls())
options(stringsAsFactors = FALSE)
library("RPostgreSQL")
```

## 1. Connect to database

### Define access credentials

```{r}
dsn_database <- "aidaho"   # Specify the name of your Database
dsn_hostname <- "193.196.53.49"  # localhost = 127.0.0.1
dsn_port <- "8001"                # Specify your port number. e.g. 98939
dsn_uid <- "student"         # Specify your username. e.g. "admin"
dsn_pwd <- "aidaho"        # Specify your password. e.g. "xxx"
```

##### Notes

1) Use the database credentials given on the exercise sheet
2) Alternatively, use the credentials of the database user created by the `init.sh` file

Note also:

* If you work on the remote database provided by Johannes, you cannot view other database users or roles; for that you would have to log in as a superuser, i.e. you need Johannes' superuser credentials.
* If you run the database locally using the Docker Compose file provided on the exercise sheet, you can set your own superuser credentials, with which you can also view the created student user role.

### Establish connection

```{r}
tryCatch({
    drv <- dbDriver("PostgreSQL")
    print("Connecting to Database…")
    connect <- dbConnect(drv, 
                         dbname = dsn_database,
                         host = dsn_hostname, 
                         port = dsn_port,
                         user = dsn_uid, 
                         password = dsn_pwd)
    print("Database Connected!")
},
error=function(cond) {
    print("Unable to connect to Database.")
}
)
```

##### Notes

* Use the `dbDriver` function of the `DBI` package to create a new `PostgreSQLDriver` driver object provided by the `RPostgreSQL` package.
* Use the `dbConnect` function of the `DBI` package to connect to the database.

### Check connection

```{r}
# Check Connection
res <- dbSendQuery(connect,"SELECT version();")
dbFetch(res, n = -1)
```

##### Note

* Use a simple hello-world query with the `dbFetch` function of the `DBI` package to check the connection.
* Note that `n = -1` tells `dbFetch` to return all remaining rows rather than capping the row count.
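As a convenience, `DBI` also provides `dbGetQuery()`, which sends the query, fetches all rows, and clears the result set in one call; a sketch using the `connect` object from above:

```{r}
# One-step equivalent of dbSendQuery() + dbFetch(): sends the query,
# fetches all rows, and frees the result set on the server.
dbGetQuery(connect, "SELECT version();")
```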

## 2. Get an overview of the database

### Queries

```{r}
res <- dbSendQuery(connect,"SELECT * FROM iex.trade_reports LIMIT 10;")
dbFetch(res, n = -1)
```

```{r}
res <- dbSendQuery(connect,"SELECT column_name, data_type 
    FROM information_schema.columns 
    WHERE table_schema = 'iex';")
dbFetch(res, n = -1)
```

### Question 1

What do the above queries return?

* The first query returns the first 10 rows of the table `iex.trade_reports`.
* The second query returns the names and data types of all columns of the tables in the `iex` schema.

### Question 2

What other tables does the `information_schema` contain?

```{r}
res <- dbSendQuery(connect, "
  SELECT table_name 
  FROM information_schema.tables 
  WHERE table_schema = 'information_schema';
")
dbFetch(res, n = -1)
```

### Question 3

What information do the columns of `iex.trade_reports` contain?

```{r}
res <- dbSendQuery(connect,"SELECT column_name, data_type 
    FROM information_schema.columns 
   WHERE table_schema = 'iex' AND table_name = 'trade_reports';")
dbFetch(res, n = -1)
```

* `ordinal`: Ordinal number that identifies the timestamp
* `timestamp`: The timestamp of the trade with up to 6-digit (microsecond) precision
* `flags`: The trade flag as used by the IEX
* `symbol`: The stock ticker
* `size`: The size of the transaction (how many shares have been transacted)
* `price`: The price of the trade
* `trade_id`: ID number that identifies the transaction

### Question 4

Does a primary key exist in the table?

```{r, cache=TRUE}
res <- dbSendQuery(connect,"SELECT column_name, is_nullable  
    FROM information_schema.columns 
   WHERE table_schema = 'iex' AND table_name = 'trade_reports';")
fetch <- dbFetch(res, n = -1)

non_null <- fetch[which(fetch$is_nullable == "NO"),]

for (ii in 1:nrow(non_null)) {
  query <- paste0(
    "SELECT COUNT(*) AS total_rows,
      COUNT(DISTINCT ", 
      non_null[ii, 1],
      ") AS unique_values
    FROM iex.trade_reports;
  ")
  res <- dbSendQuery(connect, query)
  print(non_null[ii, 1])
  print(dbFetch(res, n = -1))
}
```

##### Note

* The information schema itself is a schema named `information_schema`. This schema automatically exists in all databases [see PostgreSQL Documentation](https://www.postgresql.org/docs/current/infoschema-schema.html)
* The full hierarchy of the PostgreSQL database system is: cluster, database, schema, table (or some other kind of object, such as a function) [see PostgreSQL Documentation](https://www.postgresql.org/docs/17/manage-ag-overview.html).
* All columns that have `is_nullable = NO` in `information_schema.columns` potentially belong to the primary key.
* The column `ordinal` is already not null and unique and thus itself a primary key candidate.
* [see PostgreSQL Documentation](https://www.postgresql.org/docs/17/sql-createtable.html)
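A declared primary key (if any) can also be looked up directly via the standard `information_schema` views; a sketch, assuming the `connect` object from above (it returns zero rows if no `PRIMARY KEY` constraint is defined):

```{r}
# List the columns of any declared primary key on iex.trade_reports.
res <- dbSendQuery(connect, "
  SELECT tc.constraint_name, kcu.column_name
  FROM information_schema.table_constraints tc
  JOIN information_schema.key_column_usage kcu
    ON tc.constraint_name = kcu.constraint_name
   AND tc.table_schema = kcu.table_schema
  WHERE tc.constraint_type = 'PRIMARY KEY'
    AND tc.table_schema = 'iex'
    AND tc.table_name = 'trade_reports';")
dbFetch(res, n = -1)
```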

## 3. Short Queries

### Query 1

How many distinct symbols does the table contain?

```{r}
res <- dbSendQuery(connect,"SELECT COUNT(DISTINCT symbol) from iex.trade_reports;")
dbFetch(res, n = -1)
```

### Query 2

How many different financial instruments (symbols) have been traded between `2022-01-24 10:00:00-05` and `2022-01-24 11:00:00-05`?

```{r}
res <- dbSendQuery(connect,"SELECT COUNT(DISTINCT symbol) 
                            FROM iex.trade_reports 
                            WHERE timestamp BETWEEN '2022-01-24 10:00:00-05' AND '2022-01-24 11:00:00-05';")
dbFetch(res, n = -1)
```

### Query 3

How many trades of `AAPL` took place between 10h00 and 11h00 on 2022-01-24?

```{r}
res <- dbSendQuery(connect,"SELECT COUNT(DISTINCT trade_id) 
                            FROM iex.trade_reports 
                            WHERE timestamp BETWEEN '2022-01-24 10:00:00-05' AND '2022-01-24 11:00:00-05'
                            AND symbol = 'AAPL';")
dbFetch(res, n = -1)
```

### Query 4

Calculate the average price for each symbol in the sample.

```{r}
res <- dbSendQuery(connect,"SELECT symbol,AVG(price) 
                            FROM iex.trade_reports 
                            GROUP BY symbol;")
head(dbFetch(res, n = -1), 10)
```

### Query 5

Which symbol has the highest average price?

```{r}
res <- dbSendQuery(connect,"SELECT symbol, AVG(price) 
                            FROM iex.trade_reports 
                            GROUP BY symbol 
                            ORDER BY AVG(price) DESC LIMIT 1;")
dbFetch(res, n = -1)
```

### Query 6

How many symbols have an average price above 1000 USD?

```{r}
res <- dbSendQuery(connect,"SELECT symbol, AVG(price) 
                            FROM iex.trade_reports 
                            GROUP BY symbol 
                            HAVING AVG(price) > 1000;")
dbFetch(res, n = -1)
```
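Since the question asks for a count rather than a list, the grouped query can also be wrapped in a subquery whose rows are counted; a sketch:

```{r}
# Count the symbols whose average price exceeds 1000 USD.
res <- dbSendQuery(connect, "
  SELECT COUNT(*) AS n_symbols
  FROM (SELECT symbol
        FROM iex.trade_reports
        GROUP BY symbol
        HAVING AVG(price) > 1000) AS expensive;")
dbFetch(res, n = -1)
```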

## 4. Last price within 5-minute intervals

### Step 1) Construct inner query: Group timestamps into 5-minute intervals

Convert the raw trade timestamps into 5-minute buckets (or any chosen interval).

```{r}
ticker <- "MSFT"
interval <- "5"
innerquery <- paste0("SELECT TO_TIMESTAMP(
    FLOOR(
        EXTRACT(epoch FROM timestamp) / 
            EXTRACT(epoch FROM INTERVAL '",interval," min')
        ) * EXTRACT(epoch FROM INTERVAL '",interval," min')
    ) as time_interval,
    * 
        FROM iex.trade_reports 
    WHERE symbol = '",ticker,"'
    ORDER BY timestamp")

# test innerquery
res <- dbSendQuery(connect,paste0(innerquery," LIMIT 20"))
dbFetch(res, n = -1)
```

##### Note

* Groups all timestamps into `interval`-minute bins (like 5, 10, 15...).
* Adds a new column `time_interval` for each row.
* Filters by the `ticker` (e.g., AAPL, MSFT).
* Prepares for further aggregation.
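The bucketing arithmetic can be reproduced in plain R to see what the SQL does; a minimal sketch with a hypothetical timestamp:

```{r}
# Same flooring as in the SQL: divide the epoch seconds by the interval
# length (300 s = 5 min), round down, and multiply back.
ts <- as.POSIXct("2022-01-24 10:07:31", tz = "UTC")
epoch <- as.numeric(ts)              # seconds since 1970-01-01
bucket <- floor(epoch / 300) * 300   # start of the containing 5-min bucket
as.POSIXct(bucket, origin = "1970-01-01", tz = "UTC")  # 2022-01-24 10:05:00
```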

### Step 2) Construct mezzanine query: Assign a row number within each interval

Use `ROW_NUMBER()` to rank trades within each interval by `timestamp DESC`, so the most recent trade per interval gets row number 1.

```{r}
mezzaninequery <- paste0("SELECT ",
                         "ROW_NUMBER() OVER (PARTITION BY time_interval ORDER BY timestamp DESC, ordinal DESC) as rownumber, ",
                         "* ",
                         "FROM ",
                         "(",innerquery,") as iq")
# test mezzaninequery
res <- dbSendQuery(connect,paste0(mezzaninequery," LIMIT 20"))
dbFetch(res, n = -1)
```

##### Note

* Within each time interval (5-min group), orders the trades from most recent to oldest, using `ordinal` as a tie-breaker to account for non-unique timestamps.
* Assigns a `rownumber = 1` to the last trade in each interval (i.e., the one with the latest timestamp).

### Step 3) Construct outer query: Filter to the latest trade per interval

Filter the mezzanine query to keep only the rows where `rownumber = 1`, i.e., the latest trade in each interval.

```{r}
outerquery <- paste0("SELECT * ",
                     "FROM ",
                     "(",mezzaninequery,") as mq ",
                     "WHERE rownumber=1 ",
                     "ORDER BY time_interval")

# test outerquery (no limit)
res <- dbSendQuery(connect,paste0(outerquery))
dbFetch(res, n = -1)
```
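As an aside, PostgreSQL's `DISTINCT ON` can collapse the `ROW_NUMBER()` + filter pattern into one step; a sketch reusing the `innerquery` from Step 1:

```{r}
# DISTINCT ON (time_interval) keeps only the first row per interval after
# sorting, i.e. the latest trade (ties broken by ordinal).
distinct_query <- paste0("SELECT DISTINCT ON (time_interval) * ",
                         "FROM (",innerquery,") as iq ",
                         "ORDER BY time_interval, timestamp DESC, ordinal DESC")
res <- dbSendQuery(connect,paste0(distinct_query," LIMIT 20"))
dbFetch(res, n = -1)
```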

### Step 4) Construct a function for abstraction

Create a reusable function that builds the entire query for any interval and ticker.

```{r}
get_outerquery <- function(interval,ticker){
    innerquery <- paste0("SELECT TO_TIMESTAMP(
    FLOOR(
        EXTRACT(epoch FROM timestamp) / 
            EXTRACT(epoch FROM INTERVAL '",interval," min')
        ) * EXTRACT(epoch FROM INTERVAL '",interval," min')
    ) as time_interval,
    * 
        FROM iex.trade_reports 
    WHERE symbol = '",ticker,"'
    ORDER BY timestamp")

    mezzaninequery <- paste0("SELECT ",
                             "row_number() OVER (PARTITION BY time_interval ORDER BY timestamp DESC, ordinal DESC) as rownumber, ",
                             "* ",
                             "FROM ",
                             "(",innerquery,") as iq")
    outerquery <- paste0("SELECT * ",
                         "FROM ",
                         "(",mezzaninequery,") as mq ",
                         "WHERE rownumber=1")
    return(outerquery)
}
```

### Step 5) Test function and send query

Run Query and Fetch Results.

```{r}
# test outerquery (no limit)
outerquery <- get_outerquery(interval="5",ticker="AAPL")
res <- dbSendQuery(connect,outerquery)
dbFetch(res, n = -1)
```

#### Note

* This sends the query to the database and returns the result - one row per interval, with the most recent trade price.

## 5. Merging

### 5.i Inner join

Construct an inner join query between AAPL and MSFT’s price data (both at 5-minute intervals), keeping only:

* The shared time intervals between them,
* Their respective symbols and prices.
* [see W3 School SQL](https://www.w3schools.com/sql/sql_join_inner.asp)

#### Step 1) Construct outer query for `AAPL` and `MSFT`.

Get two SQL subqueries (`oq.AAPL`, `oq.MSFT`) that retrieve 5-minute price data for `AAPL` and `MSFT`.

```{r}
oq.AAPL <- get_outerquery(interval="5",ticker="AAPL")
oq.MSFT <- get_outerquery(interval="5",ticker="MSFT")
```

#### Step 2) Construct join statement.

```{r}
join_statement <- paste0("SELECT a.time_interval as ati,
                                   b.time_interval as bti,
                                   a.symbol as symbol_a,
                                   b.symbol as symbol_b,
                                   a.price as price_a,
                                   b.price as price_b 
                            FROM ",
                           "(",oq.AAPL,") as a ",
                           " INNER JOIN ",
                           "(",oq.MSFT,") as b ",
                           "ON a.time_interval = b.time_interval;")
```

##### Note

* Joins the AAPL and MSFT subqueries only where their `time_intervals` match.
* Selects:
  * `a.time_interval` and `b.time_interval` (they should be identical — this is mostly for verification),
  * `symbol` and `price` from both.

#### Step 3) Send query

```{r}
# send join statement
res <- dbSendQuery(connect,join_statement)
dbFetch(res, n = -1)
```

### 5.ii Left join

* Include all regular time intervals (e.g., 5-minute marks) between the first and last timestamps in the Apple data, and join the corresponding Microsoft price data wherever MSFT has a price at that exact timestamp.
* [see W3 School SQL](https://www.w3schools.com/sql/sql_join_left.asp)

#### Step 1) Construct `minmax_time`

Find the earliest and latest timestamps in the dataset for AAPL.

```{r}
interval <- "5"

# Get the minimum and maximum time_interval
minmax_time_str <- paste0("SELECT min(time_interval),max(time_interval) from (",oq.AAPL,") as a;")
res <- dbSendQuery(connect,minmax_time_str)
minmax_time <- dbFetch(res, n = -1)
```

##### Note

* The SQL query gets the `MIN()` and `MAX()` of `time_interval`, here from the AAPL series, to determine the time range you'll need.
* This is necessary because `generate_series()` needs a start and end time to create intervals.

#### Step 2) Construct `timeseriesquery`

Generate a series of timestamps (e.g., every 1 or 5 minutes) between the min and max from Step 1.

```{r}
timeseriesquery <- paste0("SELECT generate_series('",format(minmax_time$min, tz = "UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",
                     format(minmax_time$max, tz = "UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",interval,"m') as time_interval")
res <- dbSendQuery(connect,timeseriesquery)
dbFetch(res, n = -1)
```

##### Note

* This uses PostgreSQL's `generate_series()` function.
* This result is treated as a table of time intervals that you'll join with the real price data.
* This generates evenly spaced time intervals (e.g., every 5 minutes) from the min to max time.
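R's `seq()` builds the same kind of regular grid, which is handy for checking what `generate_series()` should return; a minimal sketch with hypothetical endpoints:

```{r}
# R analogue of generate_series(): a regular 5-minute grid of timestamps.
start <- as.POSIXct("2022-01-24 10:00:00", tz = "UTC")
end   <- as.POSIXct("2022-01-24 10:20:00", tz = "UTC")
grid  <- seq(start, end, by = "5 min")
grid  # 10:00, 10:05, 10:10, 10:15, 10:20
```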

#### Step 3) Construct the `left_join_statement`.

Construct a query for a left join between the `timeseriesquery` and the outer query for MSFT, keeping the time interval, symbol and price.

```{r}
left_join_statement <- paste0("SELECT a.time_interval,
                                   b.symbol,
                                   b.price 
                               FROM ",
                                "(",timeseriesquery,") as a ",
                               " LEFT JOIN ",
                               "(",oq.MSFT,") as b ",
                               "ON a.time_interval = b.time_interval;")

res <- dbSendQuery(connect,left_join_statement)
dbFetch(res, n = -1)
```

##### Note

* `a`: The generated time series (from Step 2).
* `b`: Real stock data (e.g., MSFT).
* Ensures all time intervals are preserved - even if MSFT has no price at some times.

#### Step 4) Construct a function for abstraction

Construct an *R* function that determines the `minmax_time` variable based on the inputted symbol and interval length and returns the string for the left join query.

```{r}
get_Xmin_prices <- function(interval,ticker){
    oq <- get_outerquery(interval=interval,ticker=ticker)
    # Get the minimum and maximum time_interval
    minmax_time_str <- paste0("SELECT min(time_interval),max(time_interval) from (",oq,") as a;")
    res <- dbSendQuery(connect,minmax_time_str)
    minmax_time <- dbFetch(res, n = -1)
    
    timeseriesquery <- paste0("SELECT generate_series('",format(minmax_time$min, tz = "UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",
                              format(minmax_time$max, tz = "UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",interval,"m') as time_interval")
    
    
    left_join_statement <- paste0("SELECT a.time_interval,
                                   b.symbol,
                                   b.price 
                               FROM ",
                                  "(",timeseriesquery,") as a ",
                                  " LEFT JOIN ",
                                  "(",oq,") as b ",
                                  "ON a.time_interval = b.time_interval")
    return(left_join_statement)
}
```

##### Note

* Generalize everything into a function that takes any:
  * `interval` (e.g., "1" for 1-minute)
  * `ticker` (e.g., "GME")
* And returns a valid SQL left join query string.

#### Step 5) Test function and send query

Call the function for the symbol GME and an interval length of 1 minute. Submit the resulting query. What do you observe?

```{r}
# test function and send query
test1 <- get_Xmin_prices(interval="1",ticker ="GME")
res <- dbSendQuery(connect,test1)
dbFetch(res, n = -1)
```

## 6. Fill Gaps

### Step 1) Construct first query

```{r}
query1 <- get_Xmin_prices(interval="1",ticker ="GME")
```

##### Note

* Get regular 1 minute time intervals for the stock GameStop Corp. with potential missing values for symbol and price.
* A complete list of timestamps, with `symbol` and `price` possibly containing NULLs where there was no trade.
* This is important because we now have all time intervals, including the "gaps".

### Step 2) Construct the second query

```{r}
query2 <-  paste0("SELECT count(price) OVER (PARTITION BY 1 ORDER BY time_interval) AS count_prices, *
                    FROM (",query1," ) as q1")
```

##### Note

* This query adds a `count_prices` column.
* `count(price)` in a window function ignores NULLs, so it only increments when a price is present.
* So all rows between two non-null prices will share the same count.
* Note also
  * `PARTITION BY 1`: Means no actual partitioning; all rows are in the same partition (i.e., entire dataset).
  * `ORDER BY time_interval`: The cumulative count moves forward in time.

### Step 3) Construct the final query

```{r}
res_query <- paste0("SELECT count_prices,time_interval,symbol,price, ",
       "first_value(price) OVER part_window AS price_filled ",
       "FROM (",
       query2,
       ") as foo WINDOW part_window AS (PARTITION BY count_prices ORDER BY time_interval)")
```

##### Note

* For each group of rows with the same `count_prices`, we use `FIRST_VALUE(price)` to get the most recent non-null price before or at that point.
* Because `count_prices` only changes when a new non-null price appears, all the NULLs in between are "filled" with the last known value.
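The same gap-filling logic can be mimicked in plain R, which makes the role of `count_prices` easier to see; a minimal sketch with a hypothetical price vector:

```{r}
# cumsum over "price is not NA" reproduces the SQL count_prices column:
# the counter only advances on rows with a price, so each group starts
# with a known price followed by the NAs to be filled.
price <- c(NA, 10, NA, NA, 12, NA)
count_prices <- cumsum(!is.na(price))   # 0 1 1 1 2 2
filled <- ave(price, count_prices, FUN = function(v) v[!is.na(v)][1])
filled  # NA 10 10 10 12 12
```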

### Step 4) Send query

```{r}
res <- dbSendQuery(connect,res_query)
dbFetch(res, n = -1)
```

### Step 5) Construct a function for abstraction

Write a function that returns a modified `query2` in which the symbol and the price are carried forward. Call the function `get_Xmin_prices_no_gaps`.

```{r}
get_Xmin_prices_no_gaps <- function(interval,ticker){
    query1 <- get_Xmin_prices(interval=interval,ticker =ticker)
    
    query2 <-  paste0("SELECT count(price) OVER (PARTITION BY 1 ORDER BY time_interval) AS count_prices, *
                    FROM (",query1," ) as q1")
    
    res_query <- paste0("SELECT count_prices,time_interval,",
                    "first_value(symbol) OVER part_window AS symbol, ",
                    "first_value(price) OVER part_window AS price ",
                    "FROM (",
                    query2,
                    ") as foo WINDOW part_window AS (PARTITION BY count_prices ORDER BY time_interval)")
    return(res_query)
}
```

##### Note

* Produces a clean, gap-free, forward-filled time series of prices for any given ticker and interval.

### Step 6) Test function and send query

```{r}
# Test function and send query
query <- get_Xmin_prices_no_gaps(interval="1",ticker ="GME")
res <- dbSendQuery(connect,query)
dbFetch(res, n = -1)
```

## 7. Calculate logarithmic first differences (log-returns)

### Step 1) Construct a query for the cleaned series

```{r}
clean_query <- get_Xmin_prices_no_gaps(interval="1",ticker ="GME")
```

##### Note

* Calls the `get_Xmin_prices_no_gaps()` function.
* For the `GME` stock and 1-minute intervals.
* Ensures there are no missing time intervals and that prices are carried forward over gaps.
* Returns an SQL string with clean data ready for log-return computation.

### Step 2) Construct a lagged query for the logarithmic first differences of the price series

```{r}
lagged <- paste0("SELECT *, ln(price) - lag(ln(price),1) OVER (ORDER BY time_interval) as log_return FROM ",
       "(",clean_query,") as cq;")
```

##### Note

* `lag()` fetches the previous log price, and the difference is the log return. Note that PostgreSQL's `log()` is the base-10 logarithm; `ln()` gives the natural logarithm conventionally used for log-returns.
* `OVER (ORDER BY time_interval)` ensures the order is by time.
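The same computation can be checked in plain R, where `log()` is the natural logarithm and `diff()` plays the role of the lagged difference; a minimal sketch:

```{r}
# Log-returns as first differences of natural log prices.
p <- c(100, 101, 100)
r <- diff(log(p))
r  # approx. 0.00995 and -0.00995
```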

### Step 3) Send the lagged query.

```{r}
res <- dbSendQuery(connect,lagged)
dbFetch(res, n = -1)
```

### Step 4) Construct a function for abstraction

Write a function that returns the string for a query that based on a ticker symbol and the interval length calculates the log-returns.

```{r}
get_first_differences <- function(interval,ticker){
    
    clean_query <- get_Xmin_prices_no_gaps(interval=interval,ticker =ticker)
    
    lagged <- paste0("SELECT time_interval,symbol, ln(price) - lag(ln(price),1) OVER (ORDER BY time_interval) as log_return FROM ",
                     "(",clean_query,") as cq;")
    return(lagged)
}
```

### Step 5) Test function and send query

```{r}
# Test function and query
test_lagged <- get_first_differences(interval="10",ticker ="TSLA")
res <- dbSendQuery(connect,test_lagged)
dbFetch(res, n = -1)
```