dsn_port <- "8001" # Specify your port number. e.g. "5432"
dsn_uid <- "student" # Specify your username. e.g. "admin"
dsn_pwd <- "aidaho" # Specify your password. e.g. "xxx"
```
##### Notes
1) Use the database credentials on the exercise sheet
2) Use the credentials of the database user created based on the `init.sh` file
Note also:
* If you work on the remote database provided by Johannes, you cannot view other database users or roles. For that you would have to log in as a superuser, i.e. you would need Johannes' superuser credentials.
* If you run the database locally using the Docker Compose file provided in the exercise sheet, you can set your own superuser credentials, with which you can also view the created student user role.
### Establish connection
```{r}
tryCatch({
drv <- dbDriver("PostgreSQL")
print("Connecting to Database…")
connect <- dbConnect(drv,
dbname = dsn_database,
host = dsn_hostname,
port = dsn_port,
user = dsn_uid,
password = dsn_pwd)
print("Database Connected!")
},
error=function(cond) {
print("Unable to connect to Database.")
}
)
```
##### Notes
* Use the `dbDriver` function of the `DBI` package to create a new `PostgreSQLDriver` object provided by the `RPostgreSQL` package.
* Use the `dbConnect` function of the `DBI` package to connect to the database.
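At the end of a session, the connection and the driver should be released again. A minimal sketch (not evaluated here, since the rest of the document still needs the connection):

```{r, eval=FALSE}
dbDisconnect(connect)   # close the database connection
dbUnloadDriver(drv)     # free the PostgreSQL driver object
```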
### Check connection
```{r}
# Check Connection
res <- dbSendQuery(connect,"SELECT version();")
dbFetch(res, n = -1)
```
##### Note
* Use a simple query (here `SELECT version();`) with the `dbSendQuery` and `dbFetch` functions of the `DBI` package to check the connection.
* Note, `n = -1` tells `dbFetch` to return all remaining rows of the result set.
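A result created with `dbSendQuery` should be released with `dbClearResult` once it has been fetched; `dbGetQuery` is a `DBI` convenience wrapper that sends the query, fetches all rows, and clears the result in one call. A minimal sketch:

```{r}
# dbSendQuery results should be cleared once fetched:
res <- dbSendQuery(connect, "SELECT version();")
dbFetch(res, n = -1)
dbClearResult(res)

# dbGetQuery() does all three steps in one call:
dbGetQuery(connect, "SELECT version();")
```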
## 2. Get an overview over the database
### Queries
```{r}
res <- dbSendQuery(connect,"SELECT * FROM iex.trade_reports LIMIT 10;")
dbFetch(res, n = -1)
```
```{r}
res <- dbSendQuery(connect,"SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'iex';")
dbFetch(res, n = -1)
```
### Question 1
What do the above queries return?
* The first query returns the first 10 rows of the table `iex.trade_reports`.
* The second query returns the names and data types of all columns of all tables in the `iex` schema.
### Question 2
What other tables does the `information_schema` contain?
```{r}
res <- dbSendQuery(connect, "
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'information_schema';
")
dbFetch(res, n = -1)
```
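The same catalog view also answers which tables the `iex` schema itself contains (a small sketch along the same lines):

```{r}
res <- dbSendQuery(connect, "
SELECT table_name
FROM information_schema.tables
WHERE table_schema = 'iex';
")
dbFetch(res, n = -1)
```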
### Question 3
What information do the columns of `iex.trade_reports` contain?
```{r}
res <- dbSendQuery(connect,"SELECT column_name, data_type
FROM information_schema.columns
WHERE table_schema = 'iex' AND table_name = 'trade_reports';")
dbFetch(res, n = -1)
```
* `ordinal`: Ordinal number that identifies the observation
* `timestamp`: The timestamp of the trade with up to 6-digit (microsecond) precision
* `flags`: The trade flag as used by IEX
* `symbol`: The stock ticker
* `size`: The size of the transaction, i.e. how many shares were traded
* `price`: The price of the trade
* `trade_id`: ID number that identifies the transaction
### Question 4
Does a primary key exist in the table?
```{r, cache=TRUE}
res <- dbSendQuery(connect,"SELECT column_name, is_nullable
FROM information_schema.columns
WHERE table_schema = 'iex' AND table_name = 'trade_reports';")
dbFetch(res, n = -1)
```
##### Notes
* The information schema itself is a schema named `information_schema`. It automatically exists in all databases [see PostgreSQL Documentation](https://www.postgresql.org/docs/current/infoschema-schema.html).
* The full hierarchy of the PostgreSQL database system is: cluster, database, schema, table (or some other kind of object, such as a function) [see PostgreSQL Documentation](https://www.postgresql.org/docs/17/manage-ag-overview.html).
* All columns that have `is_nullable = 'NO'` in `information_schema.columns` potentially belong to the primary key.
* The column `ordinal` is `NOT NULL` and unique and thus itself a primary key candidate.
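`is_nullable` alone only narrows down the candidates. Whether a primary key constraint is actually declared on the table can be checked against the constraint views of the information schema (a sketch, not part of the original exercise sheet):

```{r}
res <- dbSendQuery(connect, "
SELECT kcu.column_name, tc.constraint_type
FROM information_schema.table_constraints AS tc
JOIN information_schema.key_column_usage AS kcu
  ON tc.constraint_name = kcu.constraint_name
 AND tc.table_schema = kcu.table_schema
WHERE tc.table_schema = 'iex'
  AND tc.table_name = 'trade_reports'
  AND tc.constraint_type = 'PRIMARY KEY';
")
dbFetch(res, n = -1)  # an empty result means no primary key is declared
```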
```{r}
join_statement <- paste0("SELECT a.time_interval as ati,
b.time_interval as bti,
a.symbol as symbol_a,
b.symbol as symbol_b,
a.price as price_a,
b.price as price_b
FROM ",
"(",oq.AAPL,") as a ",
" INNER JOIN ",
"(",oq.MSFT,") as b ",
"ON a.time_interval = b.time_interval;")
```
##### Note
* Joins the AAPL and MSFT subqueries only where their `time_interval` values match.
* Selects:
* `a.time_interval` and `b.time_interval` (they should be identical — this is mostly for verification),
* `symbol` and `price` from both.
#### Step 3) Send query
```{r}
# test outerquery (no limit)
res <- dbSendQuery(connect,join_statement)
dbFetch(res, n = -1)
```
### 5.ii Left join
* Include all regular time intervals (e.g., 5-minute marks) between the first and last timestamps in the Apple data, and join the corresponding Microsoft price data wherever MSFT has a price at that exact timestamp.
* [see W3 School SQL](https://www.w3schools.com/sql/sql_join_left.asp)
#### Step 1) Construct `minmax_time`
Find the earliest and latest timestamps in the dataset for AAPL.
```{r}
interval <- "5"
# Get the minimum and maximum time_interval
minmax_time_str <- paste0("SELECT min(time_interval),max(time_interval) from (",oq.AAPL,") as a;")
res <- dbSendQuery(connect,minmax_time_str)
minmax_time <- dbFetch(res, n = -1)
```
##### Note
* The SQL query gets the `MIN()` and `MAX()` of `time_interval` from the AAPL subquery to determine the time range you'll need.
* This is necessary because `generate_series()` needs a start and end time to create intervals.
#### Step 2) Construct `timeseriesquery`
Generate a series of timestamps (e.g., every 1 or 5 minutes) between the min and max from Step 1.
```{r}
# Extract the scalar min/max with [1,1]/[1,2]; cast both bounds identically
timeseriesquery <- paste0("SELECT generate_series('",format(minmax_time[1,1],tz="UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",
                          format(minmax_time[1,2],tz="UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",interval,"m') as time_interval")
res <- dbSendQuery(connect,timeseriesquery)
dbFetch(res, n = -1)
```
##### Note
* This uses PostgreSQL's `generate_series()` function.
* This result is treated as a table of time intervals that you'll join with the real price data.
* This generates evenly spaced time intervals (e.g., every 5 minutes) from the min to max time.
* Ensures all time intervals are preserved - even if MSFT has no price at some times.
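#### Step 3) Construct and send the left join query
The step that actually joins the generated series with the price data is not shown above; one possible sketch, assuming the MSFT subquery string `oq.MSFT` from section 5.i and the `timeseriesquery` from Step 2:

```{r}
# Left join: keep every generated time interval, attach MSFT prices where present
leftjoin_statement <- paste0("SELECT t.time_interval, b.symbol, b.price FROM (",
                             timeseriesquery, ") as t LEFT JOIN (",
                             oq.MSFT, ") as b ON t.time_interval = b.time_interval;")
res <- dbSendQuery(connect, leftjoin_statement)
dbFetch(res, n = -1)
```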
#### Step 4) Construct a function for abstraction
Construct an *R* function that determines the `minmax_time` variable based on the inputted symbol and interval length and returns the string for the left join query.
```{r}
get_Xmin_prices <- function(interval,ticker){
  oq <- get_outerquery(interval=interval,ticker=ticker)
  # Get the minimum and maximum time_interval
  minmax_time_str <- paste0("SELECT min(time_interval),max(time_interval) from (",oq,") as a;")
  res <- dbSendQuery(connect,minmax_time_str)
  minmax_time <- dbFetch(res, n = -1)
  # Generate the evenly spaced time series between min and max
  timeseriesquery <- paste0("SELECT generate_series('",format(minmax_time[1,1],tz="UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",
                            format(minmax_time[1,2],tz="UTC"),"'::TIMESTAMP AT TIME ZONE 'UTC','",interval,"m') as time_interval")
  # Return the left join query string: keep every interval, attach prices where present
  leftjoin_str <- paste0("SELECT * FROM (",timeseriesquery,") as t LEFT JOIN (",
                         oq,") as p ON t.time_interval = p.time_interval;")
  return(leftjoin_str)
}
```