---
title: "Data Model"
author: "Jonathan Callahan"
date: "2022-10-31"
output: rmarkdown::html_vignette
vignette: >
  %\VignetteIndexEntry{Data Model}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r, echo=FALSE}
knitr::opts_chunk$set(fig.width = 7, fig.height = 5)
```

This vignette explores the _mts_monitor_ data model used throughout the 
**AirMonitor** package to store and work with monitoring data.

The **AirMonitor** package is designed to provide a compact, full-featured suite 
of utilities for working with PM2.5 data. A 
uniform data model provides consistent data access across monitoring data 
available from different agencies. The core data model in this package is 
defined by the _mts_monitor_ object used to store data associated with groups of 
individual monitors.

To work efficiently with the package it is important that you understand the structure 
of this data object and the functions that operate on it. Package functions whose
names begin with `monitor_` expect objects of class _mts_monitor_ as their first 
argument. (*'mts' stands for 'Multiple Time Series'*)

## Data Model

The **AirMonitor** package uses the _mts_ data model defined in the
**[MazamaTimeSeries](https://mazamascience.github.io/MazamaTimeSeries/)**
package.

In this data model, each unique time series is referred to as a 
_"device-deployment"_ -- a time series collected by a particular device at a 
specific location. Multiple device-deployments are stored in memory as an
_mts_monitor_ object, typically called `monitor`. Each `monitor` is just an R 
list with two dataframes:

`monitor$meta` -- rows = unique device-deployments; cols = device/location metadata

`monitor$data` -- rows = UTC times; cols = device-deployment data (plus an additional `datetime` column)

A key feature of this data model is the use of the `deviceDeploymentID` as a
"foreign key" that allows `data` columns to be mapped onto the associated
spatial and device metadata in a `meta` row. The following will always be true:

```
identical(names(monitor$data), c('datetime', monitor$meta$deviceDeploymentID))
```

Each column of `monitor$data` represents a time series associated with a particular
device-deployment while each row of `monitor$data` represents a _synoptic_ snapshot of all
measurements made at a particular time. 

In this manner, software can create both time series plots and maps from a single
`monitor` object in memory.
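
To make the structure concrete, here is a minimal base-R sketch of a list with the same shape as an _mts_monitor_ object. The IDs, locations and values are invented for illustration, and the toy list is not given the _mts_monitor_ class, so it is not a substitute for objects created by the package.

```r
# Toy object mimicking the shape of an mts_monitor (all values are made up)
ids <- c("loc1_dev1", "loc2_dev2")   # hypothetical deviceDeploymentIDs

toy <- list(
  meta = data.frame(
    deviceDeploymentID = ids,
    locationName = c("Site A", "Site B"),
    longitude = c(-120.1, -120.2),
    latitude = c(48.4, 48.5),
    stringsAsFactors = FALSE
  ),
  data = data.frame(
    datetime = seq(
      as.POSIXct("2015-08-01 00:00:00", tz = "UTC"),
      by = "hour",
      length.out = 3
    ),
    loc1_dev1 = c(10.1, 12.3, 11.8),
    loc2_dev2 = c(7.4, NA, 8.0)
  )
)

# The foreign-key relationship between 'data' columns and 'meta' rows
key_ok <- identical(names(toy$data), c("datetime", toy$meta$deviceDeploymentID))

# A column is the time series for one device-deployment
siteA_series <- toy$data[["loc1_dev1"]]

# A row is a synoptic snapshot; joined to 'meta' it has what a map needs
snapshot <- data.frame(
  toy$meta[, c("deviceDeploymentID", "longitude", "latitude")],
  value = unlist(toy$data[2, -1], use.names = FALSE)
)
```

Here `siteA_series` is what a time series plot would draw, while `snapshot` pairs each measurement from the second timestep with its coordinates.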

The `data` dataframe contains all hourly measurements organized with rows 
(the 'unlimited' dimension) as unique timesteps and columns as unique 
device-deployments. The very first column is always named `datetime` and contains 
the `POSIXct` datetime in Coordinated Universal Time (UTC). This time axis is 
guaranteed to be a regular hourly axis with no gaps.
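
One way to check the hourly-axis guarantee is to compare successive timestamps. The sketch below uses a synthetic `datetime` vector rather than package data, and `isRegularHourly()` is a hypothetical helper written for this illustration, not a package function.

```r
# Hypothetical helper: TRUE when every step between timestamps is one hour
isRegularHourly <- function(datetime) {
  steps <- as.numeric(diff(datetime), units = "hours")
  length(steps) > 0 && all(steps == 1)
}

# Synthetic 24-hour UTC axis
hourly <- seq(as.POSIXct("2015-08-01", tz = "UTC"), by = "hour", length.out = 24)

regular <- isRegularHourly(hourly)        # complete hourly axis
gappy   <- isRegularHourly(hourly[-5])    # one timestep removed, so a gap exists
```

With a real object, the same check could be applied to `monitor$data$datetime`.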

The `meta` dataframe contains all metadata associated with device-deployments 
and is organized with rows as unique device-deployments and columns containing
both location and device metadata. The following columns are guaranteed to exist
in the `meta` dataframe. Those marked with "(optional)" may contain `NA`s.
Additional columns may also be present depending on the data source.

* `deviceDeploymentID` -- unique ID associated with a time series
* `deviceID` -- unique device ID
* `deviceType` -- (optional) device type
* `deviceDescription` -- (optional) human readable device description
* `deviceExtra` -- (optional) additional human readable device information 
* `pollutant` -- pollutant name from `AirMonitor::pollutantNames`
* `units` -- one of `"PPM|PPB|UG/M3"`
* `dataIngestSource` -- (optional) source of data
* `dataIngestURL` -- (optional) URL used to access data
* `dataIngestUnitID` -- (optional) instrument identifier used at `dataIngestSource`
* `dataIngestExtra` -- (optional) human readable data ingest information
* `dataIngestDescription` -- (optional) human readable data ingest instructions
* `locationID` -- unique location ID from `MazamaLocationUtils::location_createID()`
* `locationName` -- human readable location name
* `longitude` -- longitude
* `latitude` -- latitude
* `elevation` -- (optional) elevation
* `countryCode` -- ISO 3166-1 alpha-2 country code
* `stateCode` -- ISO 3166-2 alpha-2 state code
* `countyName` -- US county name
* `timezone` -- Olson time zone
* `houseNumber` -- (optional) 
* `street` -- (optional)
* `city` -- (optional)
* `zip` -- (optional)
* `AQSID` -- (optional) EPA AQS unique identifier
* `fullAQSID` -- (optional) EPA AQS unique identifier
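
When building or modifying a `meta` dataframe by hand, it can help to test for the presence of these columns. The following sketch copies the names from the list above; `missingMetaNames()` and the toy dataframe are hypothetical and only illustrate the idea.

```r
# Column names copied from the list above
requiredMetaNames <- c(
  "deviceDeploymentID", "deviceID", "deviceType", "deviceDescription",
  "deviceExtra", "pollutant", "units", "dataIngestSource", "dataIngestURL",
  "dataIngestUnitID", "dataIngestExtra", "dataIngestDescription",
  "locationID", "locationName", "longitude", "latitude", "elevation",
  "countryCode", "stateCode", "countyName", "timezone", "houseNumber",
  "street", "city", "zip", "AQSID", "fullAQSID"
)

# Hypothetical helper: listed names that are absent from a 'meta' dataframe
missingMetaNames <- function(meta) {
  setdiff(requiredMetaNames, names(meta))
}

# Toy 'meta' with only three of the columns (values are made up)
toy_meta <- data.frame(
  deviceDeploymentID = "loc1_dev1",
  longitude = -120.1,
  latitude = 48.4
)

absent <- missingMetaNames(toy_meta)
length(absent)
```

For a `meta` dataframe that follows the data model, `missingMetaNames(monitor$meta)` would be expected to return an empty character vector.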

**Example 1: Exploring _mts_monitor_ objects**

We will use the built-in "NW_Megafires" dataset and various `monitor_filter*()` 
functions to subset an _mts_monitor_ object which we then examine.

```{r data_model_1}
library(AirMonitor)

# Recipe to select Washington state monitors in August of 2015:
monitor <-
  
  # 1) start with NW Megafires
  NW_Megafires %>%
  
  # 2) filter to only include Washington state
  monitor_filter(stateCode == "WA") %>%
  
  # 3) filter to only include August
  monitor_filterDate(20150801, 20150901) %>%
  
  # 4) remove monitors with all missing values
  monitor_dropEmpty()

# 'mts_monitor' objects can be identified by their class
class(monitor)

# They always have two elements called 'meta' and 'data'
names(monitor)

# Examine the 'meta' dataframe
dim(monitor$meta)
names(monitor$meta)

# Examine the 'data' dataframe
dim(monitor$data)

# This should always be true
identical(names(monitor$data), c('datetime', monitor$meta$deviceDeploymentID))
```
  
**Example 2: Basic manipulation of _mts_monitor_ objects**

The **AirMonitor** package has numerous functions that work with 
_mts_monitor_ objects, all of which begin with `monitor_`. If you need to do 
something that the package functions do not provide, you can manipulate 
_mts_monitor_ objects directly as long as you retain the structure of the data 
model.
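
In practice, "retaining the structure" mostly means keeping `meta` rows and `data` columns in sync and in the same order. The sketch below shows this for a manual subset in base R. The toy list and its IDs are invented for illustration, and it lacks the _mts_monitor_ class that real objects carry.

```r
# Toy object with three device-deployments (all values are made up)
ids <- c("a_1", "b_2", "c_3")

toy <- list(
  meta = data.frame(
    deviceDeploymentID = ids,
    stateCode = c("WA", "OR", "WA"),
    stringsAsFactors = FALSE
  ),
  data = data.frame(
    datetime = seq(as.POSIXct("2015-08-01", tz = "UTC"), by = "hour", length.out = 2),
    a_1 = c(1, 2),
    b_2 = c(3, 4),
    c_3 = c(5, 6)
  )
)

# IDs to keep, taken from 'meta'
keep <- toy$meta$deviceDeploymentID[toy$meta$stateCode == "WA"]

# Subset 'meta' rows and 'data' columns together, always keeping 'datetime' first
subset_toy <- list(
  meta = toy$meta[toy$meta$deviceDeploymentID %in% keep, , drop = FALSE],
  data = toy$data[, c("datetime", keep), drop = FALSE]
)

# The defining relationship still holds after the manual subset
still_valid <- identical(
  names(subset_toy$data),
  c("datetime", subset_toy$meta$deviceDeploymentID)
)
```

Subsetting only one of the two dataframes would break the relationship tested in the last step.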

Functions that accept and return _mts_monitor_ objects include:

 * `monitor_arrange()`
 * `monitor_aqi()`
 * `monitor_collapse()`
 * `monitor_combine()`
 * `monitor_dailyStatistic()`
 * `monitor_dailyThreshold()`
 * `monitor_dropEmpty()`
 * `monitor_filter()` (aka `monitor_filterMeta()`)
 * `monitor_filterByDistance()`
 * `monitor_filterDate()`
 * `monitor_filterDatetime()`
 * `monitor_mutate()`
 * `monitor_nowcast()`
 * `monitor_replaceValues()`
 * `monitor_select()` (aka `monitor_reorder()`)
 * `monitor_selectWhere()`
 * `monitor_slice()`
 * `monitor_trimDate()`

These functions can be used with the **magrittr** package pipe operator (`%>%`)
as in the following example:

```{r monitor_leaflet, results = "hold"}
# First, obtain the monitor IDs by clicking on dots in the interactive map:
NW_Megafires %>% monitor_leaflet()
```

```{r Methow_Valley, results = "hold"}
# Calculate daily means for the Methow Valley from monitors in Twisp and Winthrop

TwispID <- "99a6ee8e126ff8cf_530470009_04"
WinthropID <- "123035bbdc2bc702_530470010_04"

# Recipe to calculate Methow Valley August Means:
Methow_Valley_AugustMeans <- 
  
  # 1) start with NW Megafires
  NW_Megafires %>%
  
  # 2) select monitors from Twisp and Winthrop
  monitor_select(c(TwispID, WinthropID)) %>%
  
  # 3) average them together hour-by-hour
  monitor_collapse(deviceID = 'MethowValley') %>%
  
  # 4) restrict data to August
  monitor_filterDate(20150801, 20150901) %>%
  
  # 5) calculate daily mean
  monitor_dailyStatistic(mean, minHours = 18) %>%
  
  # 6) round data to one decimal place
  monitor_mutate(round, 1)

# Look at the first week
Methow_Valley_AugustMeans$data[1:7,]
```
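
The effect of a minimum-hours requirement such as `minHours = 18` can be illustrated with a base-R sketch on synthetic data. This is not the package implementation: `dailyMeanWithMinimum()` is a hypothetical helper, days are split on UTC boundaries for simplicity, and the package function may differ in details such as how day boundaries are chosen.

```r
# Synthetic hourly series covering two UTC days (values are made up)
datetime <- seq(as.POSIXct("2015-08-01", tz = "UTC"), by = "hour", length.out = 48)
pm25 <- rep(c(10, 20), each = 24)
pm25[25:34] <- NA   # day 2 is left with only 14 valid hours

# Hypothetical helper: mean of a day's values, or NA when too few hours are valid
dailyMeanWithMinimum <- function(x, minHours = 18) {
  if (sum(!is.na(x)) < minHours) NA_real_ else mean(x, na.rm = TRUE)
}

# Group hourly values by UTC date and apply the helper to each day
day <- as.Date(datetime, tz = "UTC")
dailyMeans <- tapply(pm25, day, dailyMeanWithMinimum)
dailyMeans
```

The first day has all 24 hours and gets a mean, while the second day falls below the threshold and is reported as missing rather than averaged from partial data.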

**Example 3: Advanced manipulation of _mts_monitor_ objects**

The following code demonstrates how to write a custom function and apply it to 
the `data` dataframe of an _mts_monitor_ object with `monitor_mutate()`.

```{r custom_use1}
# Monitors within 100 km of Spokane, WA
Spokane <-
  NW_Megafires %>%
  monitor_filterByDistance(-117.42, 47.70, 100000) %>%
  monitor_filterDate(20150801, 20150901) %>%
  monitor_dropEmpty()

# Show the daily statistic for one week
Spokane %>% 
  monitor_filterDate(20150801, 20150808) %>%
  monitor_dailyStatistic(mean) %>%
  monitor_getData()

# Custom function to convert from metric ug/m3 to imperial grain/gallon:
#   1 ug = 1e-6 g; 1 g = 15.43236 grains; 1 imperial gallon = 0.004546 m3
my_FUN <- function(x) { return( x * 1e-6 * 15.43236 * 0.004546 ) }
Spokane %>% 
  monitor_filterDate(20150801, 20150808) %>%
  monitor_mutate(my_FUN) %>%
  monitor_dailyStatistic(mean) %>%
  monitor_getData()
```

Understanding that `monitor$data` is just a dataframe of measurements
prepended with a `datetime` column, we can pull out the measurements and do analyses
independent of the _mts_monitor_ data model. Here we look for correlations among
the PM2.5 time series.

```{r custom_use2}
# Pull out the time series data to calculate correlations
Spokane_data <- 
  Spokane %>%
  monitor_getData() %>%
  dplyr::select(-1) # omit 'datetime' column

# Provide human readable names
names(Spokane_data) <- Spokane$meta$locationName

# Find correlation among monitors
cor(Spokane_data, use = "complete.obs")
```

This introduction to the _mts_monitor_ data model should be enough to get you started.
Lots more examples are available in the package documentation.

----

_Best of luck exploring and understanding PM2.5 air quality data!_