{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Optimal interpolation" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optimal interpolation is a method used to combine spatially distributed data (the \"background field\") with point-based observations. This technique adjusts the entire field by incorporating deviations between the observed data and the field at the observation points, resulting in a statistically optimal adjustment of the background field. For example, it can be used to blend reanalysis precipitation data (such as ERA5) with actual observational records, ensuring that the reanalysis precipitation is corrected over the entire domain.\n", "\n", "This page demonstrates how to use `xHydro` to perform optimal interpolation for hydrological modeling by integrating field-like simulations with point observations. In this case, the background field consists of outputs from a distributed hydrological model, while the point observations correspond to real hydrometric station measurements. The goal is to correct the background field (i.e., the hydrological model outputs) using optimal interpolation techniques, following the approach outlined in Lachance-Cloutier et al. (2017).\n", "\n", "*Lachance-Cloutier, S., Turcotte, R. and Cyr, J.F., 2017. Combining streamflow observations and hydrologic simulations for the retrospective estimation of daily streamflow for ungauged rivers in southern Quebec (Canada). Journal of hydrology, 550, pp.294-306.*\n", "\n", "Optimal interpolation relies on a set of hyperparameters. Some of these are more complex than others, so let’s break down the main steps.\n", "\n", "The first step is to compute the differences (or \"departures\") between the observed and simulated flow at the stations where both values are available. These differences must be scaled by the catchment area to ensure that errors are relative and can be properly interpolated. Also, we take the logarithm of these scaled values to prevent negative streamflow during extrapolation. We will reverse this transformation later in the process.\n", "\n", "Next, we need some additional information, which may or may not be available for our observation and simulation sites. These include estimates of:\n", "\n", "* The variance of the observations at the gauged sites.\n", "* The variance of the simulated flows at the observation sites.\n", "* The variance of the simulated flows at the estimation sites, including those that also correspond to an observation site.\n", "\n", "These can be estimated in real-world applications using long time series of log-transformed and scaled flows, or from measurement errors associated with the instrumentation at gauged sites. These parameters can also be fine-tuned based on past experience or through trial-and-error.\n", "\n", "The final component we need is the error covariance function (ECF). In simple terms, optimal interpolation takes into account the distance between an observation (or multiple observations) and the site where we need to estimate a new flow value. Intuitively, a simulation station close to an observation station should have a high correlation with it, while a station farther away will have a lower correlation. Therefore, we need a covariance function that estimates:\n", "\n", "1. The degree of covariability between an observed and simulated point.\n", "2. The distance between these points. \n", "\n", "The ECF function is key to this, and several models of it exist in the literature. In many cases, a model form will be chosen *a priori*, and its parameters will be adjusted to best represent the covariance between points.\n", "\n", "In this test example, we don’t have enough points or time steps to develop a meaningful model (or parameterization) from the data. As a result, we will impose a model. `xHydro` includes four built-in models, where `par[0]` and `par[1]` are the model parameters to be calibrated (under normal circumstances), and *h* represents the distance between points:\n", "\n", "* **Model 1**: \n", " $$\n", " \\begin{flalign*}\n", " &\\text{par}[0] \\cdot \\left( 1 + \\frac{h}{\\text{par}[1]} \\right) \\cdot \\exp\\left(- \\frac{h}{\\text{par}[1]} \\right) && \\text{— From Lachance-Cloutier et al. 2017.}\n", " \\end{flalign*}\n", " $$\n", "* **Model 2**:\n", " $$\n", " \\begin{flalign*}\n", " &\\text{par}[0] \\cdot \\exp\\left( -0.5 \\cdot \\left( \\frac{h}{\\text{par}[1]} \\right)^2 \\right) &&\n", " \\end{flalign*}\n", " $$\n", "* **Model 3**:\n", " $$\n", " \\begin{flalign*}\n", " &\\text{par}[0] \\cdot \\exp\\left( -\\frac{h}{\\text{par}[1]} \\right) &&\n", " \\end{flalign*}\n", " $$\n", "* **Model 4**:\n", " $$\n", " \\begin{flalign*}\n", " &\\text{par}[0] \\cdot \\exp\\left( -\\frac{h^{\\text{par}[1]}}{\\text{par}[0]} \\right) &&\n", " \\end{flalign*}\n", " $$" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import datetime as dt\n", "from functools import partial\n", "from pathlib import Path\n", "\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "import pandas as pd\n", "import pooch\n", "import xarray as xr\n", "from scipy.stats import norm\n", "\n", "import xhydro as xh\n", "import xhydro.optimal_interpolation\n", "from xhydro.testing.helpers import deveraux" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example with HYDROTEL data" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Optimal interpolation relies on both observed and simulated datasets and requires the following information:\n", "\n", "* Observed data for the gauged locations\n", "* Simulated data for all locations\n", "* Catchment areas (for error scaling)\n", "* Catchment latitude and longitude (to develop the spatial error model)\n", "\n", "This example will use a subset of data generated using the HYDROTEL hydrological model." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get data\n", "test_data_path = deveraux().fetch(\n", " \"optimal_interpolation/OI_data_corrected.zip\",\n", " pooch.Unzip(),\n", ")\n", "directory_to_extract_to = Path(test_data_path[0]).parent\n", "\n", "# Read-in all the files and set to paths that we can access later.\n", "qobs = xr.open_dataset(directory_to_extract_to / \"A20_HYDOBS_TEST_corrected.nc\")\n", "qsim = xr.open_dataset(directory_to_extract_to / \"A20_HYDREP_TEST_corrected.nc\")\n", "station_correspondence = xr.open_dataset(\n", " directory_to_extract_to / \"station_correspondence.nc\"\n", ")\n", "df_validation = pd.read_csv(\n", " directory_to_extract_to / \"stations_retenues_validation_croisee.csv\",\n", " sep=None,\n", " dtype=str,\n", ")\n", "observation_stations = list(df_validation[\"No_station\"])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are three datasets, as well as a list:\n", "\n", "- **qobs**: The dataset containing point observations and station metadata.\n", "- **qsim**: The dataset containing the background field simulations (e.g. the raw HYDROTEL results), including simulated station metadata.\n", "- **station_correspondence**: A dataset that simply links station identifiers between the observed and simulated stations. This is necessary because observed stations use \"real-world\" identifiers, while distributed simulations often employ coded or sequentially numbered identifiers.\n", "- **observation_stations**: A list of the stations from the observation set that we want to use to build the optimal interpolation.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qobs" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "qsim" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "station_correspondence" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "print(\n", " f\"There are a total of {len(observation_stations)} selected observation stations.\"\n", ")\n", "print(observation_stations)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "