{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Regional frequency analyses" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "import matplotlib.pyplot as plt\n", "import xdatasets\n", "from lmoments3.distr import KappaGen\n", "from sklearn.cluster import HDBSCAN, OPTICS, AgglomerativeClustering\n", "\n", "import xhydro as xh\n", "import xhydro.frequency_analysis as xhfa\n", "import xhydro.gis as xhgis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "This notebook will demonstrate how to use the `xHydro` library to perform regional frequency analyses on a dataset of streamflow data. Since the initial steps for regional frequency analysis are the same as those for local frequency analysis, users are encouraged to refer to the [Local frequency analysis](local_frequency_analysis.ipynb) notebook for an overview.\n", "\n", "For this example, we will use the same dataset of hydrometric gauges covering parts of southern Quebec, ensuring continuity with the previous analysis while expanding to a regional scale. However, as regional analyses rely on having access to multiple sources of data, we will extract streamflow for 15 stations." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = (\n", " xdatasets.Query(\n", " **{\n", " \"datasets\": {\n", " \"deh\": {\n", " \"id\": [\"02*\"],\n", " \"regulated\": [\"Natural\"],\n", " \"variables\": [\"streamflow\"],\n", " }\n", " },\n", " \"time\": {\"start\": \"1970-01-01\", \"minimum_duration\": (30 * 365, \"d\")},\n", " }\n", " )\n", " .data.squeeze()\n", " .load()\n", ")\n", "\n", "# This dataset lacks some attributes, so let's add them.\n", "ds = ds.rename({\"streamflow\": \"q\"})\n", "ds[\"id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n", "ds[\"q\"].attrs = {\n", " \"long_name\": \"Streamflow\",\n", " \"units\": \"m3 s-1\",\n", " \"standard_name\": \"water_volume_transport_in_river_channel\",\n", " \"cell_methods\": \"time: mean\",\n", "}\n", "\n", "# Clean some of the coordinates that are not needed for this example\n", "ds = ds.drop_vars([c for c in ds.coords if c not in [\"id\", \"time\", \"name\"]])\n", "\n", "timeargs = {\n", " \"annual\": {},\n", "}\n", "\n", "ds_4fa = xh.indicators.get_yearly_op(\n", " ds, op=\"max\", timeargs=timeargs, missing=\"pct\", missing_options={\"tolerance\": 0.15}\n", ")\n", "ds_4fa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Explanatory variables\n", "\n", "In regional frequency analyses, explanatory variables are used to help explain the spatial variation in hydrological extremes across different locations. These variables can include factors such as catchment area, elevation, precipitation, land use, and soil type, among others. By incorporating explanatory variables, we can account for the influence of geographic and environmental characteristics on extreme events, allowing for more accurate regional predictions.\n", "\n", "### a) Extraction of watershed characteristics using `xhydro.gis`\n", "\n", "In this example, we'll use catchment properties as our primary explanatory variables. However, other variables, such as climatological averages or land use data, can also be incorporated depending on the analysis requirements. For detailed steps on how to extract and work with these data, refer to the [GIS operations](gis.ipynb) notebook.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "gdf = xdatasets.Query(\n", " **{\n", " \"datasets\": {\n", " \"deh_polygons\": {\n", " \"id\": [\"02*\"],\n", " \"regulated\": [\"Natural\"],\n", " \"variables\": [\"streamflow\"],\n", " }\n", " },\n", " \"time\": {\"start\": \"1970-01-01\", \"minimum_duration\": (30 * 365, \"d\")},\n", " }\n", ").data.reset_index()\n", "\n", "dswp = xhgis.watershed_properties(\n", " gdf[[\"Station\", \"geometry\"]], unique_id=\"Station\", output_format=\"xarray\"\n", ")\n", "dswp" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Principal component analysis\n", "\n", "After acquiring the explanatory variables, the next step is to process them using Principal Component Analysis (PCA) to reduce the dimensionality of the dataset. PCA helps simplify the dataset by transforming the original variables into a smaller set of uncorrelated components while retaining most of the variation in the data. This is accomplished using the function `xhydro.frequency_analysis.regional.fit_pca`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(xhfa.regional.fit_pca)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_pca, pca = xhfa.regional.fit_pca(dswp, n_components=3)\n", "data_pca" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The results show that the correlation between the components is close to 0, indicating that the first three components are sufficiently independent. This suggests that these components can be effectively used for the remainder of our analysis, as they capture most of the variation in the explanatory variables without significant overlap or multicollinearity.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data_pca.to_dataframe(name=\"value\").reset_index().pivot(\n", " index=\"Station\", columns=\"components\"\n", ").corr()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Clustering\n", "\n", "The results from the PCA can be used to group the stations into clusters based on their similarities in the principal components. Clustering helps identify regions with similar characteristics, enabling more targeted and accurate regional frequency analyses. This step is accomplished using `xhydro.frequency_analysis.regional.get_clusters`, which supports clustering methods from `sklearn.cluster`. In this example, we will use `AgglomerativeClustering` to form 3 clusters based on the PCA results.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(xhfa.regional.get_clusters)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "groups = xhfa.regional.get_clusters(\n", " AgglomerativeClustering, {\"n_clusters\": 3}, data_pca\n", ")\n", "groups" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ax = plt.subplot(1, 1, 1)\n", "gdf[gdf[\"Station\"].isin(groups[0])].plot(ax=ax, color=\"red\")\n", "gdf[gdf[\"Station\"].isin(groups[1])].plot(ax=ax, color=\"green\")\n", "gdf[gdf[\"Station\"].isin(groups[2])].plot(ax=ax, color=\"blue\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Regional Frequency Analysis\n", "\n", "**Hosking and Wallis** introduced a method for regional frequency analysis based on L-moments, which are particularly useful for analyzing extreme events across different regions. Here’s a brief overview of the key concepts:\n", "\n", "1. **L-Moments**: L-moments are statistical measures that are derived from linear combinations of order statistics. They offer a more robust alternative to traditional moments (like mean and variance), especially in the presence of outliers or when dealing with small sample sizes. L-moments are widely used in hydrological studies for estimating the characteristics of extreme values, as they provide more stable and reliable estimates.\n", "\n", "2. **Regional Frequency Analysis**: This technique involves pooling data from multiple sites or regions to estimate the frequency distribution of extreme events, such as floods or droughts. The approach allows for the determination of common statistical parameters across different locations, enabling a more generalized and regional perspective on extreme event behavior. Hosking and Wallis's method focuses on fitting a regional frequency distribution and assessing how well it captures the characteristics of the data.\n", "\n", "3. **Regional L-Moments**: These are L-moments calculated from pooled data across multiple sites within a region. By applying regional L-moments, we can estimate the parameters of regional frequency distributions and better understand the occurrence of extreme events in different parts of the region. This method improves the accuracy and robustness of predictions for extreme events, particularly when data from multiple sites are available.\n", "\n", "Let's start by calculating the L-moments for each station using `xhydro.frequency_analysis.regional.calc_moments`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(xhfa.regional.calc_moments)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_moments = xhfa.regional.calc_moments(ds_4fa)\n", "ds_moments" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "To perform regional frequency analysis, we need to reshape our datasets of annual maximums and L-moments based on the groupings identified by the clustering algorithm. Since there is no standardized naming convention for this new dimension, `xHydro` uses the name `region_id` for the cluster groupings. This reshaping process is handled by the function `xhydro.frequency_analysis.regional.group_ds_by_regions`, which organizes the data according to the cluster assignments and prepares it for further analysis.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "help(xhfa.regional.group_ds_by_regions)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_groups = xhfa.regional.group_ds_by_regions(ds_4fa, regions=groups)\n", "ds_moments_groups = xhfa.regional.group_ds_by_regions(ds_moments, regions=groups)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "We now need to compute two important scores for evaluating the regional frequency analysis: the *H-Score* and the *Z-Score*. The H-Score measures the homogeneity of data across different sites or regions relative to the regional model. It helps determine how well the data from multiple sites align with the regional distribution. The Z-Score quantifies the discrepancy between the observed and expected values, standardized by their variability. It helps determine how well the theoretical distribution (based on the regional model) fits the observed data.\n", "\n", "- **H-Score (Homogeneity Score)**\n", "\n", " - **H < 1: Homogeneous** \n", " This indicates that the data from different sites are quite similar and align well with the regional model, suggesting that the regional model is suitable for the entire region.\n", "\n", " - **1 ≤ H < 2: Maybe Homogeneous** \n", " This range suggests some heterogeneity, meaning the data might still generally fit the regional model, but there are some variations that the model does not fully capture.\n", "\n", " - **H ≥ 2: Heterogeneous** \n", " A score this high indicates significant differences between sites or regions, suggesting that the regional model might not be appropriate for all the data. The region may be too diverse or require model adjustments.\n", "\n", "- **Z-Score (Goodness of Fit)**\n", "\n", " - **Low Z-Score**: A low Z-Score suggests a good fit between the model and the observed data. Typically, an absolute Z-Score below 1.64 indicates that the model is appropriate and the fit is statistically acceptable.\n", "\n", " - **High Z-Score**: A high Z-Score suggests significant discrepancies between observed and expected values. An absolute Z-Score above 1.64 implies that the model might not fit the data well, and further adjustments or a different model might be needed.\n", "\n", "The function `xhydro.frequency_analysis.regional.calc_h_z` is used to compute the *H-Score* and *Z-Score* for the regional frequency analysis.\n", "\n", "