{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# Regional frequency analyses"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "import xdatasets\n",
    "from lmoments3.distr import KappaGen\n",
    "from sklearn.cluster import HDBSCAN, OPTICS, AgglomerativeClustering\n",
    "\n",
    "import xhydro as xh\n",
    "import xhydro.frequency_analysis as xhfa\n",
    "import xhydro.gis as xhgis"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "This notebook will demonstrate how to use the `xHydro` library to perform regional frequency analyses on a dataset of streamflow data. Since the initial steps for regional frequency analysis are the same as those for local frequency analysis, users are encouraged to refer to the [Local frequency analysis](local_frequency_analysis.ipynb) notebook for an overview.\n",
    "\n",
    "For this example, we will use the same dataset of hydrometric gauges covering parts of southern Quebec, ensuring continuity with the previous analysis while expanding to a regional scale. However, as regional analyses rely on having access to multiple sources of data, we will extract streamflow for 15 stations."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds = (\n",
    "    xdatasets.Query(\n",
    "        **{\n",
    "            \"datasets\": {\n",
    "                \"deh\": {\n",
    "                    \"id\": [\"02*\"],\n",
    "                    \"regulated\": [\"Natural\"],\n",
    "                    \"variables\": [\"streamflow\"],\n",
    "                }\n",
    "            },\n",
    "            \"time\": {\"start\": \"1970-01-01\", \"minimum_duration\": (30 * 365, \"d\")},\n",
    "        }\n",
    "    )\n",
    "    .data.squeeze()\n",
    "    .load()\n",
    ")\n",
    "\n",
    "# This dataset lacks some attributes, so let's add them.\n",
    "ds = ds.rename({\"streamflow\": \"q\"})\n",
    "ds[\"id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n",
    "ds[\"q\"].attrs = {\n",
    "    \"long_name\": \"Streamflow\",\n",
    "    \"units\": \"m3 s-1\",\n",
    "    \"standard_name\": \"water_volume_transport_in_river_channel\",\n",
    "    \"cell_methods\": \"time: mean\",\n",
    "}\n",
    "\n",
    "# Clean some of the coordinates that are not needed for this example\n",
    "ds = ds.drop_vars([c for c in ds.coords if c not in [\"id\", \"time\", \"name\"]])\n",
    "\n",
    "timeargs = {\n",
    "    \"annual\": {},\n",
    "}\n",
    "\n",
    "ds_4fa = xh.indicators.get_yearly_op(\n",
    "    ds, op=\"max\", timeargs=timeargs, missing=\"pct\", missing_options={\"tolerance\": 0.15}\n",
    ")\n",
    "ds_4fa"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Explanatory variables\n",
    "\n",
    "In regional frequency analyses, explanatory variables are used to help explain the spatial variation in hydrological extremes across different locations. These variables can include factors such as catchment area, elevation, precipitation, land use, and soil type, among others. By incorporating explanatory variables, we can account for the influence of geographic and environmental characteristics on extreme events, allowing for more accurate regional predictions.\n",
    "\n",
    "### a) Extraction of watershed characteristics using `xhydro.gis`\n",
    "\n",
    "In this example, we'll use catchment properties as our primary explanatory variables. However, other variables, such as climatological averages or land use data, can also be incorporated depending on the analysis requirements. For detailed steps on how to extract and work with these data, refer to the [GIS operations](gis.ipynb) notebook.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "gdf = xdatasets.Query(\n",
    "    **{\n",
    "        \"datasets\": {\n",
    "            \"deh_polygons\": {\n",
    "                \"id\": [\"02*\"],\n",
    "                \"regulated\": [\"Natural\"],\n",
    "                \"variables\": [\"streamflow\"],\n",
    "            }\n",
    "        },\n",
    "        \"time\": {\"start\": \"1970-01-01\", \"minimum_duration\": (30 * 365, \"d\")},\n",
    "    }\n",
    ").data.reset_index()\n",
    "\n",
    "dswp = xhgis.watershed_properties(\n",
    "    gdf[[\"Station\", \"geometry\"]], unique_id=\"Station\", output_format=\"xarray\"\n",
    ")\n",
    "dswp"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b) Principal component analysis\n",
    "\n",
    "After acquiring the explanatory variables, the next step is to process them using Principal Component Analysis (PCA) to reduce the dimensionality of the dataset. PCA helps simplify the dataset by transforming the original variables into a smaller set of uncorrelated components while retaining most of the variation in the data. This is accomplished using the function `xhydro.frequency_analysis.regional.fit_pca`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.fit_pca)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_pca, pca = xhfa.regional.fit_pca(dswp, n_components=3)\n",
    "data_pca"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The results show that the correlation between the components is close to 0, indicating that the first three components are sufficiently independent. This suggests that these components can be effectively used for the remainder of our analysis, as they capture most of the variation in the explanatory variables without significant overlap or multicollinearity.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_pca.to_dataframe(name=\"value\").reset_index().pivot(\n",
    "    index=\"Station\", columns=\"components\"\n",
    ").corr()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### c) Clustering\n",
    "\n",
    "The results from the PCA can be used to group the stations into clusters based on their similarities in the principal components. Clustering helps identify regions with similar characteristics, enabling more targeted and accurate regional frequency analyses. This step is accomplished using `xhydro.frequency_analysis.regional.get_clusters`, which supports clustering methods from `sklearn.cluster`. In this example, we will use `AgglomerativeClustering` to form 3 clusters based on the PCA results.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.get_clusters)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = xhfa.regional.get_clusters(\n",
    "    AgglomerativeClustering, {\"n_clusters\": 3}, data_pca\n",
    ")\n",
    "groups"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ax = plt.subplot(1, 1, 1)\n",
    "gdf[gdf[\"Station\"].isin(groups[0])].plot(ax=ax, color=\"red\")\n",
    "gdf[gdf[\"Station\"].isin(groups[1])].plot(ax=ax, color=\"green\")\n",
    "gdf[gdf[\"Station\"].isin(groups[2])].plot(ax=ax, color=\"blue\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Regional Frequency Analysis\n",
    "\n",
    "**Hosking and Wallis** introduced a method for regional frequency analysis based on L-moments, which are particularly useful for analyzing extreme events across different regions. Here’s a brief overview of the key concepts:\n",
    "\n",
    "1. **L-Moments**: L-moments are statistical measures that are derived from linear combinations of order statistics. They offer a more robust alternative to traditional moments (like mean and variance), especially in the presence of outliers or when dealing with small sample sizes. L-moments are widely used in hydrological studies for estimating the characteristics of extreme values, as they provide more stable and reliable estimates.\n",
    "\n",
    "2. **Regional Frequency Analysis**: This technique involves pooling data from multiple sites or regions to estimate the frequency distribution of extreme events, such as floods or droughts. The approach allows for the determination of common statistical parameters across different locations, enabling a more generalized and regional perspective on extreme event behavior. Hosking and Wallis's method focuses on fitting a regional frequency distribution and assessing how well it captures the characteristics of the data.\n",
    "\n",
    "3. **Regional L-Moments**: These are L-moments calculated from pooled data across multiple sites within a region. By applying regional L-moments, we can estimate the parameters of regional frequency distributions and better understand the occurrence of extreme events in different parts of the region. This method improves the accuracy and robustness of predictions for extreme events, particularly when data from multiple sites are available.\n",
    "\n",
    "Let's start by calculating the L-moments for each station using `xhydro.frequency_analysis.regional.calc_moments`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.calc_moments)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_moments = xhfa.regional.calc_moments(ds_4fa)\n",
    "ds_moments"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "To perform regional frequency analysis, we need to reshape our datasets of annual maximums and L-moments based on the groupings identified by the clustering algorithm. Since there is no standardized naming convention for this new dimension, `xHydro` uses the name `region_id` for the cluster groupings. This reshaping process is handled by the function `xhydro.frequency_analysis.regional.group_ds_by_regions`, which organizes the data according to the cluster assignments and prepares it for further analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.group_ds_by_regions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_groups = xhfa.regional.group_ds_by_regions(ds_4fa, regions=groups)\n",
    "ds_moments_groups = xhfa.regional.group_ds_by_regions(ds_moments, regions=groups)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We now need to compute two important scores for evaluating the regional frequency analysis: the *H-Score* and the *Z-Score*. The H-Score measures the homogeneity of data across different sites or regions relative to the regional model. It helps determine how well the data from multiple sites align with the regional distribution. The Z-Score quantifies the discrepancy between the observed and expected values, standardized by their variability. It helps determine how well the theoretical distribution (based on the regional model) fits the observed data.\n",
    "\n",
    "- **H-Score (Homogeneity Score)**\n",
    "\n",
    "  - **H < 1: Homogeneous**  \n",
    "  This indicates that the data from different sites are quite similar and align well with the regional model, suggesting that the regional model is suitable for the entire region.\n",
    "\n",
    "  - **1 ≤ H < 2: Maybe Homogeneous**  \n",
    "  This range suggests some heterogeneity, meaning the data might still generally fit the regional model, but there are some variations that the model does not fully capture.\n",
    "\n",
    "  - **H ≥ 2: Heterogeneous**  \n",
    "  A score this high indicates significant differences between sites or regions, suggesting that the regional model might not be appropriate for all the data. The region may be too diverse or require model adjustments.\n",
    "\n",
    "- **Z-Score (Goodness of Fit)**\n",
    "\n",
    "  - **Low Z-Score**: A low Z-Score suggests a good fit between the model and the observed data. Typically, an absolute Z-Score below 1.64 indicates that the model is appropriate and the fit is statistically acceptable.\n",
    "\n",
    "  - **High Z-Score**: A high Z-Score suggests significant discrepancies between observed and expected values. An absolute Z-Score above 1.64 implies that the model might not fit the data well, and further adjustments or a different model might be needed.\n",
    "\n",
    "The function `xhydro.frequency_analysis.regional.calc_h_z` is used to compute the *H-Score* and *Z-Score* for the regional frequency analysis.\n",
    "\n",
    "<div class=\"alert alert-warning\"><b>WARNING</b>\n",
    "    \n",
    "  The `kap` argument in the `xhydro.frequency_analysis.regional.calc_h_z` function expects a Kappa3 distribution object generated using `lmoments3.distr.KappaGen()`. However, due to licensing issues, the `lmoments3` library cannot currently be included in the `xHydro` requirements and must be installed independently.\n",
    "  \n",
    "</div>\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.calc_h_z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_hz = xhfa.regional.calc_h_z(ds_groups, ds_moments_groups, kap=KappaGen())\n",
    "ds_hz"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "We can filter the results to include only the data that has H and Z scores below given thresholds using the function `xhydro.frequency_analysis.regional.mask_h_z`. By default, the thresholds are set to 1 for H and 1.64 for Z. However, these thresholds can be customized to suit the specific requirements of the analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.mask_h_z)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "mask = xhfa.regional.mask_h_z(ds_hz)\n",
    "ds_groups_h1 = ds_groups.where(mask).load()\n",
    "ds_moments_groups_h1 = ds_moments_groups.where(mask).load()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The final step of the regional frequency analysis is similar to the local frequency analysis, but instead of using `xhydro.frequency_analysis.local.parametric_quantiles`, we use `xhydro.frequency_analysis.regional.calculate_return_period` to calculate the return periods. Additionally, to avoid performing the analysis on very small regions, the `remove_small_regions` argument can be used to exclude any regions with less than a specified number of stations. By default, this threshold is set to 5.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.calculate_return_period)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.regional.remove_small_regions)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_t = xhfa.regional.calculate_return_period(\n",
    "    ds_groups_h1, ds_moments_groups_h1, return_period=[2, 20, 100]\n",
    ")\n",
    "q_t"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_t = xhfa.regional.remove_small_regions(q_t)\n",
    "q_t"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Let's plot results for one station, comparing local and regional analyses."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_reg = q_t.sel(id=\"023401\").dropna(dim=\"region_id\", how=\"all\")\n",
    "reg = q_reg.q_max_annual.squeeze()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "params_loc = xhfa.local.fit(ds_4fa.sel(id=\"023401\"))\n",
    "q_loc = xhfa.local.parametric_quantiles(params_loc, return_period=[2, 20, 100])\n",
    "loc = q_loc.sel(scipy_dist=\"genextreme\").q_max_annual"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig = plt.figure(figsize=(15, 4))\n",
    "plt.plot(reg.return_period.values, reg.values, \"red\", label=\"Regional analysis\")\n",
    "plt.plot(loc.return_period.values, loc.values, \"black\", label=\"Local analysis\")\n",
    "plt.xscale(\"log\")\n",
    "plt.grid(visible=True)\n",
    "plt.xlabel(\"Return period (years)\")\n",
    "plt.ylabel(\"Streamflow (m$^3$/s)\")\n",
    "plt.legend()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## Uncertainties\n",
    "\n",
    "Uncertainties are an important aspect of frequency analysis and should be considered when interpreting results. These uncertainties often stem from data quality, the choice of distribution, and the estimation of parameters. While visualizations can provide insights into the model fit, it’s crucial to quantify and account for uncertainties, such as confidence intervals for parameter estimates, to ensure robust conclusions.\n",
    "\n",
    "### a) Bootstrapping the observations\n",
    "\n",
    "One method for quantifying uncertainties is to bootstrap the observations. In this example, we will perform bootstrapping a small number of times to illustrate the process, though in practice, a higher number of iterations (e.g., 5000) is recommended to obtain more reliable estimates. Bootstrapping resamples the observed data with replacement to generate multiple datasets, which can then be used to assess the variability in the model's parameters and results.\n",
    "\n",
    "This can be accomplished by calling `xhydro.frequency_analysis.uncertainties.bootstrap_obs`. One key difference with local analyses is that for regional analyses, no fitting step is involved, making this function significantly faster. Instead, the results are obtained by first using `xhydro.frequency_analysis.uncertainties.calc_moments_iter` to compute L-moments for all bootstrap samples. Then, `xhydro.frequency_analysis.uncertainties.calculate_quantiles_over_boostraped_groups` is used to calculate the return periods based on those L-moments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.uncertainties.calculate_quantiles_over_boostraped_groups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.uncertainties.calculate_quantiles_over_boostraped_groups)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_reg_boot = xhfa.uncertainties.bootstrap_obs(ds_4fa, n_samples=35)\n",
    "ds_moments_iter = xhfa.uncertainties.calc_moments_iter(ds_reg_boot).load()\n",
    "ds_moments_iter"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_reg_boot = xhfa.uncertainties.calculate_quantiles_over_boostraped_groups(\n",
    "    \"023401\",\n",
    "    ds_groups_h1,\n",
    "    ds_moments_iter,\n",
    "    return_period=[2, 20, 100],\n",
    ")\n",
    "q_reg_boot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### b) Using multiple regions\n",
    "\n",
    "Another approach to estimating uncertainty is by considering multiple regions for a single catchment of interest. This can be done by applying different clustering methods or by performing a jackknife procedure on the station list. While this process can be computationally intensive, we present a simplified example here to illustrate the potential of this method. The main objective is to demonstrate how considering multiple regions can help in evaluating uncertainty in the regional frequency analysis.\n",
    "\n",
    "First, we will use `xhydro.frequency_analysis.uncertainties.generate_combinations` on the results from our PCA to generate lists of stations. This function creates combinations that exclude a certain number of stations each time, allowing us to evaluate how the exclusion of different stations impacts the groupings.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "help(xhfa.uncertainties.generate_combinations)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "combinations_list = xhfa.uncertainties.generate_combinations(data_pca, n=2)\n",
    "print(\n",
    "    f\"This generated {len(combinations_list)} lists of stations, varying from {len(combinations_list[0])} to {len(combinations_list[-1])} stations.\"\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Next, we will combine those lists with three clustering methods and, for each method, we'll try to change some of the parameters.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "clust_options = {\n",
    "    AgglomerativeClustering: {\"arg_name\": \"n_clusters\", \"range\": range(2, 12)},\n",
    "    HDBSCAN: {\"arg_name\": \"min_cluster_size\", \"range\": range(6, 7)},\n",
    "    OPTICS: {\"arg_name\": \"min_samples\", \"range\": range(4, 5)},\n",
    "}"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "groups = []\n",
    "\n",
    "for model in [AgglomerativeClustering, HDBSCAN, OPTICS]:\n",
    "\n",
    "    for p in clust_options[model][\"range\"]:\n",
    "        d_param = {}\n",
    "        d_param[clust_options[model][\"arg_name\"]] = p\n",
    "        for combination in combinations_list:\n",
    "            # Extract data for the current combination\n",
    "            data_com = data_pca.sel(Station=list(combination))\n",
    "            # Get groups from the fit and add to the list\n",
    "            groups = groups + xhfa.regional.get_clusters(model, d_param, data_com)\n",
    "unique_groups = [list(x) for x in {tuple(x) for x in groups}]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "The next steps follow a similar approach to the previous ones, but now with multiple regions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "ds_groups_multiregion = xhfa.regional.group_ds_by_regions(ds_4fa, regions=unique_groups)\n",
    "ds_moments_groups_multiregion = xhfa.regional.group_ds_by_regions(\n",
    "    ds_moments, regions=unique_groups\n",
    ")\n",
    "ds_h_z_multiregion = xhfa.regional.calc_h_z(\n",
    "    ds_groups_multiregion, ds_moments_groups_multiregion, kap=KappaGen()\n",
    ")\n",
    "mask_multiregion = xhfa.regional.mask_h_z(ds_h_z_multiregion)\n",
    "ds_groups_h1_multiregion = ds_groups_multiregion.where(mask_multiregion).load()\n",
    "ds_moments_groups_h1_multiregion = ds_moments_groups_multiregion.where(\n",
    "    mask_multiregion\n",
    ").load()\n",
    "\n",
    "q_t_multiregion = xhfa.regional.calculate_return_period(\n",
    "    ds_groups_h1_multiregion,\n",
    "    ds_moments_groups_h1_multiregion,\n",
    "    return_period=[2, 20, 100],\n",
    ")\n",
    "q_t_multiregion = xhfa.regional.remove_small_regions(q_t_multiregion)\n",
    "\n",
    "q = q_t_multiregion.sel(id=\"023401\").dropna(dim=\"region_id\", how=\"all\")\n",
    "q"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### c) Combining the bootstrap with multiple regions\n",
    "\n",
    "It is possible to combine both methods—bootstrapping and multiple regions—by integrating the following:\n",
    "\n",
    "1. Bootstrapped L-moments from `xhydro.frequency_analysis.uncertainties.calc_moments_iter`.\n",
    "2. Masked results from `xhydro.frequency_analysis.regional.group_ds_by_regions` that have been filtered based on the H and Z-Scores.\n",
    "\n",
    "When combining these methods, `xhydro.frequency_analysis.uncertainties.calculate_quantiles_over_boostraped_groups` will check in how many distinct `region_id` values the station is present and will stack the data accordingly with the bootstrap samples. For instance, if we have 35 bootstraps and roughly 220 regions, this will generate a total of around 7,700 samples for further analysis.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "q_reg_multiregion_boot = xhfa.uncertainties.calculate_quantiles_over_boostraped_groups(\n",
    "    \"023401\",\n",
    "    ds_groups_h1_multiregion,\n",
    "    ds_moments_iter,\n",
    "    return_period=[2, 20, 100],\n",
    ")\n",
    "q_reg_multiregion_boot"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### d) Comparison\n",
    "\n",
    "Let's show the difference between approaches."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "fig, ax = plt.subplots(1, 3)\n",
    "fig.set_figheight(4)\n",
    "fig.set_figwidth(15)\n",
    "\n",
    "# Subset the data\n",
    "q_reg_boots = q_reg_boot.q_max_annual.sel(id=\"023401\")\n",
    "q_t_multiregions = q_t_multiregion.q_max_annual.sel(id=\"023401\")\n",
    "q_t_multiregion_boots = q_reg_multiregion_boot.q_max_annual.sel(id=\"023401\")\n",
    "\n",
    "\n",
    "def _make_plot(data, dim, label):\n",
    "    # Original fit\n",
    "    ax.plot(\n",
    "        loc.return_period.values,\n",
    "        loc,\n",
    "        \"black\",\n",
    "        label=\"Local frequency analysis\",\n",
    "    )\n",
    "    ax.plot(\n",
    "        reg.return_period.values,\n",
    "        reg,\n",
    "        \"red\",\n",
    "        label=\"Original regional frequency analysis\",\n",
    "    )\n",
    "\n",
    "    ax.plot(\n",
    "        data.return_period.values,\n",
    "        data.quantile(0.5, dim),\n",
    "        \"green\",\n",
    "        label=label,\n",
    "    )\n",
    "    data_05 = data.quantile(0.05, dim)\n",
    "    data_95 = data.quantile(0.95, dim)\n",
    "    ax.fill_between(\n",
    "        data.return_period.values, data_05, data_95, alpha=0.2, color=\"green\"\n",
    "    )\n",
    "\n",
    "    plt.xscale(\"log\")\n",
    "    plt.grid(visible=True)\n",
    "    plt.xlabel(\"Return period (years)\")\n",
    "    plt.ylabel(\"Streamflow (m$^3$/s)\")\n",
    "    plt.ylim([150, 500])\n",
    "    ax.legend()\n",
    "\n",
    "\n",
    "ax = plt.subplot(1, 3, 1)\n",
    "_make_plot(q_reg_boots, \"samples\", \"Bootstrapped observations\")\n",
    "ax = plt.subplot(1, 3, 2)\n",
    "_make_plot(q_t_multiregions, \"region_id\", \"Multiple regions\")\n",
    "ax = plt.subplot(1, 3, 3)\n",
    "_make_plot(q_t_multiregion_boots, \"samples\", \"Combined\")"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.13.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}