6. Local frequency analyses

[1]:

# Basic imports
import hvplot.xarray  # noqa
import numpy as np
import xarray as xr
import xdatasets

import xhydro as xh
import xhydro.frequency_analysis as xhfa

6.3. Local frequency analysis

After extracting the raw data, such as annual maxima or minima, the local frequency analysis is performed in three steps:

1. Use xhydro.frequency_analysis.local.fit to estimate the best set of parameters for each of a given set of statistical distributions.
2. (Optional) Use xhydro.frequency_analysis.local.criteria to compute goodness-of-fit criteria and assess how well each statistical distribution fits the data.
3. Use xhydro.frequency_analysis.local.parametric_quantiles to calculate return levels based on the fitted parameters.

To speed up the Notebook, we’ll only perform the analysis on a subset of variables.

[15]:

help(xhfa.local.fit)

Help on function fit in module xhydro.frequency_analysis.local:

fit(
    ds,
    *,
    distributions: list[str] | None = None,
    min_years: int | None = None,
    method: str = 'ML',
    periods: list[str] | list[list[str]] | None = None
) -> xr.Dataset
    Fit multiple distributions to data.

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing the data to fit. All variables will be fitted.
    distributions : list of str, optional
        List of distribution names as defined in `scipy.stats`. See https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions.
        Defaults to ["genextreme", "pearson3", "gumbel_r", "expon"].
    min_years : int, optional
        Minimum number of years required for a distribution to be fitted.
    method : str
        Fitting method. Defaults to "ML" (maximum likelihood).
    periods : list of str or list of list of str, optional
        Either [start, end] or list of [start, end] of periods to be considered.
        If multiple periods are given, the output will have a `horizon` dimension.
        If None, all data is used.

    Returns
    -------
    xr.Dataset
        Dataset containing the parameters of the fitted distributions, with a new dimension `scipy_dist` containing the distribution names.

    Notes
    -----
    In order to combine the parameters of multiple distributions, the size of the `dparams` dimension is set to the
    maximum number of unique parameters between the distributions.

The fit function can fit several statistical distributions at once, for example ["genextreme", "pearson3", "gumbel_r", "expon"]. Because distributions differ in their parameter sets (and sometimes in their naming conventions), xHydro stores all parameters along a shared dparams dimension and fills the slots that a distribution does not use with NaN. When the results are passed on to SciPy, for instance through the parametric_quantiles function, only the parameters relevant to each distribution are used. The fitted distributions are identified along a newly created scipy_dist dimension.
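The padding idea can be illustrated with a minimal, standard-library sketch. It is not xHydro's implementation: the helper names (build_dparams, pad_parameters, relevant_parameters) and the fitted values are hypothetical, and only the parameter names follow scipy.stats conventions.

```python
import math

# Parameter names per distribution, following scipy.stats conventions
# (shape parameter first, then loc and scale).
DIST_PARAMS = {
    "genextreme": ["c", "loc", "scale"],
    "pearson3": ["skew", "loc", "scale"],
    "gumbel_r": ["loc", "scale"],
    "expon": ["loc", "scale"],
}


def build_dparams(dist_params):
    """Return the union of parameter names: shapes first, then loc and scale."""
    shapes = []
    for names in dist_params.values():
        for name in names:
            if name not in ("loc", "scale") and name not in shapes:
                shapes.append(name)
    return shapes + ["loc", "scale"]


def pad_parameters(fitted, dparams):
    """Align one distribution's values on dparams, using NaN for unused slots."""
    return [fitted.get(name, math.nan) for name in dparams]


def relevant_parameters(row, dparams, names):
    """Recover only the parameters that a given distribution actually uses."""
    return {name: row[dparams.index(name)] for name in names}


dparams = build_dparams(DIST_PARAMS)

# Hypothetical fitted values for a Gumbel distribution (illustrative only).
gumbel_row = pad_parameters({"loc": 120.0, "scale": 35.0}, dparams)
gumbel_args = relevant_parameters(gumbel_row, dparams, DIST_PARAMS["gumbel_r"])

print(dparams)
print(gumbel_row)
print(gumbel_args)
```

The resulting dparams order matches the coordinate shown in the fit output below, and the Gumbel row carries NaN in the c and skew slots.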

[16]:

params = xhfa.local.fit(ds_4fa[["q_max_spring", "volume_sum_spring"]], min_years=15)

params

[16]:

<xarray.Dataset> Size: 820B
Dimensions:            (scipy_dist: 4, id: 2, dparams: 4)
Coordinates:
  * scipy_dist         (scipy_dist) <U10 160B 'genextreme' ... 'expon'
  * id                 (id) object 16B '020602' '020802'
  * dparams            (dparams) <U5 80B 'c' 'skew' 'loc' 'scale'
    horizon            <U9 36B '1970-2025'
    name               (id) object 16B dask.array<chunksize=(2,), meta=np.ndarray>
Data variables:
    q_max_spring       (scipy_dist, id, dparams) float64 256B dask.array<chunksize=(1, 2, 4), meta=np.ndarray>
    volume_sum_spring  (scipy_dist, id, dparams) float64 256B dask.array<chunksize=(1, 2, 4), meta=np.ndarray>
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:

Criteria like AIC (Akaike Information Criterion), BIC (Bayesian Information Criterion), and AICC (Corrected AIC) are valuable tools for comparing the fit of different statistical models. These criteria balance the goodness-of-fit of a model with its complexity, helping to avoid overfitting. AIC and AICC are particularly useful when comparing models with different numbers of parameters, while BIC tends to penalize complexity more heavily, making it more conservative. Lower values of these criteria indicate better model performance, with AICC being especially helpful in small sample sizes. By using these metrics, we can objectively determine the most appropriate model for our data.
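As a rough illustration, the textbook definitions of these criteria can be written in a few lines. This is a sketch under assumptions: the function name information_criteria and the log-likelihood values are hypothetical, and xHydro's own implementation may differ in its details.

```python
import math


def information_criteria(log_likelihood, n_params, n_obs):
    """Return AIC, BIC and AICC from a maximized log-likelihood."""
    aic = 2 * n_params - 2 * log_likelihood
    bic = n_params * math.log(n_obs) - 2 * log_likelihood
    # Small-sample correction; only defined when n_obs > n_params + 1.
    aicc = aic + (2 * n_params * (n_params + 1)) / (n_obs - n_params - 1)
    return {"aic": aic, "bic": bic, "aicc": aicc}


# Hypothetical log-likelihoods for two candidate fits on 40 annual maxima.
gev = information_criteria(log_likelihood=-200.0, n_params=3, n_obs=40)
gumbel = information_criteria(log_likelihood=-201.5, n_params=2, n_obs=40)

for name, crit in (("genextreme", gev), ("gumbel_r", gumbel)):
    print(name, {key: round(value, 2) for key, value in crit.items()})
```

With these made-up numbers, AIC and AICC slightly favour the three-parameter fit while BIC favours the two-parameter one, which shows how the heavier BIC penalty can change the ranking.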

These three criteria can be accessed using xhydro.frequency_analysis.local.criteria. The results are added to a new criterion dimension. In this example, the AIC, BIC, and AICC all provide a weak indication that the Generalized Extreme Value (GEV) distribution is the best fit for the data, though the Gumbel distribution may also be a suitable choice. Conversely, the Pearson III failed to converge and the exponential distribution was rejected based on these criteria, suggesting that they do not adequately fit the data.

[17]:

help(xhfa.local.criteria)

Help on function criteria in module xhydro.frequency_analysis.local:

criteria(ds: xr.Dataset, p: xr.Dataset) -> xr.Dataset
    Compute information criteria (AIC, BIC, AICC) from fitted distributions, using the log-likelihood.

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing the yearly data that was fitted.
    p : xr.Dataset
        Dataset containing the parameters of the fitted distributions.
        Must have a dimension `dparams` containing the parameter names and a dimension `scipy_dist` containing the distribution names.

    Returns
    -------
    xr.Dataset
        Dataset containing the information criteria for the distributions.

[18]:

criteria = xhfa.local.criteria(ds_4fa[["q_max_spring", "volume_sum_spring"]], params)

criteria["q_max_spring"].isel(id=0).compute()

Finally, return levels for the chosen return periods can be computed using xhfa.local.parametric_quantiles.

[19]:

help(xhfa.local.parametric_quantiles)

Help on function parametric_quantiles in module xhydro.frequency_analysis.local:

parametric_quantiles(
    p: xr.Dataset,
    return_period: float | list[float],
    mode: str = 'max'
) -> xr.Dataset
    Compute quantiles from fitted distributions.

    Parameters
    ----------
    p : xr.Dataset
        Dataset containing the parameters of the fitted distributions.
        Must have a dimension `dparams` containing the parameter names and a dimension `scipy_dist` containing the distribution names.
    return_period : float or list of float
        Return period(s) in years.
    mode : {'max', 'min'}
        Whether the return period is the probability of exceedance (max) or non-exceedance (min).

    Returns
    -------
    xr.Dataset
        Dataset containing the quantiles of the distributions.

[20]:

rp = xhfa.local.parametric_quantiles(params, return_period=[2, 20, 100])

rp.load()
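
The link between a return period and the quantile that is evaluated can be sketched as follows. This is an illustration only: the helper names and the Gumbel parameters are hypothetical, and the conversion assumes the usual convention that a return period T corresponds to an exceedance probability of 1/T for maxima and a non-exceedance probability of 1/T for minima.

```python
import math


def non_exceedance_probability(return_period, mode="max"):
    """Convert a return period in years to a non-exceedance probability."""
    if return_period <= 1:
        raise ValueError("return_period must be greater than 1 year")
    if mode == "max":
        return 1.0 - 1.0 / return_period
    if mode == "min":
        return 1.0 / return_period
    raise ValueError("mode must be 'max' or 'min'")


def gumbel_quantile(probability, loc, scale):
    """Inverse CDF of the right-skewed Gumbel distribution."""
    return loc - scale * math.log(-math.log(probability))


# Hypothetical Gumbel parameters (illustrative only).
loc, scale = 120.0, 35.0

return_levels = {
    period: gumbel_quantile(non_exceedance_probability(period), loc, scale)
    for period in [2, 20, 100]
}

for period, level in return_levels.items():
    print(f"{period:>3}-year return level: {level:.1f}")
```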

In a future release, plotting will be managed by a dedicated function. For now, we demonstrate the process using preliminary utilities in this notebook.

The function xhfa.local._prepare_plots generates the data points required to visualize the results of the frequency analysis. If log=True, it returns log-spaced x-values between xmin and xmax. Meanwhile, xhfa.local._get_plotting_positions calculates plotting positions for all variables in the dataset. It accepts alpha and beta arguments; for typical values, refer to the SciPy documentation of scipy.stats.mstats.plotting_positions. By default, (0.4, 0.4) is used, which corresponds to the approximately quantile-unbiased method (Cunnane). Both functions are private (underscore-prefixed) and may change without notice.

[21]:

data = xhfa.local._prepare_plots(params, xmin=1, xmax=1000, npoints=50, log=True)
pp = xhfa.local._get_plotting_positions(ds_4fa[["q_max_spring"]])
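
For reference, the plotting-position formula documented for scipy.stats.mstats.plotting_positions is (i - alpha) / (n + 1 - alpha - beta), where i is the rank of an observation in ascending order. The sketch below applies it in plain Python. The helper names and the sample values are hypothetical, and the conversion of a probability to a return period as 1 / (1 - p) is an assumption about how the positions are placed on the x-axis, not a description of xHydro's internals.

```python
import math


def plotting_positions(values, alpha=0.4, beta=0.4):
    """Return (value, non-exceedance probability, return period), sorted ascending."""
    n = len(values)
    positions = []
    for rank, value in enumerate(sorted(values), start=1):
        probability = (rank - alpha) / (n + 1 - alpha - beta)
        positions.append((value, probability, 1.0 / (1.0 - probability)))
    return positions


def log_spaced(xmin, xmax, npoints):
    """Return npoints values evenly spaced on a log10 scale between xmin and xmax."""
    low, high = math.log10(xmin), math.log10(xmax)
    step = (high - low) / (npoints - 1)
    return [10 ** (low + i * step) for i in range(npoints)]


# Hypothetical annual maxima (illustrative only).
annual_maxima = [310.0, 185.0, 240.0, 420.0, 275.0]
positions = plotting_positions(annual_maxima)
x_values = log_spaced(1, 1000, 4)

for value, probability, period in positions:
    print(f"{value:6.1f}  p={probability:.3f}  T={period:.2f} years")
```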

[22]:

# Plot the distributions
p1 = data.q_max_spring.hvplot(
    x="return_period", by="scipy_dist", grid=True, groupby=["id"], logx=True
)

# Plot the observations
p2 = pp.hvplot.scatter(
    x="q_max_spring_pp",
    y="q_max_spring",
    grid=True,
    groupby=["id"],
    logx=True,
)

# And now combining the plots
p1 * p2

6.4. Uncertainties

Uncertainties are an important aspect of frequency analysis and should be considered when interpreting results. These uncertainties often stem from data quality, the choice of distribution, and the estimation of parameters. While visualizations can provide insights into the model fit, it’s crucial to quantify and account for uncertainties, such as confidence intervals for parameter estimates, to ensure robust conclusions.

To limit the computational cost, we will focus on a single catchment and on the two distributions that appeared to fit the data best.

[23]:

ds_4fa = ds_4fa.sel(id="020602")[["q_max_spring"]]
params = params.sel(id="020602", scipy_dist=["genextreme", "gumbel_r"])[
    ["q_max_spring"]
]

6.4.2. b) Bootstrapping the distributions

In this approach, rather than resampling the observations directly, we draw synthetic samples from the fitted distributions and refit the distributions to each of them. The spread of the refitted parameters gives an estimate of the uncertainty of the original fit. As with the previous example, we will perform a minimal number of bootstrap iterations to reduce computational load, but in practice, a higher number of iterations would provide more robust estimates of uncertainty.

This can be accomplished by calling xhydro.frequency_analysis.uncertainties.bootstrap_dist. Unlike bootstrap_obs, this method does not support lazy evaluation and requires a specific function for the fitting step: xhydro.frequency_analysis.uncertainties.fit_boot_dist.
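
The principle can be sketched with the standard library alone. This is not the xHydro implementation: the helper names and parameter values are hypothetical, a single Gumbel distribution is used, and the refit uses the method of moments as a simple stand-in for the maximum-likelihood fit.

```python
import math
import random
import statistics

EULER_GAMMA = 0.5772156649015329


def gumbel_quantile(probability, loc, scale):
    """Inverse CDF of the right-skewed Gumbel distribution."""
    return loc - scale * math.log(-math.log(probability))


def fit_gumbel_moments(sample):
    """Method-of-moments Gumbel fit (stand-in for maximum likelihood)."""
    scale = statistics.stdev(sample) * math.sqrt(6) / math.pi
    loc = statistics.fmean(sample) - EULER_GAMMA * scale
    return loc, scale


def draw_synthetic_records(loc, scale, n_obs, n_samples, rng):
    """Draw n_samples synthetic records of length n_obs from the fitted distribution."""
    # Clamp the uniform draw away from 0 so that log(-log(u)) stays finite.
    return [
        [gumbel_quantile(max(rng.random(), 1e-12), loc, scale) for _ in range(n_obs)]
        for _ in range(n_samples)
    ]


rng = random.Random(42)  # fixed seed for reproducibility

# Hypothetical fitted parameters and record length (illustrative only).
samples = draw_synthetic_records(120.0, 35.0, n_obs=50, n_samples=200, rng=rng)

# Refit each synthetic record and compute its 100-year return level.
levels = [gumbel_quantile(1 - 1 / 100, *fit_gumbel_moments(s)) for s in samples]

# 5th, 50th and 95th percentiles of the bootstrapped return levels.
cuts = statistics.quantiles(levels, n=20)
lower, median, upper = cuts[0], cuts[9], cuts[-1]
print(f"100-year level: {median:.1f} (90% interval: {lower:.1f} to {upper:.1f})")
```

The interval between the 5th and 95th percentiles plays the same role as the shaded bands in the comparison plot of section 6.4.3.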

[27]:

help(xhfa.uncertainties.bootstrap_dist)

Help on function bootstrap_dist in module xhydro.frequency_analysis.uncertainties:

bootstrap_dist(ds_obs: xr.Dataset, ds_params: xr.Dataset, *, n_samples: int) -> xr.Dataset
    Generate bootstrap samples from a fitted distribution.

    Parameters
    ----------
    ds_obs : xarray.Dataset
        The observed data.
    ds_params : xarray.Dataset
        The fitted distribution parameters.
    n_samples : int
        The number of bootstrap samples to generate.

    Returns
    -------
    xarray.Dataset
        Bootstrap samples with dimensions [samples, time].

    Notes
    -----
    This function does not support lazy evaluation.

[28]:

help(xhfa.uncertainties.fit_boot_dist)

Help on function fit_boot_dist in module xhydro.frequency_analysis.uncertainties:

fit_boot_dist(ds: xr.Dataset) -> xr.Dataset
    Fit distributions to bootstrap samples.

    Parameters
    ----------
    ds : xarray.Dataset
        The bootstrap samples to fit.

    Returns
    -------
    xarray.Dataset
        Fitted distribution parameters for each bootstrap sample.

[29]:

tmp_values = xhfa.uncertainties.bootstrap_dist(ds_4fa, params, n_samples=35)
params_boot_dist = xhfa.uncertainties.fit_boot_dist(tmp_values)
params_boot_dist

[29]:

<xarray.Dataset> Size: 2kB
Dimensions:       (scipy_dist: 2, samples: 35, dparams: 3)
Coordinates:
  * scipy_dist    (scipy_dist) <U10 80B 'genextreme' 'gumbel_r'
  * samples       (samples) int64 280B 0 1 2 3 4 5 6 7 ... 28 29 30 31 32 33 34
  * dparams       (dparams) <U5 60B 'c' 'loc' 'scale'
    horizon       <U9 36B '1970-2025'
    id            <U6 24B '020602'
    name          object 8B 'Dartmouth'
Data variables:
    q_max_spring  (scipy_dist, samples, dparams) float64 2kB dask.array<chunksize=(1, 35, 2), meta=np.ndarray>

[30]:

rp_boot_dist = xhfa.local.parametric_quantiles(
    params_boot_dist.load(), return_period=[2, 20, 100]
)  # Lazy computing is not supported
rp_boot_dist

6.4.3. c) Comparison

Let’s compare the results of the two approaches.

[31]:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()
fig.set_figheight(4)
fig.set_figwidth(15)

# Subset the data
rp_orig = rp.q_max_spring.sel(id="020602", scipy_dist="genextreme")
boot_obs = rp_boot_obs.q_max_spring.sel(scipy_dist="genextreme")
boot_dist = rp_boot_dist.q_max_spring.sel(scipy_dist="genextreme")

# Original fit
ax.plot(
    rp_orig.return_period.values,
    rp_orig,
    "black",
    label="Original fit",
)

# Bootstrapped observations: median and 5th-95th percentile band
ax.plot(
    boot_obs.return_period.values,
    boot_obs.quantile(0.5, "samples"),
    "red",
    label="Bootstrapped observations",
)
boot_obs_05 = boot_obs.quantile(0.05, "samples")
boot_obs_95 = boot_obs.quantile(0.95, "samples")
ax.fill_between(
    boot_obs.return_period.values, boot_obs_05, boot_obs_95, alpha=0.2, color="red"
)

# Bootstrapped distributions: median and 5th-95th percentile band
ax.plot(
    boot_dist.return_period.values,
    boot_dist.quantile(0.5, "samples"),
    "blue",
    label="Bootstrapped distribution",
)
boot_dist_05 = boot_dist.quantile(0.05, "samples")
boot_dist_95 = boot_dist.quantile(0.95, "samples")
ax.fill_between(
    boot_dist.return_period.values, boot_dist_05, boot_dist_95, alpha=0.2, color="blue"
)

plt.xscale("log")
plt.grid(visible=True)
plt.xlabel("Return period (years)")
plt.ylabel("Streamflow (m$^3$/s)")
ax.legend()

[31]:

../_images/notebooks_local_frequency_analysis_48_1.png
