4. Frequency analysis module

[1]:
# Basic imports
import hvplot.xarray
import numpy as np
import xarray as xr
import xdatasets as xd

import xhydro as xh
import xhydro.frequency_analysis as xhfa

4.1. Extracting and preparing the data

For this example, we'll conduct a frequency analysis using historical time series from various sites. We begin by obtaining a dataset comprising hydrological information. Here, we use the xdatasets library to acquire hydrological data from the Ministère de l'Environnement, de la Lutte contre les changements climatiques, de la Faune et des Parcs in Québec, Canada. Specifically, our query targets stations whose IDs begin with 020, that have a natural (unregulated) flow regime, and is limited to streamflow data.

Users may prefer to generate their own xarray.DataArray using their individual dataset. At a minimum, the xarray.DataArray used for frequency analysis must follow these principles (a minimal compliant example is sketched after this list):

  • The dataset needs a time dimension.

  • If there is a spatial dimension, such as id in the example below, it needs an attribute cf_role with timeseries_id as its value.

  • The variable will, at the very least, need a units attribute. Other attributes such as long_name and cell_methods are also expected by xclim (which is called at various points during the frequency analysis), and warnings will be generated if they are missing.
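
As a minimal sketch (with hypothetical station IDs and random data), a compliant dataset could be built from scratch as follows:

# A minimal, hypothetical dataset that satisfies the requirements above.
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2010-12-31", freq="D")
ids = ["A123", "B456"]  # hypothetical station IDs

ds_min = xr.Dataset(
    {"streamflow": (("id", "time"), np.random.rand(len(ids), len(time)))},
    coords={"id": ids, "time": time},
)
# Required: tag the spatial dimension as the timeseries identifier.
ds_min["id"].attrs["cf_role"] = "timeseries_id"
# Required: units. Recommended: long_name, cell_methods, etc.
ds_min["streamflow"].attrs = {
    "units": "m3 s-1",
    "long_name": "Streamflow",
    "cell_methods": "time: mean",
}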

[2]:
ds = (
    xd.Query(
        **{
            "datasets": {
                "deh": {
                    "id": ["020*"],
                    "regulated": ["Natural"],
                    "variables": ["streamflow"],
                }
            },
            "time": {"start": "1970-01-01", "minimum_duration": (15 * 365, "d")},
        }
    )
    .data.squeeze()
    .load()
)

# This dataset lacks some of the aforementioned attributes, so we need to add them.
ds["id"].attrs["cf_role"] = "timeseries_id"
ds["streamflow"].attrs = {
    "long_name": "Streamflow",
    "units": "m3 s-1",
    "standard_name": "water_volume_transport_in_river_channel",
    "cell_methods": "time: mean",
}

ds
[2]:
<xarray.Dataset> Size: 574kB
Dimensions:        (id: 5, time: 20454)
Coordinates: (12/15)
    drainage_area  (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date       (id) datetime64[ns] 40B 2006-10-13 2024-02-27 ... 1996-08-13
  * id             (id) object 40B '020302' '020404' '020502' '020602' '020802'
    latitude       (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude      (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name           (id) object 40B 'Saint' 'York' ... 'Dartmouth' 'Madeleine'
    ...             ...
    spatial_agg    <U9 36B 'watershed'
    start_date     (id) datetime64[ns] 40B 1989-08-12 1980-10-01 ... 1970-01-01
  * time           (time) datetime64[ns] 164kB 1970-01-01 ... 2025-12-31
    time_agg       <U4 16B 'mean'
    timestep       <U1 4B 'D'
    variable       <U10 40B 'streamflow'
Data variables:
    streamflow     (id, time) float32 409kB nan nan nan nan ... nan nan nan nan
[3]:
ds.streamflow.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[3]:
(Interactive plot: daily streamflow time series, one panel per station id.)

4.2. Customizing the analysis settings

4.2.1. a) Defining seasons

We can define seasons using indexers that are compatible with xclim.core.calendar.select_time. There are currently four accepted types of indexers:

  • month, followed by a sequence of month numbers.

  • season, followed by one or more of 'DJF', 'MAM', 'JJA', and 'SON'.

  • doy_bounds, followed by a sequence representing the inclusive bounds of the period to be considered ("start", "end").

  • date_bounds, which is the same as above, but using a month-day ('%m-%d') format.

For the purpose of getting block maxima through xhydro.indicators.get_yearly_op, the indexers need to be grouped within a dictionary, with the key being the label to be given to the requested period of the year. A second entry, freq, can be used to specify the resampling frequency, for example to correctly wrap around the calendar year for winter.

[4]:
# Some examples
timeargs = {
    "spring": {"date_bounds": ["02-11", "06-19"]},
    "summer": {"doy_bounds": [152, 243]},
    "fall": {"month": [9, 10, 11]},
    "winter": {
        "season": ["DJF"],
        "freq": "YS-DEC",
    },  # To correctly wrap around the year, we need to specify the resampling frequency.
    "august": {"month": [8]},
    "annual": {},
}
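
To preview what a given indexer selects on its own, it can be passed directly to xclim.core.calendar.select_time. A quick sketch using the spring bounds above (with the default drop=False, values outside the period are masked with NaN):

from xclim.core.calendar import select_time

# Keep only the spring window defined above; other dates become NaN.
spring = select_time(ds.streamflow, date_bounds=("02-11", "06-19"))
spring.sel(time="1980")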

4.2.2. b) Getting block maxima

Once the desired seasons have been defined, we can extract block maxima time series for every station using xhydro.indicators.get_yearly_op. The main arguments are:

  • op: the operation to compute. One of "max", "min", "mean", or "sum".

  • input_var: the name of the variable. Defaults to "streamflow".

  • window: the size of the rolling window. A "mean" is performed on the rolling window prior to the op operation.

  • timeargs: as defined previously. Leave as None to get the annual maxima.

  • missing and missing_options: to define tolerances for missing data. See the xclim documentation on missing values for more information.

  • interpolate_na: whether to interpolate missing data prior to the op operation. Only used for sum.

The function returns an xarray.Dataset with one variable per indexer, named following the pattern {input_var}_{op}_{label} (e.g. streamflow_max_spring).

[5]:
# Here, we mask out years with more than 15% of missing data.
ds_4fa = xh.indicators.get_yearly_op(
    ds, op="max", timeargs=timeargs, missing="pct", missing_options={"tolerance": 0.15}
)

ds_4fa
[5]:
<xarray.Dataset> Size: 8kB
Dimensions:                (id: 5, time: 56)
Coordinates: (12/15)
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                     (id) object 40B '020302' '020404' ... '020802'
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                   (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                     ...
    spatial_agg            <U9 36B 'watershed'
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * time                   (time) datetime64[ns] 448B 1970-01-01 ... 2025-01-01
Data variables:
    streamflow_max_spring  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_summer  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_fall    (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_august  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_annual  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_winter  (id, time) float32 1kB nan nan nan ... nan nan nan
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                
[6]:
ds_4fa.streamflow_max_summer.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[6]:
(Interactive plot: yearly summer maxima, streamflow_max_summer, per station id.)

4.2.3. c) Using custom seasons per year or per station

Using individualized date ranges for each year or each catchment is not explicitly supported, so users should instead mask their data prior to calling get_yearly_op. Note that when doing this, missing should be adjusted accordingly.

[7]:
# Create a mask beforehand

nyears = np.unique(ds.time.dt.year).size
dom_start = xr.DataArray(
    np.random.randint(1, 30, size=(nyears,)).astype("str"),
    dims=("year"),
    coords={"year": np.unique(ds.time.dt.year)},
)
dom_end = xr.DataArray(
    np.random.randint(1, 30, size=(nyears,)).astype("str"),
    dims=("year"),
    coords={"year": np.unique(ds.time.dt.year)},
)

mask = xr.zeros_like(ds["streamflow"])
for y in dom_start.year.values:
    # Random mask of dates per year, between April and June.
    mask.loc[
        {
            "time": slice(
                str(y) + "-04-" + str(dom_start.sel(year=y).item()),
                str(y) + "-06-" + str(dom_end.sel(year=y).item()),
            )
        }
    ] = 1
[8]:
mask.hvplot(x="time", grid=True, widget_location="bottom", groupby="id")
[8]:
(Interactive plot: the random yearly mask per station id.)
[9]:
# The name of the indexer will be used to identify the variable created here
timeargs_custom = {"custom": {}}

# We use where() to mask the data that we want to ignore
masked = ds.where(mask == 1)
# Since we masked almost all of the year, our tolerance for missing data should be changed accordingly
missing = "at_least_n"
missing_options = {"n": 45}

# We use xr.merge() to combine the results with the previous dataset.
ds_4fa = xr.merge(
    [
        ds_4fa,
        xh.indicators.get_yearly_op(
            masked,
            op="max",
            timeargs=timeargs_custom,
            missing=missing,
            missing_options=missing_options,
        ),
    ]
)
[10]:
ds_4fa.streamflow_max_custom.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[10]:
(Interactive plot: yearly maxima over the custom masked periods, streamflow_max_custom, per station id.)

4.2.4. d) Computing volumes

The frequency analysis can also be performed on volumes, using a similar workflow. The main difference is that if we're starting from streamflow, we first need to convert it into volumes using xhydro.indicators.compute_volume (i.e., integrating the flow over each time step). Also, if required, get_yearly_op has an argument interpolate_na that can be used to interpolate missing data prior to the sum.

[11]:
# Get a daily volume from a daily streamflow
ds["volume"] = xh.indicators.compute_volume(ds["streamflow"], out_units="hm3")

# We'll take slightly different indexers
timeargs_vol = {"spring": {"date_bounds": ["04-30", "06-15"]}, "annual": {}}

# The operation that we want here is the sum, not the max.
ds_4fa = xr.merge(
    [
        ds_4fa,
        xh.indicators.get_yearly_op(
            ds,
            op="sum",
            input_var="volume",
            timeargs=timeargs_vol,
            missing="pct",
            missing_options={"tolerance": 0.15},
            interpolate_na=True,
        ),
    ]
)
[12]:
ds_4fa.volume_sum_spring.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[12]:
(Interactive plot: spring volumes, volume_sum_spring, per station id.)

4.3. Local frequency analysis

Once we have our yearly maxima (or volumes/minima), the first step in a local frequency analysis is to call xhfa.local.fit to obtain distribution parameters. The options are:

  • distributions: a list of SciPy distribution names. Defaults to: ["expon", "gamma", "genextreme", "genpareto", "gumbel_r", "pearson3", "weibull_min"]. A sketch restricting this list is shown right after these options.

  • min_years: the minimum number of years required to fit the data.

  • method: the fitting method. Defaults to maximum likelihood.
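
For instance, a minimal sketch restricting the fit to two candidate distributions (the call otherwise mirrors the cell below):

# A sketch: fit only two candidate distributions instead of the default seven.
params_subset = xhfa.local.fit(
    ds_4fa[["streamflow_max_spring"]],
    distributions=["gumbel_r", "pearson3"],
    min_years=15,
)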

[13]:
# To speed up the Notebook, we'll only perform the analysis on a subset of variables
params = xhfa.local.fit(
    ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], min_years=15
)

params
[13]:
<xarray.Dataset> Size: 4kB
Dimensions:                (id: 5, dparams: 5, scipy_dist: 7)
Coordinates: (12/16)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * dparams                (dparams) <U5 100B 'a' 'c' 'skew' 'loc' 'scale'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    ...                     ...
    source                 <U102 408B 'Ministère de l’Environnement, de la Lu...
    spatial_agg            <U9 36B 'watershed'
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
Data variables:
    streamflow_max_spring  (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray>
    volume_sum_spring      (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray>
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

Information criteria such as the AIC, BIC, and AICC are useful for determining which statistical distribution is best suited to a given location. These three criteria can be computed using xhfa.local.criteria.

[14]:
criteria = xhfa.local.criteria(
    ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], params
)

criteria
[14]:
<xarray.Dataset> Size: 3kB
Dimensions:                (id: 5, scipy_dist: 7, criterion: 3)
Coordinates: (12/16)
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                     (id) object 40B '020302' '020404' ... '020802'
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                   (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
  * criterion              (criterion) <U4 48B 'aic' 'bic' 'aicc'
Data variables:
    streamflow_max_spring  (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray>
    volume_sum_spring      (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray>
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                
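
As a sketch, the distribution that minimizes a given criterion (here the AICC; lower is better) can be extracted per station with plain xarray:

# Find the distribution with the lowest AICC for each station.
best_dist = (
    criteria["streamflow_max_spring"].sel(criterion="aicc").idxmin(dim="scipy_dist")
)
best_dist.load()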

Finally, return periods can be obtained using xhfa.local.parametric_quantiles. The options are:

  • t: the return period(s) in years.

  • mode: whether the return period corresponds to a probability of exceedance ("max", e.g. for flood peaks) or non-exceedance ("min", e.g. for low flows). Defaults to "max".
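
Note that a return period T is converted internally into a quantile: for mode="max", the non-exceedance probability is p = 1 - 1/T (e.g. T = 20 gives p = 0.95 and T = 100 gives p = 0.99, as shown by the p_quantile coordinate in the output below), while for mode="min" it is p = 1/T.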

[15]:
rp = xhfa.local.parametric_quantiles(params, t=[20, 100])

rp.load()
[15]:
<xarray.Dataset> Size: 2kB
Dimensions:                (id: 5, scipy_dist: 7, return_period: 2)
Coordinates: (12/17)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * return_period          (return_period) int64 16B 20 100
    p_quantile             (return_period) float64 16B 0.95 0.99
Data variables:
    streamflow_max_spring  (scipy_dist, return_period, id) float64 560B nan ....
    volume_sum_spring      (scipy_dist, return_period, id) float64 560B 751.2...
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

In a future release, plotting will be handled by a proper function. For now, we’ll show an example in this Notebook using preliminary utilities.

xhfa.local._prepare_plots generates the data points required to plot the results of the frequency analysis. If log=True, it returns log-spaced x values between xmin and xmax.

[16]:
data = xhfa.local._prepare_plots(params, xmin=1, xmax=1000, npoints=50, log=True)
data.load()
[16]:
<xarray.Dataset> Size: 30kB
Dimensions:                (id: 5, scipy_dist: 7, return_period: 50)
Coordinates: (12/17)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * return_period          (return_period) float64 400B 1.0 1.151 ... 1e+03
    p_quantile             (return_period) float64 400B 0.0 0.1315 ... 0.999
Data variables:
    streamflow_max_spring  (scipy_dist, return_period, id) float64 14kB nan ....
    volume_sum_spring      (scipy_dist, return_period, id) float64 14kB 130.9...
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

xhfa.local._get_plotting_positions allows you to get plotting positions for all variables in the dataset. It accepts alpha and beta arguments; see the SciPy documentation (e.g. scipy.stats.mstats.plotting_positions) for typical values. By default, (0.4, 0.4) is used, which corresponds to the approximately quantile-unbiased Cunnane positions, i.e. p = (i - alpha) / (n + 1 - alpha - beta) for the i-th smallest of n values. The positions are returned as return periods, which allows overlaying them with the fitted curves in the final plot below.

[17]:
pp = xhfa.local._get_plotting_positions(ds_4fa[["streamflow_max_spring"]])
pp
[17]:
<xarray.Dataset> Size: 5kB
Dimensions:                   (id: 5, time: 56)
Coordinates: (12/15)
    drainage_area             (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date                  (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                        (id) object 40B '020302' '020404' ... '020802'
    latitude                  (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude                 (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                      (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                        ...
    spatial_agg               <U9 36B 'watershed'
    start_date                (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg                  <U4 16B 'mean'
    timestep                  <U1 4B 'D'
    variable                  <U10 40B 'streamflow'
  * time                      (time) datetime64[ns] 448B 1970-01-01 ... 2025-...
Data variables:
    streamflow_max_spring_pp  (id, time) float64 2kB nan nan nan ... nan nan nan
    streamflow_max_spring     (id, time) float32 1kB nan nan nan ... nan nan nan
[18]:
# Let's plot the fitted distributions
p1 = data.streamflow_max_spring.hvplot(
    x="return_period", by="scipy_dist", grid=True, groupby=["id"], logx=True
)
[19]:
# Let's now plot the observations at their plotting positions
p2 = pp.hvplot.scatter(
    x="streamflow_max_spring_pp",
    y="streamflow_max_spring",
    grid=True,
    groupby=["id"],
    logx=True,
)
[20]:
# And now combining the plots
p1 * p2
[20]:
(Interactive plot: fitted distribution curves overlaid with the observations at their plotting positions.)