4. Frequency analysis module

[1]:
# Basic imports
import hvplot.xarray
import numpy as np
import xarray as xr
import xdatasets as xd

import xhydro as xh
import xhydro.frequency_analysis as xhfa

4.1. Data extraction and preparation

For this example, we’ll conduct a frequency analysis using historical time series from various sites. We begin by obtaining a dataset comprising hydrological information. Here, we use the xdatasets library to acquire hydrological data from the Ministère de l’Environnement, de la Lutte contre les changements climatiques, de la Faune et des Parcs in Québec, Canada. Specifically, our query targets stations whose IDs begin with 020 and that have a natural flow regime, and it is limited to streamflow data.

Users may prefer to generate their own xarray.DataArray from their own dataset. At a minimum, the xarray.DataArray used for the frequency analysis must follow these principles (a minimal example is sketched after this list):

  • The dataset needs a time dimension.

  • If there is a spatial dimension, such as id in the example below, it needs a cf_role attribute with timeseries_id as its value.

  • The variable will need at least a units attribute, although other attributes such as long_name and cell_methods are also expected by xclim (which is called at various points during the frequency analysis), and warnings will be raised if they are missing.

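For instance, a minimal compliant DataArray could be built as follows. This is only an illustrative sketch: the station IDs and values below are made up, and only the structure and attributes listed above matter.

# Minimal sketch of a DataArray that satisfies the requirements above
import numpy as np
import pandas as pd
import xarray as xr

time = pd.date_range("2000-01-01", "2001-12-31", freq="D")
da = xr.DataArray(
    np.random.rand(2, time.size),
    dims=("id", "time"),
    coords={"id": ["A", "B"], "time": time},
    name="streamflow",
    attrs={
        "units": "m3 s-1",  # required
        "long_name": "Streamflow",  # expected by xclim
        "cell_methods": "time: mean",  # expected by xclim
    },
)
da["id"].attrs["cf_role"] = "timeseries_id"  # flags the spatial dimension
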
[2]:
ds = (
    xd.Query(
        **{
            "datasets": {
                "deh": {
                    "id": ["020*"],
                    "regulated": ["Natural"],
                    "variables": ["streamflow"],
                }
            },
            "time": {"start": "1970-01-01", "minimum_duration": (15 * 365, "d")},
        }
    )
    .data.squeeze()
    .load()
)

# This dataset lacks some of the aforementioned attributes, so we need to add them.
ds["id"].attrs["cf_role"] = "timeseries_id"
ds["streamflow"].attrs = {
    "long_name": "Streamflow",
    "units": "m3 s-1",
    "standard_name": "water_volume_transport_in_river_channel",
    "cell_methods": "time: mean",
}

ds
[2]:
<xarray.Dataset> Size: 574kB
Dimensions:        (id: 5, time: 20454)
Coordinates: (12/15)
    drainage_area  (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date       (id) datetime64[ns] 40B 2006-10-13 2024-02-27 ... 1996-08-13
  * id             (id) object 40B '020302' '020404' '020502' '020602' '020802'
    latitude       (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude      (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name           (id) object 40B 'Saint' 'York' ... 'Dartmouth' 'Madeleine'
    ...             ...
    spatial_agg    <U9 36B 'watershed'
    start_date     (id) datetime64[ns] 40B 1989-08-12 1980-10-01 ... 1970-01-01
  * time           (time) datetime64[ns] 164kB 1970-01-01 ... 2025-12-31
    time_agg       <U4 16B 'mean'
    timestep       <U1 4B 'D'
    variable       <U10 40B 'streamflow'
Data variables:
    streamflow     (id, time) float32 409kB nan nan nan nan ... nan nan nan nan
[3]:
ds.streamflow.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[3]:
(interactive hvplot output: daily streamflow per station)

4.2. Customizing the analysis parameters

4.2.1. a) Defining seasons

We can define seasons using indexers that are compatible with xclim.core.calendar.select_time. Four types of indexers are currently accepted:

  • month, followed by a sequence of month numbers.

  • season, followed by one or more of 'DJF', 'MAM', 'JJA', and 'SON'.

  • doy_bounds, followed by a sequence representing the inclusive bounds of the period to be considered ("start", "end").

  • date_bounds, which is the same as above, but using a month-day ('%m-%d') format.

To extract block maxima with xhydro.indicators.get_yearly_op, the indexers must be grouped in a dictionary, the key being the label to give to the requested period of the year. A second key can be used to give instructions on the resampling frequency, for example to wrap around the year for winter.

[4]:
# Some examples
timeargs = {
    "spring": {"date_bounds": ["02-11", "06-19"]},
    "summer": {"doy_bounds": [152, 243]},
    "fall": {"month": [9, 10, 11]},
    "winter": {
        "season": ["DJF"],
        "freq": "YS-DEC",
    },  # To correctly wrap around the year, we need to specify the resampling frequency.
    "august": {"month": [8]},
    "annual": {},
}
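
To see what these indexers select before any aggregation, they can also be passed directly to xclim.core.calendar.select_time. A quick sketch (xclim is a dependency of xhydro, so it is already installed):

# Keep only the dates covered by the "spring" indexer above;
# drop=True removes all timesteps outside February 11 - June 19.
from xclim.core.calendar import select_time

spring_flow = select_time(ds.streamflow, drop=True, date_bounds=("02-11", "06-19"))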

4.2.2. b) Extracting block maxima

Once each desired season has been selected, we can extract the block maxima time series for each station using xhydro.indicators.get_yearly_op. The main arguments are:

  • op: the operation to compute. One of "max", "min", "mean", or "sum".

  • input_var: the name of the variable. Defaults to "streamflow".

  • window: the size of the rolling window. A "mean" is performed on the rolling window prior to the op operation.

  • timeargs: as defined previously. Leave as None to get annual maxima.

  • missing and missing_options: to define tolerances for missing data. See this page for more information.

  • interpolate_na: whether to interpolate missing data prior to the op operation. Only used for sum.

The function returns an xarray.Dataset with one variable per indexer, named after the input variable, the operation, and the indexer (e.g. streamflow_max_spring).

[5]:
# Here, we mask out years with more than 15% of missing data.
ds_4fa = xh.indicators.get_yearly_op(
    ds, op="max", timeargs=timeargs, missing="pct", missing_options={"tolerance": 0.15}
)

ds_4fa
[5]:
<xarray.Dataset> Size: 8kB
Dimensions:                (id: 5, time: 56)
Coordinates: (12/15)
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                     (id) object 40B '020302' '020404' ... '020802'
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                   (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                     ...
    spatial_agg            <U9 36B 'watershed'
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * time                   (time) datetime64[ns] 448B 1970-01-01 ... 2025-01-01
Data variables:
    streamflow_max_spring  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_summer  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_fall    (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_august  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_annual  (id, time) float32 1kB nan nan nan ... nan nan nan
    streamflow_max_winter  (id, time) float32 1kB nan nan nan ... nan nan nan
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                
[6]:
ds_4fa.streamflow_max_summer.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[6]:
(interactive hvplot output: summer block maxima per station)

4.2.3. c) Using custom seasons per year or per station

Using individualized date ranges for each year or each catchment is not explicitly supported, so users should instead mask their data prior to calling get_yearly_op. Note that when doing this, missing should be adjusted accordingly.

[7]:
# Create a mask beforehand: one random start day (in April) and
# one random end day (in June) per year
nyears = np.unique(ds.time.dt.year).size
dom_start = xr.DataArray(
    np.random.randint(1, 30, size=(nyears,)).astype("str"),
    dims="year",
    coords={"year": np.unique(ds.time.dt.year)},
)
dom_end = xr.DataArray(
    np.random.randint(1, 30, size=(nyears,)).astype("str"),
    dims="year",
    coords={"year": np.unique(ds.time.dt.year)},
)

mask = xr.zeros_like(ds["streamflow"])
for y in dom_start.year.values:
    # Random mask of dates per year, between April and June.
    mask.loc[
        {
            "time": slice(
                str(y) + "-04-" + str(dom_start.sel(year=y).item()),
                str(y) + "-06-" + str(dom_end.sel(year=y).item()),
            )
        }
    ] = 1
[8]:
mask.hvplot(x="time", grid=True, widget_location="bottom", groupby="id")
[8]:
(interactive hvplot output: the random yearly mask per station)
[9]:
# The name of the indexer will be used to identify the variable created here
timeargs_custom = {"custom": {}}

# We use where() to mask the data that we want to ignore
masked = ds.where(mask == 1)
# Since we masked almost all of the year, our tolerance for missing data should be changed accordingly
missing = "at_least_n"
missing_options = {"n": 45}

# We use xr.merge() to combine the results with the previous dataset.
ds_4fa = xr.merge(
    [
        ds_4fa,
        xh.indicators.get_yearly_op(
            masked,
            op="max",
            timeargs=timeargs_custom,
            missing=missing,
            missing_options=missing_options,
        ),
    ]
)
[10]:
ds_4fa.streamflow_max_custom.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[10]:
(interactive hvplot output: custom-season block maxima per station)

4.2.4. d) Computing volumes

Frequency analysis can also be performed on volumes, using a similar workflow. The main difference is that if we start from streamflow, we first need to convert it into volume using xhydro.indicators.compute_volume. Additionally, if required, get_yearly_op has an interpolate_na argument that can be used to interpolate missing data before the sum.

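For daily data, the conversion amounts to simple arithmetic: the mean daily flow is multiplied by the 86,400 seconds in a day, then converted to the requested units. As a back-of-the-envelope check:

# 1 m3/s sustained over one day = 86 400 m3; 1 hm3 = 1e6 m3
q_m3s = 100.0  # hypothetical mean daily flow
volume_hm3 = q_m3s * 86400 / 1e6  # -> 8.64 hm3
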
[11]:
# Get a daily volume from a daily streamflow
ds["volume"] = xh.indicators.compute_volume(ds["streamflow"], out_units="hm3")

# We'll take slightly different indexers
timeargs_vol = {"spring": {"date_bounds": ["04-30", "06-15"]}, "annual": {}}

# The operation that we want here is the sum, not the max.
ds_4fa = xr.merge(
    [
        ds_4fa,
        xh.indicators.get_yearly_op(
            ds,
            op="sum",
            input_var="volume",
            timeargs=timeargs_vol,
            missing="pct",
            missing_options={"tolerance": 0.15},
            interpolate_na=True,
        ),
    ]
)
[12]:
ds_4fa.volume_sum_spring.dropna("time", how="all").hvplot(
    x="time", grid=True, widget_location="bottom", groupby="id"
)
[12]:
(interactive hvplot output: spring volumes per station)

4.3. Local frequency analysis

Once we have our annual maxima (or volumes/minima), the first step of a local frequency analysis is to call xhfa.local.fit to obtain the distribution parameters. The options are:

  • distributions: a list of SciPy distributions. Defaults to: ["expon", "gamma", "genextreme", "genpareto", "gumbel_r", "pearson3", "weibull_min"].

  • min_years: the minimum number of years required to fit the data.

  • method: the fitting method. Defaults to maximum likelihood.

[13]:
# To speed up the Notebook, we'll only perform the analysis on a subset of variables
params = xhfa.local.fit(
    ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], min_years=15
)

params
[13]:
<xarray.Dataset> Size: 4kB
Dimensions:                (id: 5, dparams: 5, scipy_dist: 7)
Coordinates: (12/16)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * dparams                (dparams) <U5 100B 'a' 'c' 'skew' 'loc' 'scale'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    ...                     ...
    source                 <U102 408B 'Ministère de l’Environnement, de la Lu...
    spatial_agg            <U9 36B 'watershed'
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
Data variables:
    streamflow_max_spring  (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray>
    volume_sum_spring      (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray>
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

Information criteria such as the AIC, BIC, and AICC are useful for determining which statistical distribution is best suited to a given location. These three criteria can be computed using xhfa.local.criteria.

[14]:
criteria = xhfa.local.criteria(
    ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], params
)

criteria
[14]:
<xarray.Dataset> Size: 3kB
Dimensions:                (id: 5, scipy_dist: 7, criterion: 3)
Coordinates: (12/16)
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                     (id) object 40B '020302' '020404' ... '020802'
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                   (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
  * criterion              (criterion) <U4 48B 'aic' 'bic' 'aicc'
Data variables:
    streamflow_max_spring  (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray>
    volume_sum_spring      (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray>
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                
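
These criteria can then be used to rank the candidate distributions. As a minimal sketch using plain xarray operations (not an xhydro function), the best distribution per station according to the AICC is the one with the lowest score:

# Lower is better for all three criteria; idxmin() returns, for each station,
# the name of the distribution with the lowest AICC.
best = criteria["streamflow_max_spring"].sel(criterion="aicc").idxmin("scipy_dist")
best.load()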

Finally, return periods can be obtained using xhfa.local.parametric_quantiles. The options are:

  • t: the return period(s) in years.

  • mode: whether the return period is the probability of exceedance ("max") or non-exceedance ("min"). Defaults to "max".

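For mode="max", a return period T maps to the non-exceedance probability p = 1 - 1/T, which is the p_quantile coordinate visible in the output below:

# T = 20 and 100 years -> p = 0.95 and 0.99
p_quantile = 1 - 1 / np.array([20, 100])
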
[15]:
rp = xhfa.local.parametric_quantiles(params, t=[20, 100])

rp.load()
[15]:
<xarray.Dataset> Size: 2kB
Dimensions:                (id: 5, scipy_dist: 7, return_period: 2)
Coordinates: (12/17)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * return_period          (return_period) int64 16B 20 100
    p_quantile             (return_period) float64 16B 0.95 0.99
Data variables:
    streamflow_max_spring  (scipy_dist, return_period, id) float64 560B nan ....
    volume_sum_spring      (scipy_dist, return_period, id) float64 560B 751.2...
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

In a future release, plotting will be handled by a proper function. For now, we’ll show an example in this Notebook using preliminary utilities.

xhfa.local._prepare_plots generates datapoints required to plot the results of the frequency analysis. If log=True, it will return log-spaced x values between xmin and xmax.

[16]:
data = xhfa.local._prepare_plots(params, xmin=1, xmax=1000, npoints=50, log=True)
data.load()
[16]:
<xarray.Dataset> Size: 30kB
Dimensions:                (id: 5, scipy_dist: 7, return_period: 50)
Coordinates: (12/17)
  * id                     (id) object 40B '020302' '020404' ... '020802'
  * scipy_dist             (scipy_dist) <U11 308B 'expon' ... 'weibull_min'
    drainage_area          (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date               (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
    latitude               (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude              (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    ...                     ...
    start_date             (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg               <U4 16B 'mean'
    timestep               <U1 4B 'D'
    variable               <U10 40B 'streamflow'
  * return_period          (return_period) float64 400B 1.0 1.151 ... 1e+03
    p_quantile             (return_period) float64 400B 0.0 0.1315 ... 0.999
Data variables:
    streamflow_max_spring  (scipy_dist, return_period, id) float64 14kB nan ....
    volume_sum_spring      (scipy_dist, return_period, id) float64 14kB 130.9...
Attributes:
    cat:frequency:         yr
    cat:processing_level:  indicators
    cat:id:                

xhfa.local._get_plotting_positions allows you to get plotting positions for all variables in the dataset. It accepts alpha and beta arguments. See the SciPy documentation for typical values. By default, (0.4, 0.4) is used, which corresponds to the approximately quantile-unbiased Cunnane estimator.

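For reference, the classic plotting-position formula behind these defaults assigns the i-th smallest of n values the non-exceedance probability p_i = (i - alpha) / (n + 1 - alpha - beta). A minimal sketch that mirrors (but does not call) the xhydro utility, using made-up values:

# Cunnane plotting positions (alpha = beta = 0.4) for hypothetical annual maxima
x = np.array([312.0, 287.0, 401.0, 356.0])
alpha = beta = 0.4
ranks = x.argsort().argsort() + 1  # rank of each value (1 = smallest)
p = (ranks - alpha) / (x.size + 1 - alpha - beta)  # non-exceedance probability
T = 1 / (1 - p)  # corresponding return period
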
[17]:
pp = xhfa.local._get_plotting_positions(ds_4fa[["streamflow_max_spring"]])
pp
[17]:
<xarray.Dataset> Size: 5kB
Dimensions:                   (id: 5, time: 56)
Coordinates: (12/15)
    drainage_area             (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03
    end_date                  (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13
  * id                        (id) object 40B '020302' '020404' ... '020802'
    latitude                  (id) float32 20B 48.77 48.81 48.98 48.98 49.2
    longitude                 (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29
    name                      (id) object 40B 'Saint' 'York' ... 'Madeleine'
    ...                        ...
    spatial_agg               <U9 36B 'watershed'
    start_date                (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01
    time_agg                  <U4 16B 'mean'
    timestep                  <U1 4B 'D'
    variable                  <U10 40B 'streamflow'
  * time                      (time) datetime64[ns] 448B 1970-01-01 ... 2025-...
Data variables:
    streamflow_max_spring_pp  (id, time) float64 2kB nan nan nan ... nan nan nan
    streamflow_max_spring     (id, time) float32 1kB nan nan nan ... nan nan nan
[18]:
# Let's plot the fitted distributions
p1 = data.streamflow_max_spring.hvplot(
    x="return_period", by="scipy_dist", grid=True, groupby=["id"], logx=True
)
[19]:
# Let's now plot the observations at their plotting positions
p2 = pp.hvplot.scatter(
    x="streamflow_max_spring_pp",
    y="streamflow_max_spring",
    grid=True,
    groupby=["id"],
    logx=True,
)
[20]:
# And now combining the plots
p1 * p2
[20]:
(interactive hvplot output: fitted distribution curves overlaid with the observations)