4. Module d’analyse fréquentielle¶
[1]:
# Basic imports
import hvplot.xarray
import numpy as np
import xarray as xr
import xdatasets as xd
import xhydro as xh
import xhydro.frequency_analysis as xhfa
Redefining 'percent' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining '%' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'year' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'yr' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'C' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'd' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'h' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'degrees_north' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'degrees_east' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining 'degrees' (<class 'pint.delegates.txt_defparser.plain.UnitDefinition'>)
Redefining '[speed]' (<class 'pint.delegates.txt_defparser.plain.DerivedDimensionDefinition'>)
4.1. Extraction et préparation des données¶
For this example, we’ll conduct a frequency analysis using historical time series from various sites. We begin by obtaining a dataset comprising hydrological information. Here, we use the xdataset library to acquire hydrological data from the Ministère de l’Environnement, de la Lutte contre les changements climatiques, de la Faune et des
Parcs in Québec, Canada. Specifically, our query focuses on stations with IDs beginning with 020
, possessing a natural flow pattern and limited to streamflow data.
Les utilisateurs peuvent préférer générer leur propre xarray.DataArray
en utilisant leur propre jeu de données. Au minimum, le xarray.DataArray
utilisé pour l’analyse fréquentielle doit suivre les principes suivants :
L’ensemble de données a besoin d’une dimension
time
.S’il y a une dimension spatiale, comme
id
dans l’exemple ci-dessous, il faut un attributcf_role
avectimeseries_id
comme valeur.La variable aura au moins besoin d’un attribut
units
, bien que d’autres attributs tels quelong_name
etcell_methods
soient également attendus parxclim
(qui est appelé à différents moments de l’analyse fréquentielle) et des avertissements seront générés s’ils sont manquants.
[2]:
ds = (
xd.Query(
**{
"datasets": {
"deh": {
"id": ["020*"],
"regulated": ["Natural"],
"variables": ["streamflow"],
}
},
"time": {"start": "1970-01-01", "minimum_duration": (15 * 365, "d")},
}
)
.data.squeeze()
.load()
)
# This dataset lacks some of the aforementioned attributes, so we need to add them.
ds["id"].attrs["cf_role"] = "timeseries_id"
ds["streamflow"].attrs = {
"long_name": "Streamflow",
"units": "m3 s-1",
"standard_name": "water_volume_transport_in_river_channel",
"cell_methods": "time: mean",
}
ds
[2]:
<xarray.Dataset> Size: 574kB Dimensions: (id: 5, time: 20454) Coordinates: (12/15) drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 2024-02-27 ... 1996-08-13 * id (id) object 40B '020302' '020404' '020502' '020602' '020802' latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 name (id) object 40B 'Saint' 'York' ... 'Dartmouth' 'Madeleine' ... ... spatial_agg <U9 36B 'watershed' start_date (id) datetime64[ns] 40B 1989-08-12 1980-10-01 ... 1970-01-01 * time (time) datetime64[ns] 164kB 1970-01-01 ... 2025-12-31 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' Data variables: streamflow (id, time) float32 409kB nan nan nan nan ... nan nan nan nan
[3]:
ds.streamflow.dropna("time", how="all").hvplot(
x="time", grid=True, widget_location="bottom", groupby="id"
)
[3]:
4.2. Personnalisation des paramètres d’analyse¶
4.2.1. a) Définition des saisons¶
Nous pouvons définir des saisons en utilisant des indexeurs compatibles avec xclim.core.calendar.select_time
. Il y a actuellement quatre types d’indexeurs acceptés :
month
, suivi d’une séquence de numéros de mois.season
, followed by one or more of'DJF'
,'MAM'
,'JJA'
, and'SON'
.doy_bounds
, followed by a sequence representing the inclusive bounds of the period to be considered ("start"
,"end"
).date_bounds
, which is the same as above, but using a month-day ('%m-%d'
) format.
Pour obtenir les maxima de blocs par xhydro.indicators.get_yearly_op
, les indexeurs doivent être regroupés dans un dictionnaire, la clé étant l’étiquette à donner à la période de l’année demandée. Une deuxième clé peut être utilisée pour donner des instructions sur la fréquence de rééchantillonnage, par exemple pour envelopper l’année pour l’hiver.
[4]:
# Some examples
timeargs = {
"spring": {"date_bounds": ["02-11", "06-19"]},
"summer": {"doy_bounds": [152, 243]},
"fall": {"month": [9, 10, 11]},
"winter": {
"season": ["DJF"],
"freq": "YS-DEC",
}, # To correctly wrap around the year, we need to specify the resampling frequency.
"august": {"month": [8]},
"annual": {},
}
4.2.2. b) Obtenir des maxima de bloc¶
Après avoir sélectionné chaque saison souhaitée, nous pouvons extraire les séries temporelles de maxima de blocs de chaque station en utilisant xhydro.indicators.get_yearly_op
. Les principaux arguments sont :
op
: the operation to compute. One of"max"
,"min"
,"mean"
, or"sum"
.input_var
: the name of the variable. Defaults to"streamflow"
.window
: the size of the rolling window. A"mean"
is performed on the rolling window prior to theop
operation.timeargs
: comme défini précédemment. Laisser àNone
pour obtenir les maxima annuels.missing
etmissing_options
: pour définir les tolérances pour les données manquantes. Voir cette page pour plus d’informations.interpolate_na
: s’il faut interpoler les données manquantes avant l’opérationop
. Uniquement utilisé poursum
.
La fonction renvoie un xarray.Dataset
avec 1 variable par indexeur.
[5]:
# Here, we hide years with more than 15% of missing data.
ds_4fa = xh.indicators.get_yearly_op(
ds, op="max", timeargs=timeargs, missing="pct", missing_options={"tolerance": 0.15}
)
ds_4fa
[5]:
<xarray.Dataset> Size: 8kB Dimensions: (id: 5, time: 56) Coordinates: (12/15) drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 * id (id) object 40B '020302' '020404' ... '020802' latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 name (id) object 40B 'Saint' 'York' ... 'Madeleine' ... ... spatial_agg <U9 36B 'watershed' start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' * time (time) datetime64[ns] 448B 1970-01-01 ... 2025-01-01 Data variables: streamflow_max_spring (id, time) float32 1kB nan nan nan ... nan nan nan streamflow_max_summer (id, time) float32 1kB nan nan nan ... nan nan nan streamflow_max_fall (id, time) float32 1kB nan nan nan ... nan nan nan streamflow_max_august (id, time) float32 1kB nan nan nan ... nan nan nan streamflow_max_annual (id, time) float32 1kB nan nan nan ... nan nan nan streamflow_max_winter (id, time) float32 1kB nan nan nan ... nan nan nan Attributes: cat:frequency: yr cat:processing_level: indicators cat:id:
[6]:
ds_4fa.streamflow_max_summer.dropna("time", how="all").hvplot(
x="time", grid=True, widget_location="bottom", groupby="id"
)
[6]:
4.2.3. c) Utilisation de saisons personnalisées par année ou par station¶
Using individualized date ranges for each year or each catchment is not explicitly supported, so users should instead mask their data prior to calling get_yearly_op
. Note that when doing this, missing
should be adjusted accordingly.
[7]:
# Create a mask beforehand
import random
nyears = np.unique(ds.time.dt.year).size
dom_start = xr.DataArray(
np.random.randint(1, 30, size=(nyears,)).astype("str"),
dims=("year"),
coords={"year": np.unique(ds.time.dt.year)},
)
dom_end = xr.DataArray(
np.random.randint(1, 30, size=(nyears,)).astype("str"),
dims=("year"),
coords={"year": np.unique(ds.time.dt.year)},
)
mask = xr.zeros_like(ds["streamflow"])
for y in dom_start.year.values:
# Random mask of dates per year, between April and June.
mask.loc[
{
"time": slice(
str(y) + "-04-" + str(dom_start.sel(year=y).item()),
str(y) + "-06-" + str(dom_end.sel(year=y).item()),
)
}
] = 1
[8]:
mask.hvplot(x="time", grid=True, widget_location="bottom", groupby="id")
[8]:
[9]:
# The name of the indexer will be used to identify the variable created here
timeargs_custom = {"custom": {}}
# We use where() to mask the data that we want to ignore
masked = ds.where(mask == 1)
# Since we masked almost all of the year, our tolerance for missing data should be changed accordingly
missing = "at_least_n"
missing_options = {"n": 45}
# We use xr.merge() to combine the results with the previous dataset.
ds_4fa = xr.merge(
[
ds_4fa,
xh.indicators.get_yearly_op(
masked,
op="max",
timeargs=timeargs_custom,
missing=missing,
missing_options=missing_options,
),
]
)
[10]:
ds_4fa.streamflow_max_custom.dropna("time", how="all").hvplot(
x="time", grid=True, widget_location="bottom", groupby="id"
)
[10]:
4.2.4. d) Calcul des volumes¶
L’analyse de fréquence peut également être effectuée sur des volumes, en utilisant un flux de travail similaire. La principale différence est que si nous partons d’un débit, nous devrons d’abord le convertir en volume en utilisant xhydro.indicators.compute_volume
. De plus, si nécessaire, get_yearly_op
a un argument interpolate_na
qui peut être utilisé pour interpoler les données manquantes avant la somme.
[11]:
# Get a daily volume from a daily streamflow
ds["volume"] = xh.indicators.compute_volume(ds["streamflow"], out_units="hm3")
# We'll take slightly different indexers
timeargs_vol = {"spring": {"date_bounds": ["04-30", "06-15"]}, "annual": {}}
# The operation that we want here is the sum, not the max.
ds_4fa = xr.merge(
[
ds_4fa,
xh.indicators.get_yearly_op(
ds,
op="sum",
input_var="volume",
timeargs=timeargs_vol,
missing="pct",
missing_options={"tolerance": 0.15},
interpolate_na=True,
),
]
)
[12]:
ds_4fa.volume_sum_spring.dropna("time", how="all").hvplot(
x="time", grid=True, widget_location="bottom", groupby="id"
)
[12]:
4.3. Analyse fréquentielle locale¶
Une fois que nous avons nos maximums annuels (ou volumes/minimums), la première étape d’une analyse de fréquence locale est d’appeler xhfa.local.fit
pour obtenir les paramètres de distribution. Les options sont les suivantes :
distributions
: a list of SciPy distributions. Defaults to:["expon", "gamma", "genextreme", "genpareto", "gumbel_r", "pearson3", "weibull_min"]
.min_years
: the minimum number of years required to fit the data.method
: the fitting method. Defaults to the maximum likelihood.
[13]:
# To speed up the Notebook, we'll only perform the analysis on a subset of variables
params = xhfa.local.fit(
ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], min_years=15
)
params
[13]:
<xarray.Dataset> Size: 4kB Dimensions: (id: 5, dparams: 5, scipy_dist: 7) Coordinates: (12/16) * id (id) object 40B '020302' '020404' ... '020802' * dparams (dparams) <U5 100B 'a' 'c' 'skew' 'loc' 'scale' * scipy_dist (scipy_dist) <U11 308B 'expon' ... 'weibull_min' drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 ... ... source <U102 408B 'Ministère de l’Environnement, de la Lu... spatial_agg <U9 36B 'watershed' start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' Data variables: streamflow_max_spring (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray> volume_sum_spring (scipy_dist, id, dparams) float64 1kB dask.array<chunksize=(1, 5, 5), meta=np.ndarray> Attributes: cat:frequency: yr cat:processing_level: indicators cat:id:
Information Criteria such as the AIC
, BIC
, and AICC
are useful to determine which statistical distribution is better suited to a given location. These three criteria can be computed using xhfa.local.criteria
.
[14]:
criteria = xhfa.local.criteria(
ds_4fa[["streamflow_max_spring", "volume_sum_spring"]], params
)
criteria
[14]:
<xarray.Dataset> Size: 3kB Dimensions: (id: 5, scipy_dist: 7, criterion: 3) Coordinates: (12/16) drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 * id (id) object 40B '020302' '020404' ... '020802' latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 name (id) object 40B 'Saint' 'York' ... 'Madeleine' ... ... start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' * scipy_dist (scipy_dist) <U11 308B 'expon' ... 'weibull_min' * criterion (criterion) <U4 48B 'aic' 'bic' 'aicc' Data variables: streamflow_max_spring (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray> volume_sum_spring (scipy_dist, id, criterion) float64 840B dask.array<chunksize=(1, 5, 3), meta=np.ndarray> Attributes: cat:frequency: yr cat:processing_level: indicators cat:id:
Enfin, les périodes de retour peuvent être obtenues en utilisant xhfa.local.parametric_quantiles
. Les options sont les suivantes :
t
: la (les) période(s) de retour en années.mode
: whether the return period is the probability of exceedance ("max"
) or non-exceedance ("min"
). Defaults to"max"
.
[15]:
rp = xhfa.local.parametric_quantiles(params, t=[20, 100])
rp.load()
[15]:
<xarray.Dataset> Size: 2kB Dimensions: (id: 5, scipy_dist: 7, return_period: 2) Coordinates: (12/17) * id (id) object 40B '020302' '020404' ... '020802' * scipy_dist (scipy_dist) <U11 308B 'expon' ... 'weibull_min' drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 ... ... start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' * return_period (return_period) int64 16B 20 100 p_quantile (return_period) float64 16B 0.95 0.99 Data variables: streamflow_max_spring (scipy_dist, return_period, id) float64 560B nan .... volume_sum_spring (scipy_dist, return_period, id) float64 560B 751.2... Attributes: cat:frequency: yr cat:processing_level: indicators cat:id:
In a future release, plotting will be handled by a proper function. For now, we’ll show an example in this Notebook using preliminary utilities.
xhfa.local._prepare_plots
generates datapoints required to plot the results of the frequency analysis. If log=True
, it will return log-spaced x values between xmin
and xmax
.
[16]:
data = xhfa.local._prepare_plots(params, xmin=1, xmax=1000, npoints=50, log=True)
data.load()
[16]:
<xarray.Dataset> Size: 30kB Dimensions: (id: 5, scipy_dist: 7, return_period: 50) Coordinates: (12/17) * id (id) object 40B '020302' '020404' ... '020802' * scipy_dist (scipy_dist) <U11 308B 'expon' ... 'weibull_min' drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 ... ... start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' * return_period (return_period) float64 400B 1.0 1.151 ... 1e+03 p_quantile (return_period) float64 400B 0.0 0.1315 ... 0.999 Data variables: streamflow_max_spring (scipy_dist, return_period, id) float64 14kB nan .... volume_sum_spring (scipy_dist, return_period, id) float64 14kB 130.9... Attributes: cat:frequency: yr cat:processing_level: indicators cat:id:
xhfa.local._get_plotting_positions
allows you to get plotting positions for all variables in the dataset. It accepts alpha
beta
arguments. See the SciPy documentation for typical values. By default, (0.4, 0.4) will be used, which corresponds to approximately quantile unbiased (Cunnane).
[17]:
pp = xhfa.local._get_plotting_positions(ds_4fa[["streamflow_max_spring"]])
pp
[17]:
<xarray.Dataset> Size: 5kB Dimensions: (id: 5, time: 56) Coordinates: (12/15) drainage_area (id) float32 20B 1.09e+03 647.0 59.8 626.0 1.2e+03 end_date (id) datetime64[ns] 40B 2006-10-13 ... 1996-08-13 * id (id) object 40B '020302' '020404' ... '020802' latitude (id) float32 20B 48.77 48.81 48.98 48.98 49.2 longitude (id) float32 20B -64.52 -64.92 -64.43 -64.7 -65.29 name (id) object 40B 'Saint' 'York' ... 'Madeleine' ... ... spatial_agg <U9 36B 'watershed' start_date (id) datetime64[ns] 40B 1989-08-12 ... 1970-01-01 time_agg <U4 16B 'mean' timestep <U1 4B 'D' variable <U10 40B 'streamflow' * time (time) datetime64[ns] 448B 1970-01-01 ... 2025-... Data variables: streamflow_max_spring_pp (id, time) float64 2kB nan nan nan ... nan nan nan streamflow_max_spring (id, time) float32 1kB nan nan nan ... nan nan nan
[18]:
# Lets plot the observations
p1 = data.streamflow_max_spring.hvplot(
x="return_period", by="scipy_dist", grid=True, groupby=["id"], logx=True
)
[19]:
# Lets now plot the distributions
p2 = pp.hvplot.scatter(
x="streamflow_max_spring_pp",
y="streamflow_max_spring",
grid=True,
groupby=["id"],
logx=True,
)
[20]:
# And now combining the plots
p1 * p2
[20]: