{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Frequency analysis module - Local analysis" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Basic imports\n", "import hvplot.xarray # noqa\n", "import numpy as np\n", "import xarray as xr\n", "import xdatasets as xd\n", "\n", "import xhydro as xh\n", "import xhydro.frequency_analysis as xhfa" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Extracting and preparing the data\n", "For this example, we'll conduct a frequency analysis using historical time series from various sites. We begin by obtaining a dataset comprising hydrological information. Here, we use the [xdataset](https://hydrologie.github.io/xdatasets/notebooks/getting_started.html) library to acquire hydrological data from the [Ministère de l'Environnement, de la Lutte contre les changements climatiques, de la Faune et des Parcs](https://www.cehq.gouv.qc.ca/atlas-hydroclimatique/stations-hydrometriques/index.htm) in Québec, Canada. Specifically, our query focuses on stations with IDs beginning with `020`, possessing a natural flow pattern and limited to streamflow data.\n", "\n", "Users may prefer to generate their own `xarray.DataArray` using their individual dataset. At a minimum, the `xarray.DataArray` used for frequency analysis has to follow these principles:\n", "\n", "- The dataset needs a `time` dimension.\n", "- If there is a spatial dimension, such as `id` in the example below, it needs an attribute `cf_role` with `timeseries_id` as its value.\n", "- The variable will at the very least need a `units` attribute, although other attributes such as `long_name` and `cell_methods` are also expected by `xclim` (which is called at various points during the frequency analysis) and warnings will be generated if they are missing." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds = (\n", " xd.Query(\n", " **{\n", " \"datasets\": {\n", " \"deh\": {\n", " \"id\": [\"020*\"],\n", " \"regulated\": [\"Natural\"],\n", " \"variables\": [\"streamflow\"],\n", " }\n", " },\n", " \"time\": {\"start\": \"1970-01-01\", \"minimum_duration\": (15 * 365, \"d\")},\n", " }\n", " )\n", " .data.squeeze()\n", " .load()\n", ")\n", "\n", "# This dataset lacks some of the aforementioned attributes, so we need to add them.\n", "ds[\"id\"].attrs[\"cf_role\"] = \"timeseries_id\"\n", "ds[\"streamflow\"].attrs = {\n", " \"long_name\": \"Streamflow\",\n", " \"units\": \"m3 s-1\",\n", " \"standard_name\": \"water_volume_transport_in_river_channel\",\n", " \"cell_methods\": \"time: mean\",\n", "}\n", "\n", "# For the purpose of this example, we keep only 2 stations\n", "ds = ds.isel(id=slice(3, 5))\n", "ds" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds.streamflow.dropna(\"time\", how=\"all\").hvplot(\n", " x=\"time\", grid=True, widget_location=\"bottom\", groupby=\"id\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Customizing the analysis settings\n", "\n", "### a) Defining seasons\n", "We can define seasons using indexers that are compatible with `xclim.core.calendar.select_time`. 
There are currently four accepted types of indexers:\n", "\n", "- `month`, followed by a sequence of month numbers.\n", "- `season`, followed by one or more of `'DJF'`, `'MAM'`, `'JJA'`, and `'SON'`.\n", "- `doy_bounds`, followed by a sequence representing the inclusive day-of-year bounds of the period to be considered (`\"start\"`, `\"end\"`).\n", "- `date_bounds`, which is the same as above, but using a month-day (`'%m-%d'`) format.\n", "\n", "For the purpose of getting block maxima through `xhydro.indicators.get_yearly_op`, the indexers need to be grouped within a dictionary, with the key being the label to be given to the requested period of the year. A second key, `freq`, can be used to specify the resampling frequency, for example to wrap around the year for winter." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Some examples\n", "timeargs = {\n", "    \"spring\": {\"date_bounds\": [\"02-11\", \"06-19\"]},\n", "    \"summer\": {\"doy_bounds\": [152, 243]},\n", "    \"fall\": {\"month\": [9, 10, 11]},\n", "    \"winter\": {\n", "        \"season\": [\"DJF\"],\n", "        \"freq\": \"YS-DEC\",\n", "    },  # To correctly wrap around the year, we need to specify the resampling frequency.\n", "    \"august\": {\"month\": [8]},\n", "    \"annual\": {},\n", "}" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### b) Getting block maxima\n", "Once the desired seasons have been defined, we can extract block maxima time series for every station using `xhydro.indicators.get_yearly_op`. The main arguments are:\n", "\n", "- `op`: the operation to compute. One of `\"max\"`, `\"min\"`, `\"mean\"`, or `\"sum\"`.\n", "- `input_var`: the name of the variable. Defaults to `\"streamflow\"`.\n", "- `window`: the size of the rolling window. A `\"mean\"` is performed on the rolling window prior to the `op` operation.\n", "- `timeargs`: as defined previously. Leave as `None` to get the annual maxima.\n", "- `missing` and `missing_options`: to define tolerances for missing data. See [this page](https://xclim.readthedocs.io/en/stable/checks.html#missing-values-identification) for more information.\n", "- `interpolate_na`: whether to interpolate missing data prior to the `op` operation. Only used for `sum`." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The function returns an `xarray.Dataset` with one variable per indexer." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Here, we hide years with more than 15% of missing data.\n", "ds_4fa = xh.indicators.get_yearly_op(\n", "    ds, op=\"max\", timeargs=timeargs, missing=\"pct\", missing_options={\"tolerance\": 0.15}\n", ")\n", "\n", "ds_4fa" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_4fa.streamflow_max_summer.dropna(\"time\", how=\"all\").hvplot(\n", "    x=\"time\", grid=True, widget_location=\"bottom\", groupby=\"id\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### c) Using custom seasons per year or per station\n", "\n", "Using individualized date ranges for each year or each catchment is not explicitly supported, so users should instead mask their data prior to calling `get_yearly_op`. Note that when doing this, `missing` should be adjusted accordingly. A minimal sketch of a per-station mask is shown first, followed by a per-year example with randomized date ranges."
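] }, { "cell_type": "markdown", "metadata": {}, "source": [ "For the per-station case, a mask can be built directly from the `id` coordinate. The sketch below assumes fixed, purely illustrative seasonal windows for the two stations kept in this notebook; real bounds would come from local knowledge of each catchment. The resulting mask would then go through the same `where()` and `get_yearly_op` steps demonstrated below." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Minimal sketch of a per-station mask. The seasonal windows below are\n", "# hypothetical and only serve to illustrate the masking mechanics.\n", "station_bounds = dict(zip(ds.id.values, [(\"04-01\", \"06-30\"), (\"05-01\", \"07-31\")]))\n", "\n", "mask_station = xr.zeros_like(ds[\"streamflow\"])\n", "for stn, (start, end) in station_bounds.items():\n", "    for y in np.unique(ds.time.dt.year):\n", "        # Flag the days inside this station's window for year y\n", "        mask_station.loc[{\"id\": stn, \"time\": slice(f\"{y}-{start}\", f\"{y}-{end}\")}] = 1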
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Create a mask beforehand\n", "\n", "nyears = np.unique(ds.time.dt.year).size\n", "dom_start = xr.DataArray(\n", " np.random.randint(1, 30, size=(nyears,)).astype(\"str\"),\n", " dims=(\"year\"),\n", " coords={\"year\": np.unique(ds.time.dt.year)},\n", ")\n", "dom_end = xr.DataArray(\n", " np.random.randint(1, 30, size=(nyears,)).astype(\"str\"),\n", " dims=(\"year\"),\n", " coords={\"year\": np.unique(ds.time.dt.year)},\n", ")\n", "\n", "mask = xr.zeros_like(ds[\"streamflow\"])\n", "for y in dom_start.year.values:\n", " # Random mask of dates per year, between April and June.\n", " mask.loc[\n", " {\n", " \"time\": slice(\n", " str(y) + \"-04-\" + str(dom_start.sel(year=y).item()),\n", " str(y) + \"-06-\" + str(dom_end.sel(year=y).item()),\n", " )\n", " }\n", " ] = 1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "mask.hvplot(x=\"time\", grid=True, widget_location=\"bottom\", groupby=\"id\")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# The name of the indexer will be used to identify the variable created here\n", "timeargs_custom = {\"custom\": {}}\n", "\n", "# We use where() to mask the data that we want to ignore\n", "masked = ds.where(mask == 1)\n", "# Since we masked almost all the year, our tolerance for missing data should be changed accordingly\n", "missing = \"at_least_n\"\n", "missing_options = {\"n\": 45}\n", "\n", "# We use xr.merge() to combine the results with the previous dataset.\n", "ds_4fa = xr.merge(\n", " [\n", " ds_4fa,\n", " xh.indicators.get_yearly_op(\n", " masked,\n", " op=\"max\",\n", " timeargs=timeargs_custom,\n", " missing=missing,\n", " missing_options=missing_options,\n", " ),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_4fa.streamflow_max_custom.dropna(\"time\", how=\"all\").hvplot(\n", " x=\"time\", grid=True, widget_location=\"bottom\", groupby=\"id\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### d) Computing volumes\n", "\n", "The frequency analysis can also be performed on volumes, using a similar workflow. The main difference is that if we're starting from streamflow, we'll first need to convert them into volumes using `xhydro.indicators.compute_volume`. Also, if required, `get_yearly_op` has an argument `interpolate_na` that can be used to interpolate missing data prior to the sum." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Get a daily volume from a daily streamflow\n", "ds[\"volume\"] = xh.indicators.compute_volume(ds[\"streamflow\"], out_units=\"hm3\")\n", "\n", "# We'll take slightly different indexers\n", "timeargs_vol = {\"spring\": {\"date_bounds\": [\"04-30\", \"06-15\"]}, \"annual\": {}}\n", "\n", "# The operation that we want here is the sum, not the max.\n", "ds_4fa = xr.merge(\n", " [\n", " ds_4fa,\n", " xh.indicators.get_yearly_op(\n", " ds,\n", " op=\"sum\",\n", " input_var=\"volume\",\n", " timeargs=timeargs_vol,\n", " missing=\"pct\",\n", " missing_options={\"tolerance\": 0.15},\n", " interpolate_na=True,\n", " ),\n", " ]\n", ")" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ds_4fa.volume_sum_spring.dropna(\"time\", how=\"all\").hvplot(\n", " x=\"time\", grid=True, widget_location=\"bottom\", groupby=\"id\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Local frequency analysis\n", "\n", "Once we have our yearly maximums (or volumes/minimums), the first step in a local frequency analysis is to call `xhfa.local.fit` to obtain distribution parameters. The options are:\n", "\n", "- `distributions`: a list of [SciPy distributions](https://docs.scipy.org/doc/scipy/reference/stats.html#continuous-distributions). Defaults to: ``[\"expon\", \"gamma\", \"genextreme\", \"genpareto\", \"gumbel_r\", \"pearson3\", \"weibull_min\"]``.\n", "- `min_years`: the minimum number of years required to fit the data.\n", "- `method`: the fitting method. Defaults to the maximum likelihood." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# To speed up the Notebook, we'll only perform the analysis on a subset of variables\n", "params = xhfa.local.fit(\n", " ds_4fa[[\"streamflow_max_spring\", \"volume_sum_spring\"]], min_years=15\n", ")\n", "\n", "params" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Information Criteria such as the `AIC`, `BIC`, and `AICC` are useful to determine which statistical distribution is better suited to a given location. These three criteria can be computed using ``xhfa.local.criteria``." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "criteria = xhfa.local.criteria(\n", " ds_4fa[[\"streamflow_max_spring\", \"volume_sum_spring\"]], params\n", ")\n", "\n", "criteria" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally, return periods can be obtained using `xhfa.local.parametric_quantiles`. The options are:\n", "\n", "- `t`: the return period(s) in years.\n", "- `mode`: whether the return period is the probability of exceedance (`\"max\"`) or non-exceedance (`\"min\"`). Defaults to `\"max\"`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "rp = xhfa.local.parametric_quantiles(params, t=[20, 100])\n", "\n", "rp.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a future release, plotting will be handled by a proper function. For now, we'll show an example in this Notebook using preliminary utilities.\n", "\n", "`xhfa.local._prepare_plots` generates datapoints required to plot the results of the frequency analysis. 
If `log=True`, it will return log-spaced x values between `xmin` and `xmax`.\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "data = xhfa.local._prepare_plots(params, xmin=1, xmax=1000, npoints=50, log=True)\n", "data.load()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "`xhfa.local._get_plotting_positions` allows you to get plotting positions for all variables in the dataset. It accepts `alpha` and `beta` arguments. See the [SciPy documentation](https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mstats.plotting_positions.html) for typical values. By default, `(0.4, 0.4)` is used, which corresponds to the approximately quantile-unbiased positions of Cunnane.\n", "\n" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "pp = xhfa.local._get_plotting_positions(ds_4fa[[\"streamflow_max_spring\"]])\n", "pp" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's plot the fitted distributions\n", "p1 = data.streamflow_max_spring.hvplot(\n", "    x=\"return_period\", by=\"scipy_dist\", grid=True, groupby=[\"id\"], logx=True\n", ")\n", "p1" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Let's now plot the observations at their plotting positions\n", "p2 = pp.hvplot.scatter(\n", "    x=\"streamflow_max_spring_pp\",\n", "    y=\"streamflow_max_spring\",\n", "    grid=True,\n", "    groupby=[\"id\"],\n", "    logx=True,\n", ")\n", "p2" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# And now combining the two plots\n", "p1 * p2" ] } ], "metadata": { "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" }, "vscode": { "interpreter": { "hash": "e28391989cdb8b31df72dd917935faad186df3329a743c813090fc6af977a1ca" } } }, "nbformat": 4, "nbformat_minor": 4 }