1. GIS operations¶

Geographic Information System (GIS) operations are crucial in hydrological analysis. This section illustrates how to leverage the xHydro library for performing key GIS tasks, including delineating watershed boundaries and extracting critical physiographic, climatological, and geographical variables at the watershed scale.

[13]:

import leafmap
import numpy as np
import pandas as pd
import xarray as xr
import xclim
import xdatasets
from IPython.display import clear_output

import xhydro as xh
import xhydro.gis as xhgis

clear_output(wait=False)

1.1. Watershed delineation¶

Currently, watershed delineation utilizes HydroSHEDS’ HydroBASINS (hybas_na_lev01-12_v1c) and is compatible with any location in North America. The process involves identifying all upstream sub-basins from a specified outlet and consolidating them into a unified watershed. The leafmap library is used to generate interactive maps, which allow for the selection of outlets or the visualization of the resulting watershed boundaries. While the use of the map is not mandatory for performing the calculations, it greatly enhances the visualization and understanding of the watershed delineation process.

[14]:

m = leafmap.Map(center=(48.63, -74.71), zoom=5, basemap="USGS Hydrography")

1.1.1. a) From a list of coordinates¶

A first option is to provide a list of coordinates. In the example below, we select two pour points, with each one representing the outlet for the watersheds of Montmorency and the Beaurivage River, respectively.

[15]:

lng_lat = [
    (-71.15069, 46.89431),  # Montmorency river watershed
    (-71.28512, 46.66141),  # Beaurivage river watershed
]

1.1.2. b) From markers on a map¶

Rather than using a list, a more interactive approach allows for directly selecting outlets from the existing map m, using the Draw a marker button located on the left of the map. The image below demonstrates how to select pour points by dragging markers to the desired locations on the map.

After selecting points using either approach a) or b), or a combination of both, we can initiate the watershed delineation calculation. This is done using the function xhydro.gis.watershed_delineation.

[16]:

help(xhgis.watershed_delineation)

Help on function watershed_delineation in module xhydro.gis:

watershed_delineation(
    *,
    coordinates: 'list[tuple] | tuple | None' = None,
    m: 'leafmap.Map | None' = None
) -> 'gpd.GeoDataFrame'
    Calculate watershed delineation from pour point.

    Watershed delineation can be computed at any location in North America using HydroBASINS (hybas_na_lev01-12_v1c).
    The process involves assessing all upstream sub-basins from a specified pour point and consolidating them into a
    unified watershed.

    Parameters
    ----------
    coordinates : list of tuple, tuple, optional
        Geographic coordinates (longitude, latitude) for the location where watershed delineation will be conducted.
    m : leafmap.Map, optional
        If the function receives both a map and coordinates as inputs, it will generate and display watershed
        boundaries on the map. Additionally, any markers present on the map will be transformed into
        corresponding watershed boundaries for each marker.

    Returns
    -------
    gpd.GeoDataFrame
        GeoDataFrame containing the watershed boundaries.

    Warnings
    --------
    This function relies on an Amazon S3-hosted dataset to delineate watersheds.

[17]:

gdf = xhgis.watershed_delineation(coordinates=lng_lat, m=m)
gdf

[17]:

	HYBAS_ID	Upstream Area (sq. km)	geometry
0	7120034480	1157.6	POLYGON ((-71.15208 46.8838, -71.15612 46.8869...
1	7120365812	693.5	POLYGON ((-71.09758 46.40035, -71.09409 46.403...

The outcomes are stored in a GeoPandas gpd.GeoDataFrame object, enabling us to save the polygons in various common formats, such as ESRI Shapefile or GeoJSON. If a map m is available, the polygons will be automatically added to it. To visualize the map, simply type m in the code cell to render it. If displaying the map directly is not compatible with your notebook interpreter, you can use the following code to extract the HTML from the map and plot it:

[18]:

m.zoom_to_gdf(gdf)

1.1.3. c) From xdatasets ¶

While automatically delineating watershed boundaries is a valuable tool, users are encouraged to utilize official watershed boundaries when available, rather than generating new ones. The xdatasets library, for example, hosts a few official boundaries that can be accessed using the xdatasets.Query function. As of today, the following watershed sources are available:

Source	Dataset name
(Canada) National Water Data Archive: HYDAT	hydat_polygons
(Quebec) Provincial Ministry of Environment (MELCCFP/DPEH)	deh_polygons
(Quebec) Hydro-Quebec	hq_polygons

[19]:

xdatasets.Query(
    **{
        "datasets": {
            "deh_polygons": {
                "id": ["031502", "0421*"],
                "regulated": ["Natural"],
            }
        }
    }
).data.reset_index()

[19]:

	Station	Superficie	geometry
0	031502	15.708960	POLYGON ((-72.50126 46.21216, -72.50086 46.213...
1	042103	579.479614	POLYGON ((-78.49014 46.64514, -78.4901 46.6450...

1.2. Extraction of watershed properties¶

Once the watershed boundaries are obtained, we can extract valuable properties such as geographical information, land use classification, and climatological data from the delineated watersheds.

1.2.1. a) Geographical watershed properties¶

First, xhydro.gis.watershed_properties can be used to extract the geographical properties of the watershed, including the perimeter, total area, Gravelius coefficient, and basin centroid. It is important to note that this function returns all the columns present in the provided gpd.GeoDataFrame argument.

[20]:

help(xhgis.watershed_properties)

Help on function watershed_properties in module xhydro.gis:

watershed_properties(
    gdf: 'gpd.GeoDataFrame',
    *,
    unique_id: 'str | None' = None,
    projected_crs: 'int | str | None' = 'NAD83',
    output_format: "Literal['xarray', 'xr.Dataset', 'geopandas', 'gpd.GeoDataFrame']" = 'geopandas'
) -> 'gpd.GeoDataFrame | xr.Dataset'
    Watershed properties extracted from a gpd.GeoDataFrame.

    The calculated properties are :
    - area
    - perimeter
    - gravelius
    - centroid coordinates

    Parameters
    ----------
    gdf : gpd.GeoDataFrame
        GeoDataFrame containing watershed polygons. Must have a defined .crs attribute.
    unique_id : str, optional
        The column name in the GeoDataFrame that serves as a unique identifier.
    projected_crs : int | str
        The projected coordinate reference system (crs) to utilize for calculations, such as determining watershed area.
        If a string is provided, it should be a valid Geodetic CRS for the `gpd.estimate_utm_crs()` method.
        If None, the function will use the `gpd.estimate_utm_crs()` default (WGS 84).
        Default is an estimated CRS based on NAD83.
    output_format : str
        One of either `xarray` (or `xr.Dataset`) or `geopandas` (or `gpd.GeoDataFrame`).

    Returns
    -------
    gpd.GeoDataFrame or xr.Dataset
        Output dataset containing the watershed properties.

[21]:

xhgis.watershed_properties(gdf)

[21]:

	HYBAS_ID	Upstream Area (sq. km)	area (m2)	estimated_area_diff (%)	perimeter (m)	gravelius (m/m)	centroid_lon	centroid_lat
0	7120034480	1157.6	1.160550e+09	0.254842	263428.762122	2.181355	-71.091980	47.257839
1	7120365812	693.5	5.855856e+08	-15.560840	158295.639822	1.845308	-71.260464	46.452161

For added convenience, we can also retrieve the same results in the form of an xarray.Dataset:

[22]:

xhgis.watershed_properties(gdf, unique_id="HYBAS_ID", output_format="xarray")

1.2.2. b) Surface properties¶

We can use xhydro.gis.surface_properties to extract surface properties for the same gpd.GeoDataFrame, such as slope and aspect. By default, these properties are calculated using Copernicus’ GLO-90 Digital Elevation Model as of 2021-04-22. However, both the source and the date can be modified through the function’s arguments.

[23]:

help(xhgis.surface_properties)

Help on function surface_properties in module xhydro.gis:

surface_properties(
    gdf: 'gpd.GeoDataFrame',
    *,
    unique_id: 'str | None' = None,
    projected_crs: 'int | str | None' = 'NAD83',
    operation: 'str' = 'mean',
    dataset_date: 'str' = '2021-04-22',
    collection: 'str' = 'cop-dem-glo-90',
    output_format: "Literal['xarray', 'xr.Dataset', 'geopandas', 'gpd.GeoDataFrame']" = 'geopandas'
) -> 'gpd.GeoDataFrame | xr.Dataset'
    Surface properties for watersheds.

    Surface properties are calculated using Copernicus's GLO-90 Digital Elevation Model.
    By default, the dataset has a geographic coordinate system (EPSG: 4326) and this function expects a projected crs for more accurate results.

    The calculated properties are :
    - elevation (meters)
    - slope (degrees)
    - aspect ratio (degrees)

    Parameters
    ----------
    gdf : gpd.GeoDataFrame
        GeoDataFrame containing watershed polygons. Must have a defined .crs attribute.
    unique_id : str, optional
        The column name in the GeoDataFrame that serves as a unique identifier.
    projected_crs : int | str
        The projected coordinate reference system (crs) to utilize for calculations, such as determining watershed area.
        If a string is provided, it should be a valid Geodetic CRS for the `gpd.estimate_utm_crs()` method.
        If None, the function will use the `gpd.estimate_utm_crs()` default (WGS 84).
        Default is an estimated CRS based on NAD83.
    operation : str
        Aggregation statistics such as `mean` or `sum`.
    dataset_date : str
        Date (%Y-%m-%d) for which to select the imagery from the dataset. Date must be available.
    collection : str
        Collection name from the Planetary Computer STAC Catalog. Default is `cop-dem-glo-90`.
    output_format : str
        One of either `xarray` (or `xr.Dataset`) or `geopandas` (or `gpd.GeoDataFrame`).

    Returns
    -------
    gpd.GeoDataFrame or xr.Dataset
        Output dataset containing the surface properties.

    Warnings
    --------
    This function relies on the Microsoft Planetary Computer's STAC Catalog to retrieve the Digital Elevation Model (DEM) data.

[24]:

xhgis.surface_properties(gdf)

[24]:

	elevation	slope	aspect	time	spatial_ref
geometry
0	725.898315	5.307655	186.393692	2021-04-22	0
1	222.474442	1.356003	212.094788	2021-04-22	0

Again, for convenience, we can output the results in xarray.Dataset format :

[25]:

xhgis.surface_properties(gdf, unique_id="HYBAS_ID", output_format="xarray")

1.2.4. d) Climate indicators¶

The extraction of climate indicators is the most complex step, as it requires accessing a weather dataset, then subsetting and averaging the data over the various watersheds within the GeoDataFrame object. These steps are outside the scope of xHydro, and users will need to rely on other libraries for this task. One approach, outlined in the Use Case Example, involves using a combination of xscen and xESMF. Another approach, demonstrated here, utilizes xdatasets. Indeed, xdatasets enables the extraction of all the pixels from a gridded dataset within a watershed while accounting for the weighting of each pixel intersecting the watershed.

For this example, we will use ERA5-Land reanalysis data covering the period from 1981 to 2010 as our climatological dataset.

[29]:

datasets = {
    "era5_land_reanalysis": {"variables": ["t2m", "tp"]},
}
space = {
    "clip": "polygon",  # bbox, point or polygon
    "averaging": True,
    "geometry": gdf,
    "unique_id": "HYBAS_ID",
}
time = {
    "start": "1981-01-01",
    "end": "1982-12-31",
    "timezone": "America/Montreal",
    "timestep": "D",
    "aggregation": {"tp": np.nansum, "t2m": np.nanmean},
}

ds = xdatasets.Query(datasets=datasets, space=space, time=time).data.squeeze()

Temporal operations: processing tp with era5_land_reanalysis: 100%|██████████| 2/2 [00:00<00:00, 34.55it/s]

Because the next steps utilize xclim under the hood, the dataset must be CF-compliant. At a minimum, the xarray.DataArray used should adhere to the following principles:

The dataset must include a time dimension, typically at a daily frequency, with no missing timesteps (although NaNs are supported). If your data deviates from this format, extra caution should be taken when interpreting the results.
If there is a 1D spatial dimension, such as HYBAS_ID in the example below, it must have an attribute cf_role with the value timeseries_id.
The variable must include at least a units attribute. While not mandatory, other attributes such as long_name and cell_methods are expected by xclim, and warnings will be raised if they are missing.
Variable names should match those supported by xclim.

The following code will format the ERA5-Land data obtained from xdatasets to add missing metadata and change the variable names and units:

[30]:

ds = ds.rename({"t2m_nanmean": "tas", "tp_nansum": "pr"})
ds["tas"] = xclim.core.units.convert_units_to(ds["tas"], "degC")
ds["tas"].attrs.update({"cell_methods": "time: mean"})

ds["pr"].attrs.update({"units": "m d-1", "cell_methods": "time: mean within days"})
ds["pr"] = xclim.core.units.convert_units_to(ds["pr"], "mm d-1")
ds

Climate indicators can be calculated in various ways, and for most simple tasks, directly using xclim is always a viable option. xclim offers an extensive list of available indicators within a user-friendly and flexible framework.

[31]:

xclim.atmos.tg_max(ds.tas)

For more complex tasks, or when computing multiple indicators simultaneously, xhydro.indicators.compute_indicators is often the preferred method. This function allows users to build and pass multiple indicators at once, either by providing a list of custom indicators created with xclim.core.indicator.Indicator.from_dict (see the INFO box below) or by referencing a YAML file (see the ``xscen` documentation <https://xscen.readthedocs.io/en/latest/notebooks/2_getting_started.html#Computing-indicators>`__).

INFO

Custom indicators in xHydro are built by following the YAML formatting required by xclim.

A custom indicator built using xclim.core.indicator.Indicator.from_dict will need these elements:

“data”: A dictionary with the following information:
- “base”: The “YAML ID” obtained from here.
- “input”: A dictionary linking the default xclim input to the name of your variable. Needed only if it is different. In the link above, they are the string following “Uses:”.
- “parameters”: A dictionary containing all other parameters for a given indicator. In the link above, the easiest way to access them is by clicking the link in the top-right corner of the box describing a given indicator.
- More entries can be used here, as described in the xclim documentation under “identifier”.
“identifier”: A custom name for your indicator. This will be the name returned in the results.
“module”: Needed, but can be anything. To prevent an accidental overwriting of xclim indicators, it is best to use something different from: [“atmos”, “land”, “generic”].

[32]:

help(xh.indicators.compute_indicators)

Help on function compute_indicators in module xscen.indicators:

compute_indicators(
    ds: xarray.core.dataset.Dataset,
    indicators: str | os.PathLike | collections.abc.Sequence[xclim.core.indicator.Indicator] | collections.abc.Sequence[tuple[str, xclim.core.indicator.Indicator]] | module,
    *,
    periods: list[str] | list[list[str]] | None = None,
    restrict_years: bool = True,
    to_level: str | None = 'indicators',
    rechunk_input: bool = False
) -> dict
    Calculate variables and indicators based on a YAML call to xclim.

    The function cuts the output to be the same years as the inputs.
    Hence, if an indicator creates a timestep outside the original year range (e.g. the first DJF for QS-DEC),
    it will not appear in the output.

    Parameters
    ----------
    ds : xr.Dataset
        Dataset to use for the indicators.
    indicators : str | os.PathLike | Sequence[Indicator] | Sequence[tuple[str, Indicator]] | ModuleType
        Path to a YAML file that instructs on how to calculate missing variables.
        Can also be only the "stem", if translations and custom indices are implemented.
        Can be the indicator module directly, or a sequence of indicators or a sequence of
        tuples (indicator name, indicator) as returned by `iter_indicators()`.
    periods : list of str or list of lists of str, optional
        Either [start, end] or list of [start, end] of continuous periods over which to compute the indicators.
        This is needed when the time axis of ds contains some jumps in time.
        If None, the dataset will be considered continuous.
    restrict_years : bool
        If True, cut the time axis to be within the same years as the input.
        This is mostly useful for frequencies that do not start in January, such as QS-DEC.
        In that instance, `xclim` would start on previous_year-12-01 (DJF), with a NaN.
        `restrict_years` will cut that first timestep. This should have no effect on YS and MS indicators.
    to_level : str, optional
        The processing level to assign to the output.
        If None, the processing level of the inputs is preserved.
    rechunk_input : bool
        If True, the dataset is rechunked with :py:func:`flox.xarray.rechunk_for_blockwise`
        according to the resampling frequency of the indicator. Each rechunking is done
        once per frequency with :py:func:`xscen.utils.rechunk_for_resample`.

    Returns
    -------
    dict
        Dictionary (keys = timedeltas) with indicators separated by temporal resolution.

    See Also
    --------
    xclim.indicators, xclim.core.indicator.build_indicator_module_from_yaml

[33]:

indicators = [
    # Maximum summer temperature
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "tg_max",
            "parameters": {"indexer": {"month": [6, 7, 8]}},
        },
        identifier="tg_max_summer",
        module="example",
    ),
    # Number of days with more than 5 mm of precipitation
    xclim.core.indicator.Indicator.from_dict(
        data={
            "base": "wetdays",
            "parameters": {"thresh": "5 mm d-1"},
        },
        identifier="wet5mm",
        module="example",
    ),
]

xh.indicators.compute_indicators(ds, indicators)["YS-JAN"]

Finally, xhydro.indicators.get_yearly_op, also built upon the xclim library, provides a flexible method for obtaining yearly values of specific statistics, such as annual or seasonal maxima. This can be particularly useful, for instance, when extracting raw data needed for frequency analyses.

[34]:

help(xh.indicators.get_yearly_op)

Help on function get_yearly_op in module xhydro.indicators.generic:

get_yearly_op(
    ds,
    op,
    *,
    input_var: str = 'q',
    window: int = 1,
    timeargs: dict | None = None,
    missing: str = 'skip',
    missing_options: dict | None = None,
    interpolate_na: bool = False
) -> xarray.core.dataset.Dataset
    Compute yearly operations on a variable.

    Parameters
    ----------
    ds : xr.Dataset
        Dataset containing the variable to compute the operation on.
    op : str
        Operation to compute. One of ["max", "min", "mean", "sum"].
    input_var : str
        Name of the input variable. Defaults to "q".
    window : int
        Size of the rolling window. A "mean" operation is performed on the rolling window before the call to xclim.
        This parameter cannot be used with the "sum" operation.
    timeargs : dict, optional
        Dictionary of time arguments for the operation.
        Keys are the name of the period that will be added to the results (e.g. "winter", "summer", "annual").
        Values are up to two dictionaries, with both being optional.
        The first is {'freq': str}, where str is a frequency supported by xarray (e.g. "YS", "YS-JAN", "YS-DEC").
        It needs to be a yearly frequency. Defaults to "YS-JAN".
        The second is an indexer as supported by :py:func:`xclim.core.calendar.select_time`.
        Defaults to {}, which means the whole year.
        See :py:func:`xclim.core.calendar.select_time` for more information.
        Examples: {"winter": {"freq": "YS-DEC", "date_bounds": ["12-01", "02-28"]}}, {"jan": {"freq": "YS", "month": 1}}, {"annual": {}}.
    missing : str
        How to handle missing values. One of "skip", "any", "at_least_n", "pct", "wmo".
        See :py:func:`xclim.core.missing` for more information.
    missing_options : dict, optional
        Dictionary of options for the missing values' method. See :py:func:`xclim.core.missing` for more information.
    interpolate_na : bool
        Whether to interpolate missing values before computing the operation. Only used with the "sum" operation.
        Defaults to False.

    Returns
    -------
    xr.Dataset
        Dataset containing the computed operations, with one variable per indexer.
        The name of the variable follows the pattern `{input_var}{window}_{op}_{indexer}`.

    Notes
    -----
    If you want to perform a frequency analysis on a frequency that is finer than annual, simply use multiple timeargs
    (e.g. 1 per month) to create multiple distinct variables.

The timeargs argument relies on indexers that are compatible with xclim.core.calendar.select_time. Four types of indexers are currently accepted:

month, followed by a sequence of month numbers.
season, followed by one or more of 'DJF', 'MAM', 'JJA', and 'SON'.
doy_bounds, followed by a sequence representing the inclusive bounds of the period to be considered ('start', 'end').
date_bounds, similar to doy_bounds, but using a month-day ('%m-%d') format.

Subsequently, we specify the operations to be calculated for each variable. Supported operations include "max", "min", "mean", and "sum".

[35]:

timeargs = {
    "january": {"month": [1]},
    "spring": {"date_bounds": ["02-11", "06-19"]},
    "summer_fall": {"date_bounds": ["06-20", "11-19"]},
    "year": {"date_bounds": ["01-01", "12-31"]},
}

operations = {
    "tas": ["max", "mean"],
    "pr": ["sum"],
}

The combination of timeargs and operations through the Cartesian product enables the efficient generation of a comprehensive array of climate indicators.

[36]:

ds_climatology = xr.merge(
    [
        xh.indicators.get_yearly_op(ds, input_var=variable, op=op, timeargs=timeargs)
        for (variable, ops) in operations.items()
        for op in ops
    ]
)
ds_climatology

The same data can also be visualized as a pd.DataFrame as well :

[37]:

pd.set_option("display.max_rows", 100)
ds_climatology.mean("time").to_dataframe().T

[37]:

HYBAS_ID	7120034480	7120365812
spatial_agg	polygon	polygon
timestep	D	D
source	era5_land_reanalysis	era5_land_reanalysis
tas_max_january	-5.979976	-1.652047
tas_max_spring	17.686927	20.677712
tas_max_summer_fall	20.757984	24.204149
tas_max_year	20.757984	24.204149
tas_mean_january	-18.967689	-16.441736
tas_mean_spring	0.424023	4.184447
tas_mean_summer_fall	9.293055	12.396911
tas_mean_year	1.134716	4.487463
pr_sum_january	61.612786	57.209695
pr_sum_spring	467.706516	393.650887
pr_sum_summer_fall	614.891367	639.558307
pr_sum_year	1351.15288	1290.998159