Load Input Data in Parallel with Dask and UXarray#

Overview#

This usage example demonstrates how to load unstructured input data with UXarray and Dask. It also showcases loading in parallel and chunking, along with their respective performance.

import uxarray as ux
from dask.distributed import Client, LocalCluster
import xarray as xr
import warnings

warnings.filterwarnings("ignore")

Data#

Data loaded in this notebook is simulated output from the Department of Energy (DOE) Energy Exascale Earth System Model (E3SM) version 2. The case is set up as an atmosphere-only (AMIP) simulation with present-day control forcing (F2010) at a 1-degree horizontal resolution (ne30pg2), with sea surface temperatures and sea ice set to the E3SMv2 defaults. The case is run for 6 years.

Chunking#

Chunks are small pieces of the array of interest; Dask divides the array into chunks small enough to fit in memory.

UXarray inherits this chunking feature from Dask, so the chunks of the data can be specified when loading.

Loading Data with Chunking#

The following example demonstrates loading one monthly output file from E3SM. By supplying the chunks argument, the loaded data is split as specified in the given dictionary. Typically, a data array or dataset is chunked when the data is loaded.

In the following example, the data is split along the atmospheric vertical-level dimension lev, as specified in the dictionary {"lev": 4}.

data_file_monthonly = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/20231220.F2010.ENSO_ctl.lagreg.ne30pg2_EC30to60E2r2.keeling.eam.h0.0006-12.nc"
grid_file = (
    "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/E3SM_grid/ne30pg2_grd.nc"
)
uxds_e3sm_mon = ux.open_dataset(grid_file, data_file_monthonly, chunks={"lev": 4})

Now look at one of the data arrays in the loaded dataset.

By accessing one of the variables, Q (specific humidity), we can look at the data array dimensions. The full data array has 1 point in time, 72 vertical levels, and a total of 21600 faces in the simulation grid, corresponding to the single monthly output we loaded and the info shown below: time: 1, lev: 72, n_face: 21600.

The chunk size is also shown in the second line, where each chunk contains 4 vertical levels instead of 72 (see chunksize=(1, 4, 21600)), confirming we have successfully chunked the data.

uxds_e3sm_mon.Q
<xarray.UxDataArray 'Q' (time: 1, lev: 72, n_face: 21600)>
dask.array<open_dataset-Q, shape=(1, 72, 21600), dtype=float32, chunksize=(1, 4, 21600), chunktype=numpy.ndarray>
Coordinates:
  * lev      (lev) float64 0.1238 0.1828 0.2699 0.3986 ... 986.2 993.8 998.5
  * time     (time) object 0007-01-01 00:00:00
Dimensions without coordinates: n_face
Attributes:
    mdims:         1
    units:         kg/kg
    long_name:     Specific humidity
    cell_methods:  time: mean
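
The chunk layout can also be inspected programmatically; a minimal sketch using the chunksizes property, which UxDataArray inherits from xarray:

# Mapping from each dimension name to the tuple of chunk sizes along it;
# here lev is split into 18 chunks of 4 levels each.
uxds_e3sm_mon.Q.chunksizes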

UXarray also supports the same feature when loading multiple files at once with open_mfdataset, using the same chunks argument as shown above.
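
For example, a minimal sketch that applies the same per-level chunking while opening the full set of monthly files (the glob pattern is the same one used later in this notebook):

# Open all monthly files at once and chunk along the vertical dimension,
# 4 levels per chunk, just as in the single-file example above.
data_files = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/*.nc"
uxds_e3sm_mf_chunked = ux.open_mfdataset(grid_file, data_files, chunks={"lev": 4})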

Chunk size is important because it can significantly affect performance, depending on the algorithm and usage. There are multiple possible configurations for chunking, such as splitting by a uniform dimension size or a specific chunk shape. Chunking can be done not only with a specified configuration but also with Dask's automatic chunking feature.

Special values when specifying chunk size include: -1 for no chunking (the full array in a single chunk), None for no change to the original chunking (when rechunking), and auto for automatic chunking. Dask's default target chunk size is 128 MiB, which can be verified by calling dask.config.get('array.chunk-size').

More details on the possible configurations and guidelines on deciding chunk size can be found on Dask’s Page about Array Chunks.
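
A short sketch of these special values, reusing the monthly file opened above (the result variable names here are illustrative):

import dask

# Dask's target chunk size for automatic chunking (defaults to "128MiB").
print(dask.config.get("array.chunk-size"))

# "auto" lets Dask choose the chunk size along lev to match the target above.
uxds_auto = ux.open_dataset(grid_file, data_file_monthonly, chunks={"lev": "auto"})

# -1 keeps every variable as a single, unsplit chunk.
uxds_whole = ux.open_dataset(grid_file, data_file_monthonly, chunks=-1)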

Loading Grid with Chunking#

UXarray also supports chunking of unstructured grid files. Both chunking a single array and chunking the full grid with Dask are supported in UXarray.

In the following, the grid is first loaded and then chunked by calling .chunk(). This method is equivalent to chunking the data by defining the chunk size with the chunks argument when calling open_dataset().

Chunk a Single Array#

uxgrid = uxds_e3sm_mon.uxgrid
uxgrid.node_lon = uxgrid.node_lon.chunk()
uxgrid.node_lon.data
              Array                      Chunk
Bytes         169.74 kiB                 169.74 kiB
Shape         (21727,)                   (21727,)
Dask graph    1 chunks in 1 graph layer
Data type     float64 numpy.ndarray
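
The same pattern accepts an explicit chunk size; a sketch that splits the node longitudes into chunks of 10,000 nodes, assuming the n_node dimension name from UXarray's grid conventions:

# Rechunk the node longitudes into chunks of at most 10,000 nodes each.
uxgrid.node_lon = uxgrid.node_lon.chunk({"n_node": 10000})
uxgrid.node_lon.data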

Loading Data with the parallel argument#

Similar to Xarray, UXarray supports loading data in parallel. The performance gain may not be significant for the dataset chosen for this notebook, and the Dask client configuration should be customized for the data at hand. Loading data in parallel with Dask is helpful when the dataset of interest does not fit in memory and/or execution is to be distributed independently over several CPU cores or machines.

Loading 6-year monthly data in Serial#

We first demonstrate loading 72 monthly files in serial and directly into memory. This case does not take advantage of the lazy loading or parallel input/output provided by Dask.

%%time
# Regular Load
data_files = "/glade/campaign/cisl/vast/uxarray/data/e3sm_keeling/ENSO_ctl_1std/unstructured/*.nc"
uxds_e3sm_basic_load = ux.open_mfdataset(grid_file, data_files, parallel=False)
CPU times: user 19.4 s, sys: 321 ms, total: 19.7 s
Wall time: 25.4 s

By default, Dask treats each file as a single chunk. To modify chunk sizes, it is recommended to do so in the call to open_mfdataset; loading the data first and then rechunking results in inefficient use of memory, as sketched below.
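
A brief sketch contrasting the two approaches, with the second being the one to avoid:

# Recommended: request the desired chunking while opening the files.
uxds_chunked_on_open = ux.open_mfdataset(grid_file, data_files, chunks={"lev": 4})

# Discouraged: open first (one whole-file chunk per file) and rechunk afterward;
# the whole-file chunks are read before being split, which wastes memory.
uxds_rechunked = ux.open_mfdataset(grid_file, data_files).chunk({"lev": 4})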

Loading 6-year monthly data in Parallel#

The following code demonstrates setting up a local cluster with 128 workers (n_workers) and 2 threads per worker (threads_per_worker). Using a local cluster allows multi-process computation on your local machine (e.g., a laptop) and provides a diagnostic dashboard for monitoring performance.

More details on cluster configuration can be read here.

warnings.filterwarnings("ignore")
cluster = LocalCluster(n_workers=128, threads_per_worker=2)
client = Client(cluster)
client

Client: Client-2c1e1c47-8680-11ef-82b2-0040a687f6f7
Connection method: Cluster object
Cluster type: distributed.LocalCluster
Dashboard: https://jupyterhub.hpc.ucar.edu/stable/user/rtam/proxy/8787/status

The only difference between loading in serial and in parallel is setting the argument parallel to True or False. Setting parallel=True speeds up the loading step by using dask.delayed, which defers execution.

%%time
uxds_e3sm_parallel_load = ux.open_mfdataset(grid_file, data_files, parallel=True)
CPU times: user 13 s, sys: 1.68 s, total: 14.7 s
Wall time: 13.8 s
uxds_e3sm_parallel_load
<xarray.UxDataset>
Dimensions:              (time: 72, n_face: 21600, lev: 72, ilev: 73,
                          cosp_prs: 7, nbnd: 2, cosp_tau: 7, cosp_scol: 10,
                          cosp_ht: 40, cosp_sr: 15, cosp_sza: 5,
                          cosp_htmisr: 16, cosp_tau_modis: 7, cosp_reffice: 6,
                          cosp_reffliq: 6)
Coordinates: (12/13)
  * lev                  (lev) float64 0.1238 0.1828 0.2699 ... 993.8 998.5
  * ilev                 (ilev) float64 0.1 0.1477 0.218 ... 990.5 997.0 1e+03
  * cosp_prs             (cosp_prs) float64 9e+04 7.4e+04 ... 2.45e+04 9e+03
  * cosp_tau             (cosp_tau) float64 0.15 0.8 2.45 6.5 16.2 41.5 100.0
  * cosp_scol            (cosp_scol) int32 1 2 3 4 5 6 7 8 9 10
  * cosp_ht              (cosp_ht) float64 1.896e+04 1.848e+04 ... 720.0 240.0
    ...                   ...
  * cosp_sza             (cosp_sza) float64 0.0 20.0 40.0 60.0 80.0
  * cosp_htmisr          (cosp_htmisr) float64 0.0 250.0 ... 1.6e+04 1.8e+04
  * cosp_tau_modis       (cosp_tau_modis) float64 0.15 0.8 2.45 ... 41.5 100.0
  * cosp_reffice         (cosp_reffice) float64 5e-06 1.5e-05 ... 5e-05 7.5e-05
  * cosp_reffliq         (cosp_reffliq) float64 4e-06 9e-06 ... 1.75e-05 2.5e-05
  * time                 (time) object 0001-02-01 00:00:00 ... 0007-01-01 00:...
Dimensions without coordinates: n_face, nbnd
Data variables: (12/471)
    lat                  (time, n_face) float64 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    lon                  (time, n_face) float64 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    area                 (time, n_face) float64 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    hyam                 (time, lev) float64 dask.array<chunksize=(1, 72), meta=np.ndarray>
    hybm                 (time, lev) float64 dask.array<chunksize=(1, 72), meta=np.ndarray>
    P0                   (time) float64 1e+05 1e+05 1e+05 ... 1e+05 1e+05 1e+05
    ...                   ...
    soa_c1DDF            (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    soa_c1SFWET          (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    soa_c2DDF            (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    soa_c2SFWET          (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    soa_c3DDF            (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>
    soa_c3SFWET          (time, n_face) float32 dask.array<chunksize=(1, 21600), meta=np.ndarray>