Dask on HPC

Slides, Credit to Dask Community

Dask is a parallel computing library that scales Python code from a single laptop to a cluster. It's useful for:

  • Handling big data that doesn’t fit in memory.

  • Speeding up Pandas, NumPy, and scikit-learn workloads.

  • Running tasks concurrently using task graphs.


Run Locally

1. Environment Setup

Requirements

  • Python 3.8+

  • pip / conda

  • Git (optional, for cloning examples)

📥 Install Dask (CPU only)

With pip, for example: python -m pip install "dask[complete]"

Or if you're using conda, for example: conda install dask


2. Try Basic Dask Examples

Run these as Python files or in a Jupyter notebook.

Use Dask Dashboard (Optional, Very Useful)

Run your script with the distributed scheduler to get the dashboard.

  • Open the printed URL in your browser (e.g. http://127.0.0.1:8787).

  • See live task execution, memory usage, worker status, etc.
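
A minimal sketch of starting the distributed scheduler from a script and printing the dashboard URL. The helper name start_local_client and the worker counts are illustrative assumptions, not values from the original page, and the dashboard pages only render if the optional bokeh package is installed.

```python
from dask.distributed import Client


def start_local_client(n_workers=2, threads_per_worker=1, processes=False):
    """Start a local cluster and return a Client connected to it.

    processes=False keeps the workers as threads inside this process, which
    is convenient for a quick demo. The defaults here are assumptions.
    """
    return Client(
        n_workers=n_workers,
        threads_per_worker=threads_per_worker,
        processes=processes,
    )


def main():
    client = start_local_client()
    # Open this URL in a browser while the script is still running.
    print("Dashboard:", client.dashboard_link)
    # Any work submitted through the client shows up on the dashboard.
    future = client.submit(sum, [1, 2, 3])
    print("Result:", future.result())
    client.close()


if __name__ == "__main__":
    main()
```

The dashboard is only reachable while the client is alive, so a longer workload (or a notebook session) gives you more time to watch it.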

a) Dask DataFrame (Parallel Pandas)

b) Dask Array (Parallel NumPy)
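
A small sketch of the array API: a large array is described as a grid of NumPy chunks, and the arithmetic runs chunk by chunk when compute() is called. The shape, the chunk size, and the helper name chunked_mean are arbitrary choices for illustration.

```python
import dask.array as da


def chunked_mean(shape=(4000, 4000), chunks=(1000, 1000)):
    """Return the mean of x + x.T for an all-ones array, computed in chunks."""
    # Lazy array of ones, stored as a grid of NumPy blocks of size `chunks`.
    x = da.ones(shape, chunks=chunks)
    # Still lazy: this extends the task graph with an add and a reduction.
    y = (x + x.T).mean()
    # compute() executes the graph and returns a NumPy scalar.
    return y.compute()


if __name__ == "__main__":
    print(chunked_mean())
```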

c) Dask Delayed (Manual Task Graphs)
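
The sketch below shows one way to build a task graph by hand with dask.delayed. The functions inc and add are toy examples; the two inc calls do not depend on each other, so the scheduler is free to run them concurrently.

```python
from dask import delayed


@delayed
def inc(x):
    return x + 1


@delayed
def add(a, b):
    return a + b


def build_graph():
    """Return a lazy Delayed object for add(inc(1), inc(2))."""
    a = inc(1)  # not executed yet, just recorded in the graph
    b = inc(2)  # independent of `a`
    return add(a, b)


if __name__ == "__main__":
    total = build_graph()
    # Nothing runs until compute() is called.
    print(total.compute())
```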

d) Speed Comparison
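
One possible way to compare timings is sketched below. It assumes a deliberately slow function (slow_square, which just sleeps) and runs it first in a plain loop, then through dask.delayed on the threaded scheduler. The delay and worker count are illustrative; real speedups depend on the workload and the machine.

```python
import time

import dask
from dask import delayed


def slow_square(x, delay=0.2):
    """Simulate an expensive call by sleeping before returning x squared."""
    time.sleep(delay)
    return x * x


def run_serial(values, delay=0.2):
    start = time.perf_counter()
    results = [slow_square(v, delay) for v in values]
    return results, time.perf_counter() - start


def run_parallel(values, delay=0.2, num_workers=4):
    start = time.perf_counter()
    # Build one lazy task per input, then execute them together.
    tasks = [delayed(slow_square)(v, delay) for v in values]
    results = dask.compute(*tasks, scheduler="threads", num_workers=num_workers)
    return list(results), time.perf_counter() - start


if __name__ == "__main__":
    values = list(range(8))
    serial_results, serial_time = run_serial(values)
    parallel_results, parallel_time = run_parallel(values)
    print(f"serial:   {serial_time:.2f} s")
    print(f"parallel: {parallel_time:.2f} s")
    print("same results:", serial_results == parallel_results)
```

Sleeping releases the interpreter lock, so threads help here; CPU-bound pure-Python code usually needs the process-based or distributed scheduler to see a similar gain.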


Premade Demos

3. Running Example Scripts

Clone the official Dask examples repository from https://github.com/dask/dask-examples.

Run an example notebook or script:

Look in folders like dataframe/, array/, delayed/, or distributed/ for ready-to-run demos.

Running a container on TAMU FASTER

Authorized ACCESS users can log in using the Web Portal:

Go to Cluster -> Shell Access

On the shell:

Dask examples docs: https://examples.dask.org/

On the dashboard, go to Interactive Apps -> Jupyter Notebook


Extra conda setup

1. Create a Conda Environment

2. Activate the Environment

3. Install Dask (Core + Scheduler)

4. (Optional) Install Common Dependencies
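
Once the environment is set up, a quick check from Python can confirm which packages are importable. This is only a sketch: the helper name check_packages and the default package list are assumptions about what "common dependencies" means here, so adjust them to your own setup.

```python
import importlib.util


def check_packages(names=("dask", "distributed", "numpy", "pandas", "bokeh")):
    """Return a dict mapping each package name to True if it can be imported."""
    # find_spec looks a top-level package up without actually importing it.
    return {name: importlib.util.find_spec(name) is not None for name in names}


if __name__ == "__main__":
    for name, available in check_packages().items():
        print(f"{name:12s} {'ok' if available else 'MISSING'}")
```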


Adding custom packages to the TAMU Jupyter Notebook:

To create an Anaconda conda environment called my_notebook (you can name it whatever you like), do the following on the command line:

After your my_notebook environment is created, the output shows how to activate and use it.

Next, install the notebook package; after that you can add optional packages to your my_notebook environment.

You can use your Anaconda/ environment in the Jupyter Notebook portal app by selecting the Anaconda/ module in the portal app page and providing just the name (without the full path) of your Anaconda/ environment in the "Optional Environment to be activated" box. In the example above, the value to enter is: my_notebook
