Dask on HPC
Slides; credit to the Dask community.
What is Dask?
Dask is a parallel computing library that scales Python code from a single laptop to a cluster. It's useful for:
Handling big data that doesn’t fit in memory.
Speeding up Pandas, NumPy, and scikit-learn workloads.
Running tasks concurrently using task graphs.
Run Locally
1. Environment Setup
Requirements
Python 3.8+
pip / conda
Git (optional, for cloning examples)
Set up a virtual environment (optional, but recommended)
python -m venv dask-env
source dask-env/bin/activate # For Windows, use: dask-env\Scripts\activate
📥 Install Dask (CPU only)
pip install "dask[complete]" # includes arrays, dataframes, diagnostics, etc. (quotes needed in some shells)
Or if you're using conda:
conda install dask distributed -c conda-forge
2. Try Basic Dask Examples
Run these as Python files or in a Jupyter notebook.
pip install jupyterlab
jupyter lab
Use Dask Dashboard (Optional, Very Useful)
Run your script with the distributed scheduler to get the dashboard.
from dask.distributed import Client
client = Client() # Starts a local cluster
print(client.dashboard_link)
Open the printed URL in your browser (e.g. http://127.0.0.1:8787).
See live task execution, memory usage, worker status, etc.
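If the defaults don't suit your machine, you can size the local cluster explicitly. A minimal sketch; the worker count, threads, and memory limit below are illustrative values, not recommendations:
from dask.distributed import Client
# Start a local cluster with explicit sizing (illustrative values)
client = Client(n_workers=4, threads_per_worker=2, memory_limit="2GiB")
print(client.dashboard_link)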
a) Dask DataFrame (Parallel Pandas)
# In a notebook, first install the HTTP dependencies: !pip install requests aiohttp
import dask.dataframe as dd
# Load a CSV over HTTP; it is read lazily, in chunks
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Operations are lazy: nothing happens until compute()
print(df.head()) # Triggers computation
print(df.total_bill.mean().compute()) # Mean of a column
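Grouped aggregations work the same way as in Pandas and stay lazy until compute(). For example, with the same df as above:
# Average tip per day (lazy until compute())
print(df.groupby("day").tip.mean().compute())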
b) Dask Array (Parallel NumPy)
import dask.array as da
# Create a large array (10000 x 10000) split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Compute the mean
mean = x.mean()
print(mean.compute()) # Triggers computation
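Array operations compose lazily, so you can chain several before a single compute(). A small sketch reusing the same x:
# Chain NumPy-style operations; only the final compute() runs the graph
y = (x + x.T).sum(axis=1)
print(y[:5].compute())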
c) Dask Delayed (Manual Task Graphs)
from dask import delayed
import time
def slow_add(x, y):
    time.sleep(1)
    return x + y
# Wrap with delayed
a = delayed(slow_add)(1, 2)
b = delayed(slow_add)(3, 4)
c = delayed(slow_add)(a, b)
# Nothing runs until compute
print(c.compute()) # Takes ~2 seconds: a and b run in parallel, then c
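To evaluate several delayed objects in one scheduler pass, instead of calling .compute() on each, use dask.compute. For example, with a and b from above:
from dask import compute
# Evaluate a and b together, sharing one task graph
res_a, res_b = compute(a, b)
print(res_a, res_b)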
d) Speed Comparison
import time
from dask import delayed, compute
# --- Serial Version (Python) ---
def slow_square(x):
    time.sleep(0.5)  # Simulate work
    return x * x
start = time.time()
results = [slow_square(x) for x in range(10)]
total_serial = sum(results)
print("Serial total:", total_serial)
print("Serial time: %.2f seconds" % (time.time() - start))
# --- Dask Parallel Version ---
@delayed
def slow_square_dask(x):
    time.sleep(0.5)
    return x * x
start = time.time()
tasks = [slow_square_dask(x) for x in range(10)]
total_parallel = delayed(sum)(tasks)
print("Parallel total:", total_parallel.compute())
print("Parallel time: %.2f seconds" % (time.time() - start))
Pre-made Demos
3. Running Example Scripts
Clone the official Dask examples:
git clone https://github.com/dask/dask-examples.git
cd dask-examples
Run an example notebook or script:
pip install jupyterlab
jupyter lab
Look in folders like dataframe/, array/, delayed/, or distributed/ for ready-to-run demos.
Running on TAMU FASTER
Authorized ACCESS users can log in using the Web Portal:
Go to Clusters -> Shell Access
In the shell:
cd $SCRATCH
git clone https://github.com/dask/dask-examples.git
Dask examples docs: https://examples.dask.org/
On the portal dashboard, go to Interactive Apps -> Jupyter Notebook
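For multi-node runs, Dask workers can also be launched as Slurm batch jobs with dask-jobqueue. A minimal sketch, assuming dask-jobqueue is installed in your environment; the account, queue, and sizing values below are placeholders to replace with your allocation's settings:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
# Each "job" below is one Slurm batch job running one Dask worker
cluster = SLURMCluster(
    cores=4,               # cores per job
    memory="16GB",         # memory per job
    walltime="01:00:00",
    account="my_account",  # placeholder: your Slurm account
    queue="cpu",           # placeholder: your partition name
)
cluster.scale(jobs=4)      # request 4 worker jobs
client = Client(cluster)
print(client.dashboard_link)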
Extra Conda Setup
Conda Setup for Dask (Recommended)
1. Create a Conda Environment
conda create -n dask-env python=3.10 -y
2. Activate the Environment
conda activate dask-env
3. Install Dask (Core + Scheduler)
conda install -c conda-forge dask distributed -y
4. (Optional) Install Common Dependencies
conda install -c conda-forge pandas numpy jupyterlab matplotlib scikit-learn pyarrow fastparquet -y
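5. Verify the Install
A quick sanity check, run inside the activated environment; it prints the installed versions:
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"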
Adding custom packages to the TAMU Jupyter Notebook portal app:
To create a conda environment called my_notebook (you can name it whatever you like), do the following on the command line:
module purge
module load Anaconda3/2022.10
conda create -n my_notebook
After your my_notebook environment is created, you will see output describing how to activate and use it:
#
# To activate this environment, use:
# > source activate my_notebook
#
# To deactivate an active environment, use:
# > source deactivate
#
Then install the notebook package, after which you can add optional packages to your my_notebook environment:
source activate my_notebook
conda install -c conda-forge notebook
conda install -c conda-forge <optional-packages>
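For example, to run this workshop's Dask notebooks from the portal, the optional packages would be Dask itself:
conda install -c conda-forge dask distributed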
You can use your Anaconda environment in the Jupyter Notebook portal app by selecting the Anaconda module on the portal app page and entering just the name (without the full path) of your environment in the "Optional Environment to be activated" box. In the example above, the value to enter is my_notebook.