Dask on HPC
Slides; credit to the Dask community.
What is Dask?
Dask is a parallel computing library that scales Python code from a single laptop to a cluster. It's useful for:
Handling big data that doesn’t fit in memory.
Speeding up Pandas, NumPy, and scikit-learn workloads.
Running tasks concurrently using task graphs.
Run Locally
1. Environment Setup
Requirements
Python 3.8+
pip / conda
Git (optional, for cloning examples)
Set up a virtual environment (optional, but recommended)
python -m venv dask-env
source dask-env/bin/activate # For Windows, use: dask-env\Scripts\activate
📥 Install Dask (CPU only)
pip install "dask[complete]" # includes arrays, dataframes, diagnostics, etc. (quotes needed in some shells)
Or if you're using conda:
conda install dask distributed -c conda-forge
2. Try Basic Dask Examples
Run these as Python files or in a Jupyter notebook.
pip install jupyterlab
jupyter lab
Use Dask Dashboard (Optional, Very Useful)
Run your script with the distributed scheduler to get the dashboard.
from dask.distributed import Client
client = Client() # Starts a local cluster
print(client.dashboard_link)
Open the printed URL in your browser (e.g. http://127.0.0.1:8787).
See live task execution, memory usage, worker status, etc.
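If the defaults don't suit your machine, you can size the local cluster explicitly. A minimal sketch; the worker count, threads, and memory limit below are illustrative values, not recommendations:
from dask.distributed import Client
# Start a local cluster with explicit sizing (illustrative values)
client = Client(n_workers=4, threads_per_worker=2, memory_limit="2GiB")
print(client.dashboard_link)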
a) Dask DataFrame (Parallel Pandas)
# In a notebook, first install the HTTP dependencies: !pip install requests aiohttp
import dask.dataframe as dd
# Load a CSV over HTTP; it is read lazily, in chunks
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Operations are lazy: nothing happens until compute()
print(df.head()) # Triggers computation
print(df.total_bill.mean().compute()) # Mean of a column
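Grouped aggregations work the same way as in Pandas and stay lazy until compute(). For example, with the same df as above:
# Average tip per day (lazy until compute())
print(df.groupby("day").tip.mean().compute())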
b) Dask Array (Parallel NumPy)
import dask.array as da
# Create a large array (10000 x 10000) split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Compute the mean
mean = x.mean()
print(mean.compute()) # Triggers computation
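Array operations compose lazily, so you can chain several before a single compute(). A small sketch reusing the same x:
# Chain NumPy-style operations; only the final compute() runs the graph
y = (x + x.T).sum(axis=1)
print(y[:5].compute())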
c) Dask Delayed (Manual Task Graphs)
from dask import delayed
import time
def slow_add(x, y):
    time.sleep(1)
    return x + y
# Wrap with delayed
a = delayed(slow_add)(1, 2)
b = delayed(slow_add)(3, 4)
c = delayed(slow_add)(a, b)
# Nothing runs until compute
print(c.compute()) # Takes ~2 seconds: a and b run in parallel, then c
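To evaluate several delayed objects in one scheduler pass, instead of calling .compute() on each, use dask.compute. For example, with a and b from above:
from dask import compute
# Evaluate a and b together, sharing one task graph
res_a, res_b = compute(a, b)
print(res_a, res_b)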
d) Speed Comparison
import time
from dask import delayed, compute
# --- Serial Version (Python) ---
def slow_square(x):
    time.sleep(0.5)  # Simulate work
    return x * x
start = time.time()
results = [slow_square(x) for x in range(10)]
total_serial = sum(results)
print("Serial total:", total_serial)
print("Serial time: %.2f seconds" % (time.time() - start))
# --- Dask Parallel Version ---
@delayed
def slow_square_dask(x):
    time.sleep(0.5)
    return x * x
start = time.time()
tasks = [slow_square_dask(x) for x in range(10)]
total_parallel = delayed(sum)(tasks)
print("Parallel total:", total_parallel.compute())
print("Parallel time: %.2f seconds" % (time.time() - start))
Pre-made Demos
3. Running Example Scripts
Clone the official Dask examples:
git clone https://github.com/dask/dask-examples.git
cd dask-examples
Run an example notebook or script:
pip install jupyterlab
jupyter lab
Look in folders like dataframe/, array/, delayed/, or distributed/ for ready-to-run demos.
Running on TAMU FASTER
Authorized ACCESS users can log in using the Web Portal:
Go to Clusters -> Shell Access
In the shell:
cd $SCRATCH
git clone https://github.com/dask/dask-examples.git
Dask examples docs: https://examples.dask.org/
On the portal dashboard, go to Interactive Apps -> Jupyter Notebook
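For multi-node runs, Dask workers can also be launched as Slurm batch jobs with dask-jobqueue. A minimal sketch, assuming dask-jobqueue is installed in your environment; the account, queue, and sizing values below are placeholders to replace with your allocation's settings:
from dask_jobqueue import SLURMCluster
from dask.distributed import Client
# Each "job" below is one Slurm batch job running one Dask worker
cluster = SLURMCluster(
    cores=4,               # cores per job
    memory="16GB",         # memory per job
    walltime="01:00:00",
    account="my_account",  # placeholder: your Slurm account
    queue="cpu",           # placeholder: your partition name
)
cluster.scale(jobs=4)      # request 4 worker jobs
client = Client(cluster)
print(client.dashboard_link)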
Extra Conda Setup
Conda Setup for Dask (Recommended)
1. Create a Conda Environment
conda create -n dask-env python=3.10 -y
2. Activate the Environment
conda activate dask-env
3. Install Dask (Core + Scheduler)
conda install -c conda-forge dask distributed -y
4. (Optional) Install Common Dependencies
conda install -c conda-forge pandas numpy jupyterlab matplotlib scikit-learn pyarrow fastparquet -y
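5. Verify the Install
A quick sanity check, run inside the activated environment; it prints the installed versions:
python -c "import dask, distributed; print(dask.__version__, distributed.__version__)"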
Adding custom packages to the TAMU Jupyter Notebook portal app:
To create a conda environment called my_notebook (you can name it whatever you like), do the following on the command line:
module purge
module load Anaconda3/2022.10
conda create -n my_notebook
After your my_notebook environment is created, you will see output describing how to activate and use it:
#
# To activate this environment, use:
# > source activate my_notebook
#
# To deactivate an active environment, use:
# > source deactivate
#
Then install the notebook package, after which you can add optional packages to your my_notebook environment:
source activate my_notebook
conda install -c conda-forge notebook
conda install -c conda-forge <optional-packages>
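For example, to run this workshop's Dask notebooks from the portal, the optional packages would be Dask itself:
conda install -c conda-forge dask distributed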
You can use your Anaconda environment in the Jupyter Notebook portal app by selecting the Anaconda module on the portal app page and entering just the name (without the full path) of your environment in the "Optional Environment to be activated" box. In the example above, the value to enter is my_notebook.