# Dask on HPC

[Slides](https://docs.google.com/presentation/d/1vU7mJ6W-9cQeHdSDeDG2RSaczJT9MhyEE7hxHfhttZM/edit?slide=id.g1bfac4f1f30_0_365#slide=id.g1bfac4f1f30_0_365), Credit to [Dask Community ](https://github.com/dask/community/issues/9)

### What is [Dask](https://www.dask.org/)?

Dask is a parallel computing library that scales Python code from a single laptop to a cluster. It's useful for:

* Handling **big data** that doesn’t fit in memory.
* Speeding up **Pandas**, **NumPy**, and **scikit-learn** workloads.
* Running **tasks concurrently** using graphs.

***

## Run Locally

### 1. Environment Setup

#### Requirements

* Python 3.8+
* pip / conda
* Git (optional, for cloning examples)

#### &#x20;Set up a virtual environment (optional, but recommended)

```bash
python -m venv dask-env
source dask-env/bin/activate  # For Windows, use: dask-env\Scripts\activate
```

#### 📥 Install Dask (CPU only)

```bash
pip install dask[complete]  # includes arrays, dataframes, diagnostics, etc.
```

Or if you're using conda:

```bash
conda install dask distributed -c conda-forge
```

***

### 2. Try Basic Dask Examples

Run these as python files or in a jupyter notebook.

```
pip install jupyterlab
```

```
jupyter lab
```

### &#x20;Use Dask Dashboard (Optional, Very Useful)

Run your script with the **distributed** scheduler to get the dashboard.

```python
from dask.distributed import Client

client = Client()  # Starts a local cluster
print(client.dashboard_link)
```

* Open the printed URL in your browser (e.g. <http://127.0.0.1:8787>).
* See live task execution, memory usage, worker status, etc.

####

#### a) Dask DataFrame (Parallel Pandas)

```python
import dask.dataframe as dd
!pip install requests aiohttp

# Load a large CSV in chunks
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')

# Operations are lazy: nothing happens until compute()
print(df.head())             # Triggers computation
print(df.total_bill.mean().compute())  # Mean of a column
```

#### &#x20;b) Dask Array (Parallel NumPy)

```python
import dask.array as da

# Create a large array (10000 x 10000) split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))

# Compute the mean
mean = x.mean()
print(mean.compute())  # Triggers computation
```

#### &#x20;c) Dask Delayed (Manual Task Graphs)

```python
from dask import delayed
import time

def slow_add(x, y):
    time.sleep(1)
    return x + y

# Wrap with delayed
a = delayed(slow_add)(1, 2)
b = delayed(slow_add)(3, 4)
c = delayed(slow_add)(a, b)

# Nothing runs until compute
print(c.compute())  # Takes ~3 seconds (runs in parallel)

```

#### d) Speed Comparison

```python
import time
from dask import delayed, compute

# --- Serial Version (Python) ---
def slow_square(x):
    time.sleep(0.5)  # Simulate work
    return x * x

start = time.time()
results = [slow_square(x) for x in range(10)]
total_serial = sum(results)
print("Serial total:", total_serial)
print("Serial time: %.2f seconds" % (time.time() - start))


# --- Dask Parallel Version ---
@delayed
def slow_square_dask(x):
    time.sleep(0.5)
    return x * x

start = time.time()
tasks = [slow_square_dask(x) for x in range(10)]
total_parallel = delayed(sum)(tasks)
print("Parallel total:", total_parallel.compute())
print("Parallel time: %.2f seconds" % (time.time() - start))

```

***

## Pre made Demos

### 5. Running Example Scripts

Clone the official Dask examples:

```bash
git clone https://github.com/dask/dask-examples.git
cd dask-examples

```

Run an example notebook or script:

```
pip install jupyterlab
```

```
jupyter lab
```

> Look in folders like `dataframe/`, `array/`, `delayed/`, or `distributed/` for ready-to-run demos.

## Running container on TAMU Faster

Authorized ACCESS users can log in using the Web Portal:

{% embed url="<https://portal-faster-access.hprc.tamu.edu>" %}

Go to Cluster -> Shell Access

on the shell:

```
cd $SCRATCH
git clone https://github.com/dask/dask-examples/tree/main

```

dask examples docs:\
<https://examples.dask.org/>

On the dashboard go -> Interactive Apps -> Jupyter notebook\ <br>

![](/files/7ewBWwtre65TlHZ0OTYx)<br>

***

## Extra conda steups

### Conda Setup for Dask (Recommended)

#### 1. Create a Conda Environment

```bash
bashCopyEditconda create -n dask-env python=3.10 -y
```

#### 2. Activate the Environment

```bash
bashCopyEditconda activate dask-env
```

#### 3. Install Dask (Core + Scheduler)

```bash
bashCopyEditconda install -c conda-forge dask distributed -y
```

#### 4. (Optional) Install Common Dependencies

```bash
bashCopyEditconda install -c conda-forge pandas numpy jupyterlab matplotlib scikit-learn pyarrow fastparquet -y
```

####

***

### Adding custom packages to tamu jupyter notebook:

To create an Anaconda conda environment called my\_notebook (you can name it whatever you like), do the following on the command line:

```
module purge
module load Anaconda3/2022.10
conda create -n my_notebook
```

After your my\_notebook environment is created, you will see output on how to activate and use your my\_notebook environment

```
#
# To activate this environment, use:
# > source activate my_notebook
#
# To deactivate an active environment, use:
# > source deactivate
#
```

Then you need to install notebook and then you can add optional packages to your my\_notebook environment

```
source activate my_notebook
conda install -c conda-forge notebook
conda install -c conda-forge <optional-packages>
```

You can use your Anaconda/ environment in the Jupyter Notebook portal app by selecting the Anaconda/ module in the portal app page and providing just the name (without the full path) of your Anaconda/ environment in the "Optional Environment to be activated" box. In the example above, the value to enter is: **my\_notebook**


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://workshop.dukeieee.org/workshops/dask-on-hpc.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
