pip install "dask[complete]"  # includes arrays, dataframes, diagnostics, etc. (quotes needed in zsh)
Or if you're using conda:
conda install dask distributed -c conda-forge
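A quick sanity check confirms the install worked and shows which version you got:

```python
import dask

# A successful import confirms the installation; the version string is
# useful when checking docs compatibility or reporting issues.
print(dask.__version__)
```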
2. Try Basic Dask Examples
Run these as Python files or in a Jupyter notebook.
pip install jupyterlab
jupyter lab
Use the Dask Dashboard (Optional, Very Useful)
Run your script with the distributed scheduler to get the dashboard.
from dask.distributed import Client
client = Client() # Starts a local cluster
print(client.dashboard_link)
Open the printed URL in your browser (e.g. http://127.0.0.1:8787).
See live task execution, memory usage, worker status, etc.
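As a minimal sketch (assuming dask.distributed is installed), the client can also run work directly; passing processes=False keeps everything in one process, which is handy for quick experiments:

```python
from dask.distributed import Client

# processes=False starts a lightweight in-process cluster (threads only);
# Client() with no arguments would launch separate worker processes instead.
with Client(processes=False) as client:
    print(client.dashboard_link)  # open this URL to watch tasks live
    result = client.submit(sum, [1, 2, 3]).result()  # run one task on the cluster
    print(result)
```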
a) Dask DataFrame (Parallel Pandas)
# Reading a CSV over HTTP requires extra dependencies:
# pip install requests aiohttp
import dask.dataframe as dd
# Load a large CSV in chunks
df = dd.read_csv('https://raw.githubusercontent.com/mwaskom/seaborn-data/master/tips.csv')
# Operations are lazy: nothing happens until compute()
print(df.head()) # Triggers computation
print(df.total_bill.mean().compute()) # Mean of a column
b) Dask Array (Parallel NumPy)
import dask.array as da
# Create a large array (10000 x 10000) split into 1000x1000 blocks
x = da.random.random((10000, 10000), chunks=(1000, 1000))
# Compute the mean
mean = x.mean()
print(mean.compute()) # Triggers computation
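The chunk layout is visible on the array itself, and deterministic inputs make the result easy to verify (a minimal sketch):

```python
import dask.array as da

# 4000 x 4000 array of ones, split into 1000 x 1000 blocks
x = da.ones((4000, 4000), chunks=(1000, 1000))
print(x.chunks)           # block sizes along each axis
print(x.sum().compute())  # -> 16000000.0
```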
c) Dask Delayed (Manual Task Graphs)
from dask import delayed
import time
def slow_add(x, y):
    time.sleep(1)
    return x + y
# Wrap with delayed
a = delayed(slow_add)(1, 2)
b = delayed(slow_add)(3, 4)
c = delayed(slow_add)(a, b)
# Nothing runs until compute
print(c.compute()) # Takes ~2 seconds: a and b run in parallel, then c
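Several delayed objects can also be evaluated together with dask.compute, which merges them into one graph so any shared intermediates run only once (a small sketch):

```python
from dask import delayed, compute

@delayed
def inc(x):
    return x + 1

# One call evaluates both objects in a single shared task graph
a, b = compute(inc(1), inc(2))
print(a, b)  # -> 2 3
```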
d) Speed Comparison
import time
from dask import delayed, compute
# --- Serial Version (Python) ---
def slow_square(x):
    time.sleep(0.5)  # Simulate work
    return x * x
start = time.time()
results = [slow_square(x) for x in range(10)]
total_serial = sum(results)
print("Serial total:", total_serial)
print("Serial time: %.2f seconds" % (time.time() - start))
# --- Dask Parallel Version ---
@delayed
def slow_square_dask(x):
    time.sleep(0.5)
    return x * x
start = time.time()
tasks = [slow_square_dask(x) for x in range(10)]
total_parallel = delayed(sum)(tasks)
print("Parallel total:", total_parallel.compute())
print("Parallel time: %.2f seconds" % (time.time() - start))
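By default, dask.delayed uses the threaded scheduler; compute() accepts a scheduler argument if you want to experiment (a sketch with trivial tasks so the result is easy to check):

```python
from dask import delayed

@delayed
def square(x):
    return x * x

tasks = [square(x) for x in range(10)]
total = delayed(sum)(tasks)

# Options include 'threads' (default), 'processes', and
# 'synchronous' (single-threaded, useful for debugging)
print(total.compute(scheduler="threads"))      # -> 285
print(total.compute(scheduler="synchronous"))  # -> 285
```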
Pre-made Demos
5. Running Example Scripts
Clone the official Dask examples:
git clone https://github.com/dask/dask-examples.git
cd dask-examples
Run an example notebook or script:
pip install jupyterlab
jupyter lab
Look in folders like dataframe/, array/, delayed/, or distributed/ for ready-to-run demos.
Running a container on TAMU FASTER
Authorized ACCESS users can log in using the Web Portal:
Go to Cluster -> Shell Access
In the shell:
cd $SCRATCH
git clone https://github.com/dask/dask-examples.git
On the dashboard, go to Interactive Apps -> Jupyter Notebook
You can use your Anaconda environment in the Jupyter Notebook portal app by selecting the Anaconda module on the portal app page and entering just the name (without the full path) of your environment in the "Optional Environment to be activated" box. In the example above, the value to enter is: my_notebook