Overview

UC San Diego's Data Science/Machine Learning Platform (DSMLP) provides undergraduate and graduate students with access to research-class CPU/GPU resources for coursework, formal independent study, and student projects.

Built and operated by IT Services (ITS), with additional financial contributions from Cognitive Science and Jacobs School of Engineering, DSMLP leverages Qualcomm Institute's current research into cost-effective machine-learning cyberinfrastructure using Kubernetes and Docker container technologies.

Please note, DSMLP is not intended to store personally identifiable information (PII) or other sensitive data. It is also against policy to manually run code without the launch.sh script. For more info, see Policies below.

How to get help

For questions about a course, such as a problem with a homework assignment, please contact your TA or instructor.

To get help with DataHub or related systems, contact the ITS Service Desk at datahub@ucsd.edu to create a case. Please include the following information:

The course code, e.g. COGS 108.
Which system you are having the problem with, e.g. DataHub or dsmlp-login.
The container, e.g, scipy-ml-notebook
Any commands you are running, e.g. launch.sh
Screenshots of the problem

You may also reach the ITS Service Desk by phone/walk-in. However, it is better to create a case online.

There are also several self-help documents in the Specialized Instructional Computing Knowledge Base.

Accessing Datahub

Why can't I log in?

Log out of all Google accounts or open an incognito window. When prompted, enter your full UCSD email, "username@ucsd.edu", as your credentials. Currently only '@ucsd.edu' addresses are accepted in the Data Science and Machine Learning Platform, not departmental or divisional addresses such as '@eng.ucsd.edu' or '@physics.ucsd.edu'

Access is restricted to authorized users. Students may obtain access to DSMLP through the following methods:

By automatic setup of courses assigned to DSMLP: enrolled students receive access shortly before the first day of each term.
Via the DSMLP Independent Study Access Request Form

UC San Diego Extension students enrolled in a DSMLP/datahub course should fill out the Concurrent Enrollment Account form and ask their instructor to add them to the course via Canvas.

For more information on eligibility, visit the DSMLP page on blink.ucsd.edu.

Note that we perform required weekly patching on datahub/DSMLP every Wednesday from 6-8 AM Pacific Time. This may result in reduced capacity, and depending on the nature of the update, the service as a whole may be inaccessible.

How do I get to my course environment?

After logging in, make sure you are selecting the correct container for your course on the "Select your (Course) Environment" page. The course name will be identified on the container. Some courses have multiple containers available; consult with your TA/instructor for guidance.

After launching your container, you will see the Jupyter notebook server environment. To open a new Jupyter notebook, select "New" (top right of page), then "Python 3".

When your work is complete, please shut down your container via "Control Panel" (top right), then click the Stop my Server button.

Why isn't my course environment listed after I log in?

If you recently registered in a course, please allow 1-2 business days for resources to be listed. Otherwise, please file a IT Services support ticket, include reasons for access, what course you are registered in, and whether you are auditing the course. If auditing a course, please get the instructor's permission prior to submitting a request.

If you are no longer enrolled in any courses, please see "How long does access last after the end of term?" below on how to retrieve your files when you can no longer access your course environment.

Why won't my container start? ("Spawn failed when starting server" error)

If your container is failing to load, please check for these two issues before submitting a ticket to the DataHub team.

Run Manual Resetter

Please reset your profile:

Navigate to datahub.ucsd.edu
Click on the "services" dropdown
Select "manual-resetter", and click on the reset button. This will stop any servers you are running, log you out, and reset your profile, while preserving all work/files.

.local Directory Issues

Some Python packages installed under .local/lib/python3.x/site-packages may be preventing your notebook from working correctly. If these files exist, moving them may fix your container.

You can "ssh USERNAME@dsmlp-login.ucsd.edu" to manage the files in your notebook. Run 'workspace -c COURSE_ID' open your course workspace. (You can run 'workspace -l' to list your available workplaces). Once there you can relocate the files using "mv .local/lib .local/lib.old" or delete them.

If you do need to install custom pip packages, we recommended that you use a virtual environment. Please refer to the "How do I install Python packages from my own virtual environment on datahub.ucsd.edu?" section.

Disk Quota Exceeded

You can check your quota via on DataHub in the navigation bar: Services->disk-quota-service. You may also go to https://datahub.ucsd.edu/services/disk-quota-service.

Check if you have received a "disk quota exceeded" email. If so, please see the disk space quota section below.

I'm getting a 504 error

A 504 error can occur when the pod crashes, for example, after running an infinite loop or running out of memory. Eventually JupyterHub detects the pod crashed and resets the environment. To speed up the process you can terminate the pod on the dsmlp-login server.

To delete the pod run these commands from a terminal:

ssh username@dsmlp-login.ucsd.edu
kubectl get pod
NAME                  READY   STATUS    RESTARTS   AGE
dsmlp-jupyter-username   1/1     Running   0          124m
kubectl delete pod dsmlp-jupyter-username

After the pod is deleted return to datahub.ucsd.edu and run Service -> manual-resetter.

Files and Quotas

Users have two storage pools and each pool has a separate quota. One pool is for the course workspace, and the other pool is for personal use (/private). Files in the /public and /teams folders count against the workspace quota. Files in /private count against the personal quota.

Files	Path in Jupyter notebook UI	Notebook Location	Quota
Course home directory	/	/home/USERNAME	Workspace
Course team directory	/teams	/home/USERNAME/teams/TEAM	Workspace
Course public directory	/public	/home/USERNAME/public	Workspace
Personal directory	/private	/home/USERNAME/private	Personal

What datasets are available?

Course datasets typically reside in the public directory.

There are are also several large datasets available for use with DSMLP/Datahub. To access these datasets from datahub.ucsd.edu, launch your environment, select “New->Terminal” at the top right of the notebook server interface, and enter: cd /datasets. To access these datasets from the command line on dsmlp-login, ssh to dsmlp-login.ucsd.edu and use the previous command to access the datasets directory.

For more information on what datasets we have available, see: https://datahub.ucsd.edu/hub/datasets

How do I upload a large file or files?

See: How to: File/Data Transfer - Data Science/Machine Learning Platform (DSMLP)

How do I check my disk quota?

You can check your quota with the Disk Quota Service. Access the Disk Quota Service on DataHub under the Services menu.

How do I delete files?

Launch a notebook for the course. Delete some files that are not in the private folder. Files in the private folder are personal files and don't count against the course quota.

Note: When you delete a file with the Jupyter UI it goes into the trash folder at .local/share/Trash. These files can accumulate. To manually delete them, open a terminal and cd to that directory and delete the files using rm. Files should get automatically deleted out of Trash after 7 days.

How do I delete files if my notebook won't launch?

There is an alternate way to access DataHub files without using DataHub. This is useful when the notebook doesn't load because the disk is full.

ssh USERNAME@dsmlp-login.ucsd.edu

The course ID is needed to access the files. The course ID is displayed in Disk Quota Service. Or you can find it with the workspace --list command.

[username@dsmlp-login ~]$ workspace --list
2023-03-01 11:51:58,533 - workspace - INFO - Retrieving course info...
Course_ID, Path to Course Workspace Home Directory
--------------------------------------------------
COURSE_ID /dsmlp/workspaces-fs04/COURSE_ID/home/username

Use the workspace -c command to go to your course home directory.

workspace -c COURSE_ID

Inside the course directory you can list the files with the ls command and delete files with the rm command. Use the du (disk usage) command to see which directories use the most space, e.g. du -h --max-depth 1.

Once the disk usage drops below the quota notebooks should launch again on DataHub.

How do I clear my personal quota?

The personal directory is what appears in the "private" folder on DataHub. These are files that belong to you, not the course.

Please refer to the section "How do I delete files if my notebook won't launch?".

There is no need to run the workspace command dsmlp-login. The home directory is your personal directory.

Uploading Large/Bulk Files

Large input datasets should reside in a single location rather than being downloaded to each student's home directory. Students who require more storage should ask their instructor (or TA) to submit a request via the instructors' course Service Desk ticket. For Independent Study students (not enrolled in a datahub course) who require more storage, please submit a Service Desk ticket.

If you're uploading large files (> 64MB) or a lot of files, then uploading them with the Jupyter UI may not be ideal.

One option is to ZIP the files, upload the ZIP, and then extract the ZIP. Or you could put the files into git and do a git pull.

Another option is to use scp to transfer the files. To do this ssh to dsmlp-login. This is your personal directory. You will notice that this path name is different than what you see inside a course notebook.

You can use scp to copy the files here from your own pc to datahub, or backup files from datahub. Type "pwd" to show the path to your home directory.

Uploading files

scp /local/path username@dsmlp-login.ucsd.edu:~/remote/path

Downloading files

scp username@dsmlp-login.ucsd.edu:~/remote/path /local/path

If you need to copy files to your course home directory run "workspace --list" to show the directory they are located in. It should be something like /dsmlp/fs0x-workspaces/COURSE/home/USERNAME. The directory may be invisible until you cd to it. Team files are located under the same path, but with /teams instead of home, e.g. /dsmlp/fs0x-workspaces/COURSE/teams/TEAM. Use this path in scp instead of your personal home.

General Questions/Errors/Problems

How can I fix "GPU quota exceeded"?

The message "GPU quota exceeded. Wanted 1 but with 1 already in use, the quota of 1 would be exceeded" means that the pod couldn't launch because there is already a pod running with a GPU. To launch a new pod with a GPU the old one must be stopped. Usually the old pod crashed and will stop in a minute or two.

If this doesn't happen, the kubectl command can stop the old pod. Run the ssh command in a terminal on your desktop PC to connect to dsmlp-login using your AD username and password.

ssh USERNAME@dsmlp-login.ucsd.edu
kubectl get pod
kubectl delete pod [pod ID]

Why am I receiving an out of capacity error, and/or being evicted?

Server capacity is limited, and during peak times (e.g. 10th week/finals) occasional delays are to be expected. GPU resources are the most constrained, so you may have success re-launching your job as CPU-only. If you experience capacity errors for prolonged periods, or at unexpected times of the quarter, please file a Service Desk ticket. ETS staff will respond within one business day.

DSMLP is primarily an instructional resource. Users enrolled in a course have higher priority than independent study and research users. When capacity is exhausted, or course users require resources, you may receive an eviction message notifying you to save your work and exit within ten minutes. Temporary hardware or systems issues can also reduce capacity and result in evictions.

During busy periods, we encourage you to visit the cluster status page to check resource availability.

How long does access last after the end of term?

Access is retained for one additional term, e.g. a fall course is available until the end of winter. Please submit an independent study request (https://go.ucsd.edu/2wc5gH0) to request an extension.

To retrieve your files, please scp them from dsmlp-login.ucsd.edu.

scp -r <username>@dsmlp-login.ucsd.edu:/directory/to/send /local/where/to/put

What are the default resource limits?

Each user has a predefined amount GPU/CPU/RAM based on their enrolled course. To see the available RAM, look at the upper right corner of your notebook server (after opening a notebook). Please be considerate and terminate any unused background jobs, since GPU cards are assigned to containers on an exclusive basis, and when attached to a container are unusable by others even if idle.

If available for your course, we encourage you to use non-GPU (CPU-only) containers until your code is fully tested and a simple training run is successful. (PyTorch, Tensorflow, and Caffe toolkits can easily switch between CPU and GPU.)

GPU types are listed on the cluster status page.

Are groups available?

Groups of users can be created in DSMLP/datahub for sharing datasets, code, etc. Your TA or instructor can set this up in Canvas using the instructions here: https://support.ucsd.edu/services?id=kb_article_view&sysparm_article=KB0030588. Groups will have 100 GB of storage by default; please email datahub@ucsd.edu to request additional space.

How do I kill / stop a notebook?

Use the control panel (button top right of the Jupyter server) to stop the notebook, there may be a delay in updating interface, please let the process complete. Logging Out Does NOT Stop the Server (but it may stop due to inactivity).

How can I run a long-running job?

If you have a job that needs to run for an extended period of time, we recommend running from the command line. See "How do I launch a container from the command line" below. If you are running from within datahub.ucsd.edu, you must keep your browser window open. More information: https://zero-to-jupyterhub.readthedocs.io/en/stable/jupyterhub/customizing/user-management.html

How do I customize, fix or reset my environment?

Please see How To: Customize your environment in DSMLP/Datahub (including jupyter notebooks)

What additional kernels/software are available in my datahub course container?

If your course container has additional software versions available, these can be accessed via the "New" button at the top right of your notebook server interface in datahub.ucsd.edu. Additional kernels containing those software versions will appear in the dropdown menu below the standard python notebook option.

How do I launch a container from the command line (outside datahub.ucsd.edu)?

See "Launching Containers from the Command Line - DSMLP" and "How to Select and Configure Your Container - DSMLP"

What version of cuda is installed in the machine learning notebook container?

See the current installed version of cuda by referring to the "cuda install cudatoolkit" line in our scipy-ml-notebook container here: https://github.com/ucsd-ets/scipy-ml-notebook/blob/master/Dockerfile

This version may be behind the newest cuda release(s). This is because upgrading cuda requires coordinating an update of its drivers and various toolkits. Therefore we typically perform the update process during the summer academic terms.

For example, in Summer 2021, current cuda version = 10.1, updated to cuda 11+ for Fall 2021 term.

Why am I receiving the error "RuntimeError: CUDA out of memory"?

Each GPU has a limited amount of physical memory available to it, independent of the amount of RAM available to your pod. This message is seen when you exhaust that memory, and indicates that you need to work with your TA to modify how much GPU memory your work requires.

The GPUs on different compute nodes in the DataHub cluster have different amounts of memory available. Most nodes have 11GB, but others may have more or less. To view the capacities of the GPU attached to your current pod, run the command "nvidia-smi". (Note that you will not be able to see the list of processes running on the GPU from your pod.)

The amount of free memory can also be determined programmatically by using the cudaMemGetInfo() function in the CUDA Runtime API.

Subtle choices in PyTorch library usage can result in memory leaks that exhaust GPU memory. For suggestions on how to manage CUDA memory in PyTorch, see the section My model reports "cuda runtime error(2): out of memory" on the PyTorch FAQ.

Tensorflow's default behavior is to claim all available RAM on the GPU, and not let go of it until the process that claimed it (typically a notebook kernel for DSMLP) dies. If you want to reclaim the memory (incase you would like to run another process in a different notebook or swap from Tensorflow to PyTorch in the same notebook, you may restart the kernel. You may also run this command from Tensorflow that dynamically claims memory as the kernel needs it. More info can be found on the Tensorflow GPU Guide.

# Prevent TF from using all available NVRAM...
import tensorflow as tf
gpus = tf.config.list_physical_devices('GPU')
for gpu in gpus:
  tf.config.experimental.set_memory_growth(gpu, True)

What if I get a database locked error message?

1. Open New -> Terminal
2. rm ~/.local/share/jupyter/nbsignatures.db
3. Close terminal window
4. Control Panel -> Stop
5. Restart notebook

How do I launch my first RStudio session?

Information on using R and RStudio with DSMLP/Datahub can be found here.

Where is the "classic" Jupyter Notebook user interface?

As of our 2024.4 set of docker images, we default to the JupyterLab UI. If you want to switch your UI to Notebook 7/Nbclassic, open a notebook (.ipynb) and click on the Open In button. You can also replace /lab in the URL with /tree to access Notebook 7, e.g., https://datahub.ucsd.edu/user/<username>/tree.

I'm getting a "failed to validate" or "source of the following cell has changed" error/I can't autograde a particular submission due to "corrupt metadata"

This occurs if a read-only cell/an autograded cell has been copied. Do not copy and paste these cells in your notebook. Similar errors, such as "Failed validating 'required' in notebook" might occur if you edit or delete cells provided by the instructor. To resolve this problem, you (or your student) will need to:

Back up your assignment: Rename the existing notebook by adding '-corrupted' to the end of the notebook filename, and download a copy to your computer in case you need to start over.
Re-fetch your notebook from the 'Assignments' tab. You will now see both the fresh copy and the corrupted copy in your Files.
Open both copies of the notebook, and copy your answers from the corrupted notebook to the fresh copy
Re-submit your assignment

Policies

Sensitive Data

UC Data Protection Levels are defined at: https://security.ucop.edu/files/documents/uc-protection-level-classification-guide.pdf

DSMLP is not suitable for storage or processing of Category P3/P4 data, which includes:

government classified/controlled (CUI/CTI/ITAR/FISMA)
health or personal protected information (PHI/HIPAA, IRB controlled, or statutory PII)
other students' grades or academic records (FERPA)
information subject to certain Data Use Agreements (DUA)

Please visit the link above for a full list of P3/P4 data.

If your project may involve P3/P4 data, please contact ETS or Research IT Services for a consultation.

Running jobs on dsmlp-login

DSMLP is primarily designed to facilitate the execution of docker images using the launch.sh command. By running launch.sh, dedicated nodes are provisioned, allowing the utilization of these images to create customized environments suitable for various development workflows. It is important to note that dsmlp-login itself is prohibited for manual job execution, such as running Python scripts, Java projects, or machine learning tasks. Developing within these dedicated nodes, instead of using dsmlp-login, can help minimize any potential impact on server performance.

If you still have questions or need additional assistance, please email datahub@ucsd.edu or visit support.ucsd.edu.

datahub.ucsd.edu and Data Science/Machine Learning Platform FAQ