Sign up for a TACC user account here: https://portal.tacc.utexas.edu/. After your account has been approved, enable Multi-factor Authentication by following the instructions here.
Individual users have been added to the DARPA-SHADE project. You can view the projects associated with your account by logging into the portal and selecting 'Projects and Allocations' from the 'Allocations' drop-down in the main menu bar. Most users who have not used TACC resources before will only see the DARPA-SHADE project. Selecting the DARPA-SHADE project provides additional details, including the resource allocations and users associated with the project. Current compute allocations include Frontera, Longhorn, and Lonestar6. In addition to the 1TB of user-specific storage on Stockyard, the shared Lustre filesystem mounted on Frontera, Longhorn, and Lonestar6, we have also provisioned storage on Corral where you can share data within and among teams. Access and basic usage are discussed below.
Secure shell ("ssh" command) is the standard way to connect to the login nodes on TACC systems. To initiate a session:
#Frontera
localhost$ ssh username@frontera.tacc.utexas.edu
#Lonestar6
localhost$ ssh username@ls6.tacc.utexas.edu
#Longhorn
localhost$ ssh username@longhorn.tacc.utexas.edu
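If you connect frequently, you can optionally add host aliases to ~/.ssh/config on your local machine so that, for example, "ssh frontera" expands to the full command above. This is a convenience sketch, not a required step; replace username with your TACC username.
# ~/.ssh/config (on your local machine)
# "username" is a placeholder for your TACC username
Host frontera
    HostName frontera.tacc.utexas.edu
    User username
Host ls6
    HostName ls6.tacc.utexas.edu
    User username
With these entries in place, "localhost$ ssh frontera" behaves the same as the full command shown above.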
You will be prompted to enter your password followed by the multi-factor authentication (MFA) code. Welcome text, project/allocation information, and disk quotas are presented upon successful login. By default, you will be placed in your home directory (run echo $HOME or pwd to see the path), which is specific to each machine (i.e., it is not shared across systems). Login nodes are shared resources and must accommodate hundreds of other users simultaneously. They are intended for editing, compiling code, submitting jobs, file transfer, and other low-impact tasks. Please do not run applications on the login nodes.
Each machine contains 3 standard file systems. The environment variables $HOME, $WORK, and $SCRATCH store the paths to the directories that you own on each filesystem.
File System | Quota | Key Features |
---|---|---|
$HOME | Frontera: 25GB, LS6: 10GB, Longhorn: 10GB | Not intended for parallel or high-intensity file operations. Best for cron jobs, small scripts, environment settings. Backed up regularly, not purged. |
$STOCKYARD | 1TB across all TACC systems | Global Shared Filesystem. Not intended for high-intensity file operations or jobs involving very large files. Good for storing software installations, original datasets, job scripts, and templates. Not backed up, not purged. |
$SCRATCH | no quota | All job I/O activity, temporary storage. Files that have not been accessed in 10 or more days are subject to purge. |
Note: different systems offer additional resources. For example, Frontera includes two additional file systems, $SCRATCH2 and $SCRATCH3, which can accommodate intensive parallel I/O operations. Consult the user guides for further details.
Stockyard is the Global Shared File System at TACC. It is mounted on all major TACC clusters. The $STOCKYARD environment variable points to the highest-level directory that you own on the shared file system. This variable is consistent across all TACC resources that mount Stockyard. The $WORK environment variable, on the other hand, is resource-specific and varies across systems. $WORK is a subdirectory of $STOCKYARD.
/work/12345/bjones/ #$STOCKYARD on all systems
|
|---> /frontera #$WORK on frontera
|
|---> /lonestar6 #$WORK on LS6
|
|---> /longhorn #$WORK on longhorn
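A quick way to confirm where these directories live on the system you are logged into is to echo the variables and cd into them (illustrative; the actual paths will contain your own group and username):
login1.ls6$ echo $STOCKYARD     # top-level directory you own on Stockyard, same on every system
login1.ls6$ echo $WORK          # resource-specific subdirectory of $STOCKYARD
login1.ls6$ cd $WORK            # move to your work directory
login1.ls6$ cd $SCRATCH         # move to your scratch directory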
For Linux-based systems, scp or rsync can be used to transfer files to TACC systems. Windows SSH clients typically include scp-based file transfer capabilities. To transfer a file to your home directory on Frontera:
localhost$ scp myfile.txt jadrake@frontera.tacc.utexas.edu:
To transfer a file directly to your work directory on LS6, first retrieve the path to your work directory:
login1.ls6(11)$ echo $WORK
/work/01262/jadrake/ls6
then use the path to upload the file:
localhost$ scp myfile.txt jadrake@ls6.tacc.utexas.edu:/work/01262/jadrake/ls6
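rsync works similarly and is convenient for transferring directories or resuming interrupted transfers; a sketch using the same illustrative username and work path as above:
localhost$ rsync -av mydata/ jadrake@ls6.tacc.utexas.edu:/work/01262/jadrake/ls6/mydata/
The -a flag preserves permissions and timestamps, and -v lists each file as it is transferred.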
Lmod is a module system developed and maintained at TACC that makes it easy to manage your environment and gives you access to installed software packages. Loading a module amounts to choosing a specific package by defining or modifying environment variables. To list the available modules, run the following on the login node:
login1.ls6(14)$ module av
To load a specific module:
login1.ls6(17)$ module load python3/3.9.7
To see which modules are currently loaded:
login1.ls6(18)$ module list
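A few other Lmod commands are frequently useful; for example (see module help for the full list):
login1.ls6$ module spider python      # search all installed modules for a package
login1.ls6$ module unload python3     # remove a previously loaded module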
Batch Mode
TACC systems run a job scheduler, the Slurm Workload Manager, which provides commands to submit, manage, monitor, and control jobs. Jobs submitted to the scheduler are queued, then run on the compute nodes. TACC does not implement node-sharing. Your queue wait times will be shorter if you request only the time you need: the scheduler will have a much easier time finding a slot for the 2 hours you really need than for the 12 hours you might have requested in your job script. Specific details, such as the names of each queue and job request limits, can be found in each machine's user guide. Below we provide an overview of important concepts.
Jobs are submitted from the login node. To submit a batch job (i.e., an unattended job), use the command sbatch followed by your job script (discussed below).
login1.ls6(20)$ sbatch myjobscript
Until your batch job begins it will wait in the queue. You do not need to remain connected while the job is waiting or executing.
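Options can also be passed to sbatch on the command line, where they take precedence over the corresponding #SBATCH directives in the script; a sketch using standard Slurm options:
login1.ls6$ sbatch -p normal -t 02:00:00 myjobscript   # override the queue and run time set in the script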
Interactive Mode
An interactive session can be launched on a compute node using idev, a utility developed at TACC that submits a batch script requesting access to a compute node and then automatically connects you (via ssh) to that node. To launch a thirty-minute interactive session on a single node in the development queue:
login1.ls6(20)$ idev
To launch an interactive job on the normal queue, with 2 nodes, for 120 minutes:
login1.ls6(20)$ idev -p normal -N 2 -m 120
Job Scripts
Slurm's sbatch command is used to submit a batch job script. #SBATCH directives are used within the script to specify a number of parameters/options. The user guides provide various example job scripts; a basic example is provided below.
Example: serial job, small queue, on Frontera.
#!/bin/bash
#----------------------------------------------------
# Sample Slurm job script
# for TACC Frontera CLX nodes
#
# *** Serial Job in Small Queue***
#
# Last revised: 22 June 2021
#
# Notes:
#
# -- Copy/edit this script as desired. Launch by executing
# "sbatch clx.serial.slurm" on a Frontera login node.
#
# -- Serial codes run on a single node (upper case N = 1).
# A serial code ignores the value of lower case n,
# but slurm needs a plausible value to schedule the job.
#
# -- Use TACC's launcher utility to run multiple serial
# executables at the same time, execute "module load launcher"
# followed by "module help launcher".
#----------------------------------------------------
#SBATCH -J myjob # Job name
#SBATCH -o myjob.o%j # Name of stdout output file
#SBATCH -e myjob.e%j # Name of stderr error file
#SBATCH -p small # Queue (partition) name
#SBATCH -N 1 # Total # of nodes (must be 1 for serial)
#SBATCH -n 1 # Total # of mpi tasks (should be 1 for serial)
#SBATCH -t 01:30:00 # Run time (hh:mm:ss)
#SBATCH --mail-type=all # Send email at begin and end of job
#SBATCH -A myproject # Project/Allocation name (req'd if you have more than 1)
#SBATCH --mail-user=username@tacc.utexas.edu
# Any other commands must follow all #SBATCH directives...
module list
pwd
date
# Launch serial code...
# python3 myscript.py
./mycode.exe # Do not use ibrun or any other MPI launcher
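For comparison, an MPI code on TACC systems is launched with ibrun rather than executed directly, with -N and -n set to the desired node and task counts. A minimal sketch of the lines that would change (the executable name and counts are placeholders, to be sized for the target machine and queue):
#SBATCH -N 2                # Total # of nodes for the parallel job
#SBATCH -n 8                # Total # of MPI tasks (placeholder value)

ibrun ./mympicode.exe       # launch the MPI code with TACC's ibrun utility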
Slurm's squeue command allows you to monitor jobs in the queues, whether pending or running:
login1$ squeue # show all jobs in all queues
login1$ squeue -u bjones # show all jobs owned by bjones
login1$ man squeue # more info
Excerpt from the default output of the squeue command (in the ST column, PD indicates a pending job and R a running job):
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
25618 normal SP256U connor PD 0:00 1 (Dependency)
25944 normal MoTi_hi wchung R 35:13 1 c112-203
25945 normal WTi_hi_e wchung R 27:11 1 c113-131
25606 normal trainA jackhu R 23:28:28 1 c119-152
To cancel a job you submitted, use squeue to find the JOBID, then use scancel JOBID to cancel the job. Use squeue again to confirm that the job was successfully terminated.
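For example (the JOBID shown is taken from the excerpt above and is illustrative):
login1$ squeue -u bjones        # find the JOBID of the job to cancel
login1$ scancel 25944           # cancel that job
login1$ squeue -u bjones        # confirm it no longer appears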