Research Computing Frequently Asked Questions
- I can't ssh to a machine anymore; I get a serious-looking error that starts with
@ WARNING: REMOTE HOST IDENTIFICATION HAS CHANGED! @
- My calculations, or other file operations complain that the file can't be accessed, or it does not exist, even though I have just created or modified it.
- Starting the Emacs editor is very slow.
- Opening a file in Emacs is very slow.
- Troubleshooting Slurm jobs that won't start (errors and other reasons)
- I would like to change my shell (to bash or tcsh)
- I would like to change the email address CHPC uses to contact me
- My program crashed because /tmp has filled up
- I am getting a message "Disk quota exceeded" when logging in
- How do I check that my job is running efficiently?
- My calculations are running slower than expected
- Jobs are running out of memory
- I am running a campus VPN on my home computer and can't connect to some websites
- I would like to unsubscribe from CHPC e-mail messages
- I can not log into CHPC; I get either a "host can not be reached" error or a lockout message
Although it looks scary, this error is usually benign. It occurs when the SSH keys on the machine you are trying to connect to change, most commonly after an operating system upgrade. There are two ways to get rid of this message and log in:
- open file ~/.ssh/known_hosts in a text editor and delete the lines that contain the host name you are connecting to
- use ssh-keygen command with -R flag to remove the ssh keys for the given host
e.g. ssh-keygen -R kingspeak1.chpc.utah.edu
The subsequent ssh connection to the machine will say something like the message below and let you log in:
Warning: Permanently added 'astro02.astro.utah.edu,126.96.36.199' (ECDSA) to the list of known hosts
My calculations, or other file operations complain that the file can't be accessed, or it does not exist, even though I have just created or modified it.
This error can take many forms, but it may look something like this:
ERROR on proc 0: Cannot open input script in.npt-218K-continue (../lammps.cpp:327)
It can also occur intermittently: sometimes the program works, sometimes it does not.
This error is most likely due to the way the file system writes files. For performance reasons, parts of the file are written into a memory buffer, which gets periodically flushed to the disk. If another machine tries to access the file before the machine writing the file has flushed it to the disk, this error occurs. For NFS, which we use for all our home directories and group spaces, it is well described here. There are several ways to deal with this:
- Use the Linux sync command to forcefully flush the buffers to the disk. Run it on both the machine that writes the file and the machine that reads it, BEFORE the file is accessed. To ensure that all compute nodes in the job sync, do "srun -n $SLURM_NNODES --ntasks-per-node=1 sync".
- Sometimes adding the Linux sleep command can help, to provide an extra time window for the syncing to occur.
- Inside the code, use fflush for C/C++ or flush for Fortran. For other languages, such as Python and Matlab, search their documentation for "flush" to see what options are available.
If none of these helps, please try another file system to see if the error persists (e.g. /scratch/global/lustre, or /scratch/local), and let us know.
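The flush-before-read pattern can be illustrated locally as a minimal sketch (the file name is hypothetical; on a cluster you would run sync on every node of the job via the srun command shown above):

```shell
# Minimal sketch of the flush-before-read pattern
echo "final results" > results.txt   # buffered write; may sit in memory
sync                                 # force the kernel to commit buffers to disk
cat results.txt                      # now safer for another NFS client to read
```

In a real job, the sync would run on the node that wrote the file, and again on the node about to read it, before the read happens.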
Emacs's initialization includes accessing many files, which can be slow in a network file system environment. The workaround is to run Emacs in server mode (as a daemon) and to start each terminal session using the emacsclient command. The Emacs daemon stays in the background even after one disconnects from that particular system, so it needs to be started only once per system start.
The easiest way is to create an alias for the emacs command, e.g. in tcsh:
alias emacs emacsclient -a \"\"
Note the escaped double-quote characters (\"). This starts Emacs as a daemon if it is not running already, and then connects to it in client mode.
Note that by default emacsclient starts in the terminal; to force the Emacs GUI to start, add the "-c" flag, e.g. (assuming the aforementioned alias is in place) "emacs -c myfile.txt".
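The alias syntax differs between shells; a sketch for both bash and tcsh (the rc file names are the usual defaults):

```shell
# bash -- add to ~/.bashrc:
alias emacs='emacsclient -a ""'
# tcsh -- add to ~/.tcshrc (note the different syntax and escaping):
#   alias emacs 'emacsclient -a ""'
```

The empty string after -a tells emacsclient to start the Emacs daemon itself if one is not already running.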
We have yet to find the root cause of this problem, but it is most likely related to the number
of files in the directory and the type of file that Emacs is filtering through.
The workaround is to read the file without any contents conversion:
M-x find-file-literally <Enter> filename <Enter>. After opening the file, one can then tell Emacs to apply the appropriate mode, e.g.
to syntax highlight shell scripts,
M-x sh-mode <Enter>.
To bind find-file-literally to a convenient keyboard shortcut, edit the ~/.emacs file to add:
(global-set-key "\C-c\C-f" 'find-file-literally)
Batch job submission failed: Invalid account or account/partition combination specified
This error usually indicates that one is trying to run a job in the general partition, but the research group either does not have an allocation or has used all of its allocation for the current quarter. To view current allocation status, see this page. If your group is either not listed in the first table on that page or there is a 0 in the first column (allocation amount), your group does not have a current allocation. In this case, your group may want to consider completing an allocation request.
Jobs without allocation must run in the freecycle partition. They will have lower priority and will be preemptable. You can see what partitions/accounts you have access to by running the myallocation command. There are also examples of account–partition pairs on the Slurm documentation page. Alternatives include using unallocated clusters (kingspeak, lonepeak, and tangent) or running on owner nodes as a guest (with the possibility of preemption).
This error can also be caused by an invalid combination of values for account and partition: not all accounts work on all partitions. Check the spelling in your batch script or interactive command, and be sure you have access to both the account and the partition. To view the combinations you can use, run the sacctmgr command; more information (including example commands) can be found on the Slurm documentation page.
Batch job submission failed: Node count specification invalid
The number of nodes that can be used for a single job is limited; attempting to submit a job that uses more will result in the above error. This limit is approximately one-half the total number of general nodes on each cluster (currently 32 on notchpeak, 24 on kingspeak, and 106 on lonepeak).
The limit on the number of nodes can be exceeded with a reservation or QOS specification. Requests are evaluated on a case-by-case basis; please contact us (email@example.com) to learn more.
Required node not available (down, drained, or reserved)
or job has "reason code" ReqNodeNotAvail
This occurs when a reservation is in place on one or more of the nodes requested by the job. The "Required node not available (down, drained, or reserved)" message can occur when submitting a job interactively (with srun, for instance); when submitting a script (often with sbatch), however, the job will enter the queue without complaint and Slurm will assign it the "reason code" (which provides some insight into why the job has not yet started) "ReqNodeNotAvail."
The presence of a reservation on a node likely means it is under maintenance. It is possible there is a downtime on the cluster in question; please check the news page and subscribe to the mailing list via the User Portal so you will be notified of impactful maintenance periods.
You can change your shell in the Edit Profile page by selecting the shell you'd like and clicking "Change." This change should take effect within fifteen minutes and you will need to log in again on any resources you were using at the time. That includes terminating all FastX sessions you may have running.
If you only need to use a different shell occasionally (without changing your default), you can start it directly or pass commands to it as arguments (e.g. tcsh -c "echo hello").
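The same pattern works for any installed shell; for example, running a single command under bash:

```shell
# Run one command under a different shell without changing your default;
# the tcsh equivalent is: tcsh -c "echo hello"
bash -c 'echo hello'
```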
You can change the email address CHPC uses to contact you on the Edit Profile page.
Linux defines temporary file systems at /tmp and /var/tmp, where temporary user and system files are stored. CHPC cluster nodes set these temporary file systems up as a RAM disk with limited capacity. All interactive and compute nodes also have spinning-disk local storage at /scratch/local. If a user program is known to need temporary storage, it is advantageous to set the environment variable TMPDIR, which defines the location of the temporary storage, and point it to /scratch/local. Or, even better, create a user-specific directory, /scratch/local/$USER, and set TMPDIR to that as shown in our sample
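A minimal sketch of the corresponding lines for a job script; /scratch/local is assumed to exist on CHPC nodes, so a stand-in base directory is used here to keep the example self-contained:

```shell
# On CHPC nodes you would set SCRATCH_BASE=/scratch/local directly.
SCRATCH_BASE=${SCRATCH_BASE:-/tmp/scratch-demo}
USER=${USER:-$(id -un)}              # make sure $USER is defined
mkdir -p "$SCRATCH_BASE/$USER"       # create the user-specific directory
export TMPDIR="$SCRATCH_BASE/$USER"  # programs honoring TMPDIR now use it
```

Remember to clean the directory up at the end of the job, since local scratch space is shared by all users of the node.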
Default CHPC home directories have a 50 GB storage limit; once it is exceeded, no more files can be written to the home directory. Since some access tools, such as FastX, rely on storing small files in the user's home directory upon login, they will fail.
To display quota information, either run the mydiskquota command or log in to the CHPC user personal details page and scroll down to Filesystem Quotas.
The remedy is to clean up files in your home directory. To do that, log in using a terminal tool, such as PuTTY or Git Bash on Windows, or Terminal on a Mac. Delete large files (using the rm command). You may also be able to scp large files back to your desktop with WinSCP (Windows) or Cyberduck (Mac), or move them to a scratch file system, for example:
mkdir -p /scratch/general/nfs1/$USER
mv big_file /scratch/general/nfs1/$USER
To find large files, run the du -h -d 1|sort -h command in your home directory in a text terminal; it shows the disk space used per directory, with the largest directory at the bottom. Then cd into the directory with the largest usage and repeat until you find the largest files.
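A self-contained sketch of the process on a small demo tree (directory and file names are hypothetical):

```shell
# Build a demo tree with one large and one small directory
mkdir -p demo/big demo/small
head -c 1048576 /dev/zero > demo/big/file   # a 1 MiB file
printf 'tiny\n' > demo/small/file
# Per-directory usage, human-readable, largest entries sorting last
du -h -d 1 demo | sort -h
```

In your home directory you would run the du | sort pipeline on "." instead of "demo" and then cd into whichever directory dominates the output.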
If you clean up a few files and are able to open a FastX session again, you can run the graphical tool baobab, which sorts directories by their size and makes it easier to find all the potentially useless large files.
Run the pestat -u $USER command. Any information shown in red is a warning sign. In the example output below, the user's jobs are utilizing only one CPU out of the 16 or 28 available:
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (MB) (MB) JobId User ...
kp016 kingspeak* alloc 16 16 1.00* 64000 55494 7430561 u0123456
kp378 schmidt-kp alloc 28 28 1.00* 256000 250656 7430496 u0123456
Another possibility is that the job is running low on memory, although we now limit the maximum memory used by a job via SLURM, so this is much less common than it used to be. However, if you notice low free memory along with low CPU utilization, as in the example below, try submitting to nodes with more memory using the #SBATCH --mem=xxx option. Here is an example of pestat output for a high-memory job with low CPU utilization:
Hostname Partition Node Num_CPU CPUload Memsize Freemem Joblist
State Use/Tot (MB) (MB) JobId User ...
kp296 emcore-kp alloc 24 24 1.03* 128000 2166* 7430458 u0123456
First, check how efficiently your jobs are using the compute nodes; this can give some clues about the cause of the problem.
There can be multiple reasons for this, ranging from user mistakes to hardware and software issues. The most common, in order of frequency, are:
- not parallelizing the calculation. In the HPC environment, we obtain speed by distributing the workload across many processors. There are different ways to parallelize depending on the program workflow, ranging from independent calculations to explicit parallelization of the program using OpenMP, MPI, or interpreted languages like Python, R, or Matlab. For basic information on explicit parallelization, see our Introduction to Parallel Computing lecture, or contact us.
- user mistakes, such as not supplying the correct number of parallel tasks, or hard-coding the number of tasks instead of using SLURM variables like $SLURM_NTASKS or $SLURM_CPUS_PER_NODE. Check your SLURM script and program input files, and if in doubt contact us.
- inefficient parallelization. MPI in particular can be sensitive to how efficiently the program's parallelization is implemented. If you need help analyzing and fixing the parallel performance, contact us.
- hardware or software issues on the cluster. If you have ruled out the issues listed above, please contact us.
We enforce memory limits on jobs through the SLURM scheduler. If your job ends prematurely, please check whether the job output has "Killed" at or near its end. That signals that the job was killed for running out of memory.
Occasionally the SLURM memory check does not work, and such jobs end up either slowing down the nodes where they run or putting the nodes into a bad state. This requires sysadmin intervention to recover the nodes; we usually notify the user and ask them to correct the problem, either by requesting more memory for the job (SLURM's --mem option) or by reviewing the workflow and lowering its memory needs.
In either case, it is imperative in these situations to monitor the memory usage of the job. A good initial check is to run the pestat command. If you notice that free memory is low, the next step is to ssh to the affected node and run the top command. Observe the memory use of the program; if it is high and you notice kswapd processes taking some CPU time, the program is using too much memory. Delete the job before it puts the node into a bad state, and remedy the memory needs as suggested above.
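Besides top, a couple of quick one-off checks can be run on the node (assuming a GNU/Linux node with the usual procps tools):

```shell
# Overall memory and swap usage on the node, human-readable
free -h
# Your own processes sorted by resident memory, largest first
ps -o pid,rss,comm -u "$(id -un)" --sort=-rss | head
```

If swap usage is climbing or the largest resident set sizes approach the node's physical memory, the job should be deleted and resubmitted with a larger --mem request.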
A Virtual Private Network (VPN) makes your computer appear as if it were on campus, even when you are off-site. However, because of the way the campus VPN is set up, you may not be able to access certain off-campus internet resources.
We recommend using the VPN only if you need to:
- map network drives to CHPC file servers
- use remote desktop to connect to CHPC machines that allow remote desktop (e.g. Windows servers)
- connect to the Protected Environment resources
All other resources do not need VPN. These include:
- ssh or FastX to connect to CHPC general environment Linux clusters
- accessing secure websites that require University-authenticated login, such as CHPC's webpage, various other campus webpages (Canvas, HR, ...), Box, Google Drive, etc.
You can do that by going to our unsubscribe page. However, note that e-mail announcements are essential to keeping our user base aware of what is happening with our systems and services; we therefore strongly recommend staying subscribed. We try to keep the messages to a minimum, at most a couple per week, unless there is a computer downtime or other critical issue. For ease of e-mail navigation/deletion, we have adopted four standard headlines for all of our messages:
- CHPC DOWNTIME - for announcements related to planned or unplanned computer downtimes
- CHPC PRESENTATION - for CHPC lectures and presentations
- CHPC ALLOCATIONS - for announcements regarding the CHPC resources allocation
- CHPC INFORMATION - for announcements other than the three types above
One possibility for a connection failure is that the server is down, due to either a planned or an unplanned downtime. The best way to stay informed is to make sure you have an up-to-date e-mail contact in your CHPC Profile, so that you receive our announcements, or to check our latest news. To determine whether this is the case, watch our announcements, or try to connect to another CHPC machine to see whether the failure is localized.
Another common possibility is a connection disablement or a lockout after multiple failed login attempts with an incorrect password. Both CHPC and campus authentication implement measures to prevent "brute force" login attacks. After a certain number of failed logins within a certain time period, the login will be disabled for a certain period of time. These parameters vary between the general and protected environments and the campus authentication, and are often not public for security reasons, but in general the period of disablement is an hour or less from the last failed login attempt.
The best approach is to try to log into a different machine (e.g. kingspeak2 instead of kingspeak1, or even a different cluster), or, in the worst case, to wait until the login is enabled again. It is counter-productive to keep trying to log in, even with the correct password, since each attempt resets the disablement timer.
If you cannot log into any CHPC resource, the problem may be with the campus authentication lockout. CHPC uses campus authentication to verify user identity. To check whether campus authentication works, try any of the CIS logins. If they do not work, contact the campus help desk.