On the HPC, the backend compute nodes do not have access to the external internet. As such, any dataset must first be downloaded to the HPC from one of the login nodes. The download must be completed before scheduling a job that accesses the data on the backend compute nodes.
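As a minimal sketch of that download step, the Python helper below could be run on a login node to stream a file into a directory on the HPC before any job is submitted. The function name `download_dataset`, the example URL, and the destination path in the comment are hypothetical and only illustrate the idea; this is not a site-provided tool.

```python
import shutil
import urllib.request
from pathlib import Path


def download_dataset(url: str, dest_dir: str, filename: str) -> Path:
    """Stream `url` into dest_dir/filename and return the final path."""
    target_dir = Path(dest_dir)
    target_dir.mkdir(parents=True, exist_ok=True)
    target = target_dir / filename
    # Write to a ".part" name first so an interrupted download
    # is never mistaken for a complete file.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)
    partial.replace(target)
    return target


# Example call, to be run on a login node (URL and path are placeholders):
# download_dataset("https://example.org/data.tar.gz",
#                  "/WAVE/datasets/example_dataset", "data.tar.gz")
```

A job scheduled afterwards would then read the file from that directory, since the compute nodes cannot fetch it themselves.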
Two filesystems on the HPC can hold the data: /WAVE/projects and /WAVE/datasets. The primary difference between them is backup and recovery. The projects filesystem is backed up daily, whereas the datasets filesystem is not. The datasets filesystem is therefore appropriate for large external datasets that can be recovered by downloading a fresh copy from the external source.
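The choice between the two filesystems can be sketched as a small rule: data that can be re-downloaded goes under /WAVE/datasets, and anything that cannot be recreated goes under /WAVE/projects. The helper `storage_root` and the subdirectory name used below are hypothetical and serve only as an illustration.

```python
from pathlib import Path

PROJECTS_ROOT = Path("/WAVE/projects")  # backed up daily
DATASETS_ROOT = Path("/WAVE/datasets")  # not backed up


def storage_root(redownloadable: bool) -> Path:
    """Pick a filesystem based on whether the data can be fetched again."""
    # Re-downloadable data does not need backups, so it can live on the
    # datasets filesystem; everything else belongs on the backed-up one.
    return DATASETS_ROOT if redownloadable else PROJECTS_ROOT


# "example_dataset" is a placeholder directory name.
external_copy = storage_root(redownloadable=True) / "example_dataset"
derived_results = storage_root(redownloadable=False) / "example_dataset"
```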