Version 5 (modified by davea, 11 years ago)

--

Remote management of input files

For a file to be used as an input file of a BOINC job, it must be available to BOINC clients via HTTP. The standard way to do this is to put the file in the project's "download directory" on the project server.

For projects that use remote job submission, job submitters don't have login access to the server, so they can't store files there directly. Instead, BOINC provides two mechanisms that allow job submitters to place files on the BOINC server.

Each of these mechanisms deals with several issues:

  • File immutability: BOINC requires that a file of a given name can never be changed. Job submitters can't be expected to obey this rule: they must be able to submit one job with an input file of a given name, and a second job with an input file of the same name but different contents.
  • File cleanup: There must be some way to clean up files on the server when they are no longer needed.
  • Authorization: only users authorized to submit jobs should be able to move files to the server.

Note: both mechanisms upload files via PHP. PHP's default maximum file upload size is 2MB. To increase this, edit /etc/php.ini and set, for example:

upload_max_filesize = 64M
post_max_size = 64M

Content-based file management

This system is used by the Condor/BOINC interface. It may be useful for other systems as well. In this system, the name of a file on the BOINC server is based on its MD5 hash; thus file immutability is automatic.

File cleanup is based on file/batch associations. Each file can be associated with one or more batches. Files that are no longer associated with an active batch are automatically deleted from the server.
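The cleanup rule above can be sketched as a small in-memory model. This is only an illustration of the idea, not the server's implementation: the JobFileIndex type and its member names are hypothetical, and how the real server stores associations is not described here.

```cpp
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical in-memory model of file/batch associations.
// All names here are illustrative, not part of BOINC.
struct JobFileIndex {
    // MD5 of a file -> IDs of the batches associated with it
    std::map<std::string, std::set<int>> batches_of;

    // Record that a batch references the file with this MD5.
    void associate(const std::string& md5, int batch_id) {
        batches_of[md5].insert(batch_id);
    }

    // Return the MD5s of files that are not associated with any active batch.
    std::vector<std::string> deletable(const std::set<int>& active_batches) const {
        std::vector<std::string> out;
        for (const auto& entry : batches_of) {
            bool in_use = false;
            for (int b : entry.second) {
                if (active_batches.count(b)) { in_use = true; break; }
            }
            if (!in_use) out.push_back(entry.first);
        }
        return out;
    }
};
```

A file shared by several batches stays on the server until the last of those batches is no longer active.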

The system uses two Web RPCs. These are implemented as XML sent via HTTP POST; the RPC handler is html/user/job_files.php.

The following C++ interfaces are provided (in samples/condor/job_rpc.cpp). They are to be called on the job submission host; the files must exist on that host, and their MD5s must already have been computed.

extern int query_files(
    const char* project_url,
    const char* authenticator,
    int batch_id,
    vector<string> &md5s,
    vector<string> &paths,
    vector<int> &absent_files		// output
);

Inputs:

  • project_url: the project's master URL
  • authenticator: the job submitter's authenticator.
  • paths: a list of file paths on the calling host.
  • md5s: a list of the MD5s of the files.
  • batch_id: the ID of a batch whose jobs will reference the files (these jobs need not exist yet). The operation will fail if the user is not authorized to submit jobs to the batch's application.

Action: for each file, see if it exists on the server. If it does, create an association to the given batch.

Output:

  • return value: nonzero on error
  • absent_files: a list of files not present on the server (represented as indices into the paths and md5s vectors).

extern int upload_files(
    const char* project_url,
    const char* authenticator,
    vector<string> &paths,
    vector<string> &md5s,
    int batch_id
);

Inputs:

  • project_url, authenticator: as above.
  • paths: a list of paths of files to be uploaded
  • md5s: a list of MD5 hashes of these files
  • batch_id: the ID of a batch with which the files are associated. The operation will fail if the user is not authorized to submit jobs to the batch's application.

Action: Upload the files, and create associations to the given batch.

Output:

  • return value: nonzero on error
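
The two calls are naturally used together: query first, then upload only what the server lacks. The sketch below shows one way to turn the indices in absent_files into the argument vectors for upload_files. The helper select_absent is hypothetical (it is not part of job_rpc.cpp), and it assumes that paths and md5s are parallel vectors.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: given the indices reported in absent_files by
// query_files(), collect the matching paths and MD5s for upload_files().
void select_absent(
    const std::vector<std::string>& paths,
    const std::vector<std::string>& md5s,
    const std::vector<int>& absent_files,
    std::vector<std::string>& absent_paths,   // output
    std::vector<std::string>& absent_md5s     // output
) {
    assert(paths.size() == md5s.size());  // assumed: the vectors are parallel
    absent_paths.clear();
    absent_md5s.clear();
    for (int i : absent_files) {
        // at() throws std::out_of_range if the index is invalid
        absent_paths.push_back(paths.at(static_cast<std::size_t>(i)));
        absent_md5s.push_back(md5s.at(static_cast<std::size_t>(i)));
    }
}

// Intended use (not compiled here; query_files and upload_files are the
// interfaces declared above):
//
//   vector<int> absent;
//   if (query_files(url, auth, batch_id, md5s, paths, absent)) { /* error */ }
//   vector<string> p, m;
//   select_absent(paths, md5s, absent, p, m);
//   if (!p.empty() && upload_files(url, auth, p, m, batch_id)) { /* error */ }
```

Files already on the server are associated with the batch by query_files, so only the absent ones need to be passed to upload_files.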

If you use this system, periodically run the script html/ops/delete_job_files. This will delete files that are no longer associated with an active batch.

Per-user file sandbox

This mechanism allows job submitters to explicitly upload files via a web interface: PROJECT_URL/sandbox.php.

Links to the files are stored in a "sandbox directory", PROJECT_ROOT/sandbox/USERID/. Each entry in this directory has the contents:

size MD5

The actual files are stored in the download directory, under the name sb_userid_MD5.
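
As a rough illustration of how a link entry maps to a file in the download directory, the sketch below parses the "size MD5" contents and builds the sb_userid_MD5 name. The type and function names are hypothetical, and it assumes that the two fields are separated by whitespace and that the user ID is written in decimal.

```cpp
#include <sstream>
#include <string>

// Hypothetical representation of the contents of one sandbox link entry.
struct SandboxLink {
    long long size = 0;   // file size as recorded in the entry
    std::string md5;      // MD5 hash of the file contents
};

// Parse "size MD5" (assumed: fields separated by whitespace).
// Returns false if the text does not have that shape.
bool parse_sandbox_link(const std::string& text, SandboxLink& link) {
    std::istringstream in(text);
    return static_cast<bool>(in >> link.size >> link.md5);
}

// Build the download-directory name, following the sb_userid_MD5 pattern.
std::string sandbox_physical_name(int user_id, const std::string& md5) {
    return "sb_" + std::to_string(user_id) + "_" + md5;
}
```

Because the name includes the MD5, two uploads with the same user-visible name but different contents map to different files in the download directory, which preserves file immutability.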

Currently, files in the sandbox are not cleaned up automatically. The web interface allows users to delete their files.