Load sharing recalls between multiple volume groups

Abstract:

It will be shown how a DMF administrator can predict which tapes will
be required by a volume group for future recalls and in what order.

This capability will then be used to distribute recalls across
multiple volume groups, thereby increasing the utilisation of the
site's tape drives.


Overview:

- A quick review of the relationship between Volume Groups (VGs)
  and Drive Groups (DGs).

- How the VG recalls files from tape.

- How this is used to show the order in which tapes will be requested
  in order to satisfy currently queued recall requests.

- How both of these can be used together to transfer recall requests
  from a heavily loaded VG to another with idle tape resources.


Diagram from the SGI DMF Admin Guide, showing the hierarchical nature of the
Library Server (managing a library), Drive Groups (managing a pool of
fungible drives) and Volume Groups (managing a pool of media holding
at most one copy of a user file).

NB: Inaccurate for 4.1


How the volume group (VG) uses tapes for recalls:

- Files are recalled from tape in units of one or more "chunks".
  Chunks do not span tape volumes.

- When the VG receives a recall request for a file, it finds out the
  tape(s) on which the chunk(s) reside.  Multiple chunks are processed
  in parallel with the tapes for each chunk being read concurrently
  if possible.

- If a chunk is on a tape currently being used for other recalls:

  + its details are passed to the dmatrc process owning the tape,
    which adds it to the list of other queued chunks for that same tape.

- If a chunk is on a tape not currently mounted for recalls:

  + it's queued inside the VG, ordered by time.

  + When the VG gets permission to mount a tape, which might be
    immediately, it chooses the tape required by the oldest chunk
    requested.

  + After the new dmatrc process has mounted the tape, it is passed
    details of all chunks to be recalled from that one tape,
    irrespective of their age.

- When the final chunks for a file are read, the daemon is notified
  of completion and entries are deleted from queues, even though the
  tape is still in use for other requests.  (Simplified; ignoring some
  aspects of partial-state files.)

    (There may sometimes be an impression that some recalls are "queue
    jumping" because they are not being processed in the order in
    which they arrived.  This is a result of the above optimisations.)

- If dmatrc has a problem mounting or reading the tape:

  + it informs the VG of the chunks that it couldn't process.

  + the VG passes details of the files containing those chunks on to the
    DMF daemon.

  + The daemon then reissues the file recall request to another VG
    if possible - the "secondary" VG for the file.  (More glossing
    over of partial-state files.)
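
The queueing rules above can be illustrated with a toy model.  This is
only a sketch under stated assumptions, not DMF code: the class name,
method names and tape labels are hypothetical, and partial-state files,
mount failures and the daemon are all ignored.

```python
from collections import defaultdict


class VolumeGroupQueue:
    """Toy model of how a VG queues chunks for recall (hypothetical names)."""

    def __init__(self):
        # tape -> chunks already handed to the reader process for that tape
        self.mounted = defaultdict(list)
        # (arrival_time, tape, chunk) for tapes not yet mounted for recalls
        self.waiting = []

    def request(self, arrival_time, tape, chunk):
        if tape in self.mounted:
            # Tape already in use for recalls: pass the chunk to its reader.
            self.mounted[tape].append(chunk)
        else:
            # Otherwise queue it inside the VG, remembering when it arrived.
            self.waiting.append((arrival_time, tape, chunk))

    def mount_next(self):
        """Called when the VG gets permission to mount a tape."""
        if not self.waiting:
            return None
        # Choose the tape required by the oldest waiting chunk ...
        _, tape, _ = min(self.waiting, key=lambda entry: entry[0])
        # ... then hand over every chunk on that tape, whatever its age.
        chunks = [c for (_, t, c) in self.waiting if t == tape]
        self.waiting = [e for e in self.waiting if e[1] != tape]
        self.mounted[tape].extend(chunks)
        return tape, chunks


vg = VolumeGroupQueue()
vg.request(1, "T00001", "a1")
vg.request(2, "T00002", "b1")
vg.request(3, "T00001", "a2")   # newer than b1, but shares a tape with a1
first = vg.mount_next()
vg.request(4, "T00001", "a3")   # T00001 is now mounted: goes straight to it
second = vg.mount_next()
```

In this run chunk a2 is read before the older chunk b1 because it
shares a tape with a1, which is the "queue jumping" effect mentioned
above.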


dmorder:

The main purpose of this script is to allow you to answer queries like:
    "How long do I have to wait?"
    "Who's doing the implicit recalls?"
    "Who's got the most recalls queued?"
    "Who's been waiting longest?"


See sample human-viewable dmorder output:


The data needed to predict future tape usage (for recalls and moves)
comes from two places:

- a slightly modified dmstat for

  + the DMF daemon's request queue.

  + the identification of the tape(s) required for each request.

  + the list of currently mounted/mounting tapes.

- dmvoladm for the list of tapes with the HLOCK flag set (optional)


dmorder - logic flow

dmorder groups recall/move requests by the tape(s) they will require
and lists these tapes ordered by the age of the oldest request
requiring them.

Tapes which are currently mounted or mounting are shown ahead of the
others, as they are in active use.

The VG normally follows this order, but there is no guarantee.  From time
to time, for reasons which are not externally visible, it will mount
a tape out of order.

Another anomaly occurs when a file is to be recalled from a tape
which is currently being used for migrations.  This results in the
recall blocking until the VG has finished writing to the tape, which
with high capacity tapes can take hours.
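
The grouping and ordering that dmorder performs can be sketched as
follows.  This is a minimal illustration, not the script itself: the
function name and the (request_time, tape) input format are
assumptions, and the real data would come from dmstat rather than a
hand-written list.  X00001 is a made-up tape label.

```python
def order_tapes(requests, mounted):
    """Predict the order in which tapes will be wanted.

    requests: iterable of (request_time, tape) pairs, one per queued request
    mounted:  set of tapes that are currently mounted or mounting
    Returns a list of (tape, oldest_request_time, request_count).
    """
    oldest = {}
    count = {}
    for when, tape in requests:
        oldest[tape] = min(when, oldest.get(tape, when))
        count[tape] = count.get(tape, 0) + 1
    # False sorts before True, so mounted/mounting tapes are listed first;
    # within each group, tapes are ordered by their oldest request.
    ranked = sorted(oldest, key=lambda t: (t not in mounted, oldest[t]))
    return [(t, oldest[t], count[t]) for t in ranked]


queue = [
    (100, "C57054"),
    (105, "R30643"),
    (108, "X00001"),
    (110, "C57054"),
]
prediction = order_tapes(queue, mounted={"X00001"})
for tape, since, n in prediction:
    print(f"{tape}  oldest request at t={since}  queued requests={n}")
```

As noted above, this is only a prediction: the VG may mount tapes in
a different order.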


run_load_level:

Like run_merge_mgr on which it is based, run_load_level is able to lie
in wait until it decides that there are unused tape drives available
which it can appropriate.

When that happens, it uses dmorder to find out the next few tapes likely
to be used by the target VG (VG sec in this example, with tapes C57054
and R30643) and sets their HLOCK flag using dmvoladm.

Then it waits for the VG to attempt to mount them.  If the VG does so,
the mount fails because of the HLOCK.

In that case, the VG throws the recalls back to the daemon, which tosses
them over to the other VG holding the secondary copies (VG te2 here).

Either way, after a little while (115 seconds for us), run_load_level
clears the HLOCKs and is ready for the next data from dmorder.

Repeat

If the VG didn't attempt to mount them in that time, it doesn't matter.
Maybe on the next cycle.  Or maybe other tapes will be chosen by then.
Or maybe there will be no spare drives by then.
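
One pass of this cycle might look like the sketch below.  It is an
illustration under assumptions, not the actual run_load_level script:
the function and parameter names are hypothetical, and the four
callables stand in for whatever the site uses to count spare drives,
read dmorder's prediction, and set or clear HLOCK via dmvoladm.  No
real command syntax is shown.  The limit of two tapes per pass is
also an assumption; X00001 is a made-up tape label.

```python
import time


def load_level_cycle(spare_drives, next_tapes, set_hlock, clear_hlock,
                     hold_seconds=115, max_tapes=2, sleep=time.sleep):
    """Run one lock/wait/unlock pass and return the tapes that were locked."""
    if spare_drives() <= 0:
        return []                      # nothing idle to appropriate; back off
    locked = []
    try:
        for vsn in list(next_tapes())[:max_tapes]:
            set_hlock(vsn)             # a mount attempt by the VG will now fail
            locked.append(vsn)
        sleep(hold_seconds)            # give the VG time to try the mounts
    finally:
        for vsn in locked:
            clear_hlock(vsn)           # always release the locks, even on error
    return locked


# Demonstration with stand-in callables that only record what happened.
events = []
locked = load_level_cycle(
    spare_drives=lambda: 3,
    next_tapes=lambda: ["C57054", "R30643", "X00001"],
    set_hlock=lambda vsn: events.append(("lock", vsn)),
    clear_hlock=lambda vsn: events.append(("unlock", vsn)),
    sleep=lambda seconds: events.append(("wait", seconds)),
)
```

Clearing the locks in a "finally" clause is a design choice of this
sketch, so that an interrupted pass does not leave tapes locked.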


Results

If the secondary VG's DG has 4 drives available, this script delivers
the equivalent of about 3 extra drives to the recall process.

When the rightful workload in that DG increases, run_load_level
backs off.


Requirements:

- All files migrated to the targeted VG must have second copies

- The tapes holding the secondary copies must be in the silo

- The two VGs concerned should be in different DGs or there's no point

- You'll have to modify run_scan_logs to grep out all the extra error
  messages

- You can only aim it at VGs which contain only primary copies.

  That is, if you have three VGs called A, B & C, and files go to either
  A & B or B & C depending on, say, file space, then you could point
  run_load_level at A, but not B.  But if the pairs were A & C or B &
  C, then both VGs A and B could be targeted.
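
The last rule can be checked mechanically.  The sketch below assumes
that each pair is written as (primary VG, secondary VG), which is an
interpretation of the A, B & C example above; the function name is
hypothetical.

```python
def targetable_vgs(copy_pairs):
    """Return the VGs that hold only primary copies.

    copy_pairs: iterable of (primary_vg, secondary_vg), one per migration
    policy in use at the site (assumed ordering: primary first).
    """
    pairs = list(copy_pairs)
    primaries = {primary for primary, _ in pairs}
    secondaries = {secondary for _, secondary in pairs}
    # A VG that also receives secondary copies must not be targeted.
    return sorted(primaries - secondaries)


case_1 = targetable_vgs([("A", "B"), ("B", "C")])
case_2 = targetable_vgs([("A", "C"), ("B", "C")])
print(case_1, case_2)
```

With the first set of pairs only A qualifies, because B also holds
secondary copies; with the second set both A and B qualify.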


Deficiencies:

- Sometimes it'll guess wrong; this does no harm though.

- Because it relies on dmstat which uses the Resource Watcher for
  almost all of its data, it will be unaware of non-DMF tape usage.
  It will think the drives are more idle than they really are, which
  ruins the point of watching for idle drives.

- dmorder shows details of the owners of the files being recalled,
  which it gets from the passwd file.  No attempt has been made to
  add LDAP or NIS support.

- Untested for multiple target VGs or with OpenVault.


Conclusion:

dmorder provides a useful tool to detect unusual patterns in users'
recall activity, and to answer some common queries from the users.

run_load_level allows us to harness scarce drive resources which would
otherwise lie idle during office hours.


Copies of this presentation and related files can be found at
http://hpc.csiro.au/users/dmfug/Presentations_Oct09/load_sharing/