Load sharing recalls between multiple volume groups

Abstract:
It will be shown how a DMF administrator can predict which tapes will be required by a volume group for future recalls, and in what order. This capability will then be used to distribute recalls across multiple volume groups, thereby increasing the utilisation of the site's tape drives.

Overview:
- A quick review of the relationship between Volume Groups (VGs) and Drive Groups (DGs).
- How the VG recalls files from tape.
- How this is used to show the order in which tapes will be requested in order to satisfy currently queued recall requests.
- How both of these can be used together to transfer recall requests from a heavily loaded VG to another with idle tape resources.

Diagram from the SGI DMF Admin Guide, showing the hierarchical nature of the Library Server (managing a library), Drive Groups (managing a pool of fungible drives) and Volume Groups (managing a pool of media holding at most one copy of a user file). NB: Inaccurate for 4.1

How the volume group (VG) uses tapes for recalls:
- Files are recalled from tape in units of one or more "chunks". Chunks do not span tape volumes.
- When the VG receives a recall request for a file, it finds out the tape(s) on which the chunk(s) reside. Multiple chunks are processed in parallel, with the tapes for each chunk being read concurrently if possible.
- If a chunk is on a tape currently being used for other recalls:
    + its details are passed to the dmatrc process owning the tape, which adds it to the list of other queued chunks for that same tape.
- If a chunk is on a tape not currently mounted for recalls:
    + it is queued inside the VG, ordered by time.
    + When the VG gets permission to mount a tape, which might be immediately, it chooses the tape required by the oldest chunk requested.
    + After the new dmatrc process has mounted the tape, it is passed details of all chunks to be recalled from that one tape, irrespective of their age.
- When the final chunks for a file are read, the daemon is notified of completion and entries are deleted from queues, even though the tape is still in use for other requests. (Simplified; ignoring some aspects of partial-state files.)
  (There may sometimes be an impression that some recalls are "queue jumping" because they are not being processed in the order in which they arrived. This is a result of the above optimisations.)
- If dmatrc has a problem mounting or reading the tape:
    + it informs the VG of the chunks that it couldn't process.
    + the VG passes details of the files containing those chunks on to the DMF daemon.
    + The daemon then reissues the file recall request to another VG if possible - the "secondary" VG for the file. (More glossing over of partial-state files.)

dmorder:
The main purpose of this script is to allow you to answer queries like:
    "How long do I have to wait?"
    "Who's doing the implicit recalls?"
    "Who's got the most recalls queued?"
    "Who's been waiting longest?"
See sample human-viewable dmorder output:

The data needed to predict future tape usage (for recalls and moves) comes from two places:
- a slightly modified dmstat for
    + the DMF daemon's request queue.
    + the identification of the tape(s) required for each request.
    + the list of currently mounted/mounting tapes.
- dmvoladm for the list of tapes with the HLOCK flag set (optional)

dmorder - logic flow:
dmorder groups recall/move requests by the tape(s) they will require and lists these tapes ordered by the age of the oldest request requiring them. Tapes which are currently mounted or mounting are shown ahead of the others, as they are in active use.
The VG normally follows this order, but there is no guarantee. From time to time, for reasons which are not externally visible, it will mount a tape out of order.
Another anomaly occurs when a file is to be recalled from a tape which is currently being used for migrations.
This results in the recall blocking until the VG has finished writing to the tape, which with high capacity tapes can take hours.

run_load_level:
Like run_merge_mgr, on which it is based, run_load_level is able to lie in wait until it decides that there are unused tape drives available which it can appropriate.
When that happens, it uses dmorder to find out the next few tapes likely to be used by (in the example) VG sec - in this case C57054 and R30643 - and sets their HLOCK flag using dmvoladm.
Then it waits for the VG to attempt to mount them; any such attempt fails. When it does, the VG throws the recalls back to the daemon, which tosses them over to the other VG with the secondary copies (VG te2 here).
Either way, after a little while (115 seconds for us), run_load_level clears the HLOCKs and is ready for the next data from dmorder. Repeat.
If the VG didn't attempt to mount them in that time, it doesn't matter. Maybe on the next cycle. Or maybe other tapes will be chosen by then. Or maybe there will be no spare drives by then.

Results:
If the secondary VG's DG has 4 drives available, this script delivers the equivalent of about 3 extra drives to the recall process. When the rightful workload in that DG increases, run_load_level backs off.

Requirements:
- All files migrated to the targeted VG must have second copies.
- The tapes holding the secondary copies must be in the silo.
- The two VGs concerned should be in different DGs, or there's no point.
- You'll have to modify run_scan_logs to grep out all the extra error messages.
- You can only aim it at VGs which contain only primary copies. That is, if you have three VGs called A, B & C, and files go to either A & B or B & C depending on, say, file space, then you could point run_load_level at A, but not B. But if the pairs were A & C or B & C, then both VGs A and B could be targeted.

Deficiencies:
- Sometimes it'll guess wrong; this does no harm though.
- Because it relies on dmstat, which uses the Resource Watcher for almost all of its data, it will be unaware of non-DMF tape usage. It will think the drives are more idle than they really are, which ruins the point of watching for idle drives.
- dmorder shows details of the owners of the files being recalled, which it gets from the passwd file. No attempt has been made to add LDAP or NIS support.
- Untested for multiple target VGs or with OpenVault.

Conclusion:
dmorder provides a useful tool to detect unusual patterns in users' recall activity, and to answer some common queries from the users.
run_load_level allows us to harness scarce drive resources which would otherwise lie idle during office hours.

Copies of this presentation and related files can be found at http://hpc.csiro.au/users/dmfug/Presentations_Oct09/load_sharing/
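
Illustration of the dmorder logic flow:
The ordering rule described under "dmorder - logic flow" (group requests by tape, list mounted/mounting tapes first, then order by the oldest request needing each tape) can be sketched in a few lines of Python. This is only an illustration of the rule, not the actual dmorder script: the Request record, the function names and the tape labels T00001 and M00002 are hypothetical, and the real script takes its input from the modified dmstat rather than from a hand-built list.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Request:
    """One queued recall/move request (hypothetical record layout)."""
    req_id: int
    owner: str
    queued_at: float   # time the request was queued, seconds since the epoch
    tapes: tuple       # labels of the tape(s) the request needs


def order_tapes(requests, mounted):
    """Return [(tape, oldest_queued_at, requests)] in predicted mount order."""
    # Group the requests by every tape they will require.
    by_tape = defaultdict(list)
    for req in requests:
        for tape in req.tapes:
            by_tape[tape].append(req)

    rows = []
    for tape, reqs in by_tape.items():
        reqs.sort(key=lambda r: r.queued_at)
        rows.append((tape, reqs[0].queued_at, reqs))

    # Mounted/mounting tapes first (False sorts before True),
    # then by the age of the oldest request needing the tape.
    rows.sort(key=lambda row: (row[0] not in mounted, row[1]))
    return rows


def next_unmounted(requests, mounted, count):
    """The next few tapes not yet mounted, e.g. candidates for HLOCK."""
    ordered = order_tapes(requests, mounted)
    return [tape for tape, _, _ in ordered if tape not in mounted][:count]


# Made-up queue: four requests, one tape already mounted.
requests = [
    Request(1, "alice", 100.0, ("C57054",)),
    Request(2, "bob", 50.0, ("R30643",)),
    Request(3, "carol", 75.0, ("C57054", "T00001")),
    Request(4, "dave", 10.0, ("M00002",)),
]
mounted = {"T00001"}

for tape, oldest, reqs in order_tapes(requests, mounted):
    owners = ", ".join(r.owner for r in reqs)
    print(f"{tape}  oldest={oldest:.0f}  requests={len(reqs)}  owners={owners}")
```

A load leveller in the style of run_load_level would take the first few entries from next_unmounted() as its guess at the tapes the VG will ask for next. As noted above, the VG gives no guarantee that it will follow this order.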