Notes on using Diagnostic Test Tools
with data in the LIGO data archive

Peter Shawhan, Daniel Sigg
January 15, 2004
Updated February 21, 2005

Background (instructions may be found farther down this page)

The main LIGO data archive at Caltech uses a robotic tape handling system to keep a large amount of data accessible for analysis. The SAM-QFS software used to run the archive makes all of the data appear to be on a disk filesystem mounted on certain computers (e.g. alterf), so that it can be accessed directly, typically with a delay of a minute or less (depending on tape drive usage and on whether the relevant data is already cached on disk). The filesystem name at Caltech is /archive. The observatory sites have smaller robotic systems which provide similar access to moderately large amounts of data from computers such as fortress, decatur, touro, and river. Each of these machines sees the /archive filesystem for the local site.

Versions 2.3 and later of the Diagnostic Test Tools (LIGOtools package "dtt") can read data directly from these filesystems, since they look like ordinary disk filesystems as far as DTT is concerned. In particular, these versions know how to deal with the naming convention of directories in the hierarchy containing different epochs of data, which previous versions of DTT did not know how to handle. However, the usual method of setting the DTT Input to an appropriate node of the directory tree (described here) is much too slow to use with the archive or, in general, for datasets containing more than ~1 day of data. For multi-month science runs, it can take many hours to start up a DTT session in this fashion. The main reason is that DTT scans at least one out of every thousand files to check for changes in the channel list, and each such scan typically requires a tape mount/read cycle.
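
To give a feel for the scale of the problem, here is a rough back-of-the-envelope estimate in Python. The one-in-a-thousand scan rate and the roughly one-minute tape access come from the text above. The 16-second file length is purely an illustrative assumption, not a documented property of the archive, and the function name is hypothetical.

```python
def estimated_startup_hours(days_of_data, seconds_per_file,
                            scan_every=1000, minutes_per_scan=1.0):
    """Rough time DTT would spend scanning files at startup.

    seconds_per_file is an ASSUMPTION supplied by the caller; the notes
    do not state how much data each archive file holds.
    """
    n_files = days_of_data * 86400 / seconds_per_file
    n_scans = n_files / scan_every          # at least one file in every thousand
    return n_scans * minutes_per_scan / 60  # one tape mount/read cycle per scan


# Hypothetical 60-day run stored in 16-second files
print(estimated_startup_hours(60, 16))
```

Under these assumptions, a 60-day run works out to 324 scans, or about 5.4 hours, before the session is usable.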

Fortunately, DTT provides a way to get time intervals and channel information about a dataset from index files, avoiding the need to scan the data. With the proper index files, even large datasets can be accessed with minimal overhead. Version 2.3a of the LIGOtools "dtt" package, when run on a machine which has direct access to a local data archive, has been set up to automatically present the user with a list of datasets in the archive, and has the proper index files to access these datasets efficiently. These index files are currently maintained by hourly cron jobs running under the 'pshawhan' account on alterf, fortress, decatur, and river. If some desired data in the archive does not seem to be available through this mechanism, please contact Peter.

General instructions

To analyze data in the LIGO data archive with one of the DTT modes (i.e. Fourier Tools or Triggered Time Series Measurement), do the following:
  1. Log in to a machine which has access to the archive via SAM-QFS or NFS. Currently, these machines are alterf at CIT; fortress at LHO; and decatur, touro, and river at LLO. Others may be added in the future. The software is written in such a way that the list of datasets should appear on any machine which has access to the archive, without explicit human setup for each machine.

  2. Type dtt to bring up the DTT main menu window.

  3. Click on the "Triggered Time Series Measurement" button if you want to view the raw data, or the "Fourier Tools" button if you want to calculate power spectra, etc. A new window will be created, though it may take several seconds to appear.

  4. In the "Input" panel, make sure that the Data Source Selection is set to "LiDaX", and make sure that the LiDaX Data Source Server is set to "Local file system".

  5. Click on the triangle to the right of the "UDN:" field. This should bring up a list of datasets. (If the list only contains a single item, "--new--", then the software does not think that this computer has access to the archive. If this occurs but you think that the /archive filesystem is in fact mounted, contact Peter for assistance.) The datasets have long names, starting with the LIGOtools directory path, but the interesting part should be fairly self-evident at the end of the name. For instance, the dataset "/ligoapps/ligotools/config/dtt/S3.L1.LHO.udn" is Level-1 RDS data from Hanford from the S3 science run. Select one of these datasets from the list.

  6. Now you can click on the "Measurement" tab to bring up that panel. It should appear within a few seconds.

  7. Select the Measurement Channels using the pull-down menus. Also remember to enable them using the checkboxes to the left of the channel names.

  8. If you are using the "Triggered Time Series Measurement" function, you can apply a filter to the channels being processed. See this note for instructions.

  9. Specify the start time. To see what time intervals are available in this dataset, click on the "Lookup..." button to bring up a "Time Selection" window. For a long dataset, there may be many segments separated by gaps. (Note that the segment numbers in this list just indicate contiguous blocks of available frame data, and are unrelated to locked segments or science data segments.) If you want, you can click on a segment to select it, then click on the "Set start" button, then click on "Ok". This updates both the UTC and GPS time fields back in the main window. You can also enter or modify times manually, but be sure the matching time format (UTC or GPS) is selected with the radio buttons; otherwise the time used by the program may not be what you intended.

  10. Set other parameters as appropriate, and click on the "Start" button at the bottom of the screen. Reading the data might take a few minutes or more, even for a small amount of data, if it has to be retrieved from tape. If you get the status message "Test timed-out", this may mean that DTT thought it took too long to get the data; in this case, retrying may work. On the other hand, this message may mean that not all of the requested data is available in the archive.

  11. Click on the "Result" tab at the top of the screen to see your results.
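
Step 9 warns that a time entered in the wrong format will be misread. For checking a GPS time against a UTC time by hand, a minimal Python sketch follows. It assumes a fixed GPS-minus-UTC offset of 13 seconds, which holds only for dates from 1 January 1999 through 31 December 2005 (a span that includes the period in which these notes were written). Other dates need a different offset. The function names are illustrative and are not part of DTT.

```python
from datetime import datetime, timedelta, timezone

# GPS time counts seconds from 1980-01-06 00:00:00 UTC and ignores leap seconds.
GPS_EPOCH = datetime(1980, 1, 6, tzinfo=timezone.utc)

# ASSUMPTION: valid only for 1999-01-01 through 2005-12-31.
GPS_MINUS_UTC = 13


def gps_to_utc(gps_seconds, offset=GPS_MINUS_UTC):
    """Convert GPS seconds to a timezone-aware UTC datetime."""
    return GPS_EPOCH + timedelta(seconds=gps_seconds - offset)


def utc_to_gps(utc, offset=GPS_MINUS_UTC):
    """Convert a timezone-aware UTC datetime to integer GPS seconds."""
    return int((utc - GPS_EPOCH).total_seconds()) + offset


print(gps_to_utc(751651213))  # 2003-10-31 16:00:00+00:00
print(utc_to_gps(datetime(2003, 10, 31, 16, 0, 0, tzinfo=timezone.utc)))  # 751651213
```

If the value typed into the GPS field and the value typed into the UTC field do not agree under a conversion like this, the wrong radio button is probably selected.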

Reading from multiple files/directories

DTT can read from multiple data sources simultaneously, e.g. LHO data and LLO data from the same time period. You set this up in the "Input" panel as follows:
  1. Change the LiDaX Data Source from "single" to "multiple".

  2. Click on "Add...". This brings up a "Server Selection" window.

  3. Click on "Local file system", then click on the ">>" button once for each dataset that you want to read from. This builds a list on the right side of the window. Once the appropriate number of data sources has been added to the list, click on the "Ok" button.

  4. Back in the main window, the LiDaX Data Source Server should now be set to the first data source, i.e. "Local file system (0)". Click on the triangle to the right of the "UDN:" field and select the first dataset you want from the pull-down menu.

  5. Switch to the other data source (i.e. "Local file system (1)") using the pull-down menu, then select its dataset from the "UDN:" pull-down menu as described above.

  6. Once all data sources have been specified, click on the "Measurement" tab. The list of channels available for processing is the union of the channel lists from all of the data sources. Note, however, that the list of time intervals displayed when you click on "Lookup..." appears to be just the list from the last dataset, and does not reflect any gaps which may exist in the other dataset(s).
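
Because the "Lookup..." list may reflect only one dataset, it can help to work out by hand which times are present in every source. A minimal Python sketch follows. It assumes you have noted the segment start and stop times (as GPS seconds) from the "Time Selection" window for each dataset separately, for example by selecting each dataset on its own in "single" mode. The function name and the example numbers are made up for illustration.

```python
def intersect_segments(a, b):
    """Return the overlap of two sorted lists of (start, stop) intervals."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        start = max(a[i][0], b[j][0])
        stop = min(a[i][1], b[j][1])
        if start < stop:
            result.append((start, stop))
        # Advance whichever interval ends first
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return result


# Hypothetical segment lists for two datasets
lho = [(0, 100), (200, 300)]
llo = [(50, 250)]
print(intersect_segments(lho, llo))  # [(50, 100), (200, 250)]
```

A start time chosen inside one of the resulting intervals has data in both sources. For more than two sources, apply the function repeatedly to the running result.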

Technical Notes

When you type "dtt", you are actually running a script which sets some environment variables and then executes the mainmenu program. In version 2.3a and later of the LIGOtools "dtt" package, the dtt script calls another script, $LIGOTOOLS/config/dtt/startup_environment, which sets the UDNLIST environment variable to make the default list of datasets available. The startup_environment script does this only if the machine on which it is running has access to the archive, and if the UDNLIST environment variable is not already set.
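
The decision logic described above can be sketched in Python as follows. This is not the actual startup_environment script. The location of the ".udn" files is taken from the dataset names shown earlier, but the ':' separator used to join them into UDNLIST, the test for archive access, and the function name are all assumptions made only for illustration.

```python
import glob
import os


def default_udnlist(environ, ligotools, archive_root="/archive", separator=":"):
    """Return the value UDNLIST should have after startup, or None.

    ASSUMPTIONS: archive access is tested by checking that archive_root
    is a directory, and dataset names are joined with `separator`.
    """
    if "UDNLIST" in environ:
        return environ["UDNLIST"]   # never override the user's own setting
    if not os.path.isdir(archive_root):
        return None                 # this machine cannot see the archive
    pattern = os.path.join(ligotools, "config", "dtt", "*.udn")
    udns = sorted(glob.glob(pattern))
    return separator.join(udns) if udns else None


# Example call; the fallback path is the one seen in the dataset names above
value = default_udnlist(os.environ, os.environ.get("LIGOTOOLS", "/ligoapps/ligotools"))
print(value)
```

One practical consequence of the "not already set" rule is that you can set UDNLIST yourself before typing dtt to substitute your own list of datasets.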

Maintaining the list of available datasets, and the index files for each dataset, is a time-consuming operation. On each cluster which has at least one machine that can see the archive, the index files are generated with a script called make_default_udns in the $LIGOTOOLS/config/dtt directory. Cron jobs have been set up to regenerate these lists hourly, but they currently only look for updates to the S4 files; Peter will have to remember to update the cron jobs for each new science/engineering run.