Working with a large amount of database data

Peter Shawhan
February 26, 2002

LDAS imposes a limit on the number of rows that can be returned from a database query. The limit is currently set to 10000, and it cannot be raised arbitrarily due to memory limitations. This is a serious bottleneck if you want to work with millions of database entries. One workaround would be to submit a series of queries, customized to subdivide the dataset into blocks of fewer than 10000 rows, but this would be tedious to do by hand.

To get around this limitation, the getMeta utility (FAQ article about guild and getMeta) takes an '-a' option which instructs it to retrieve all database entries matching the query. getMeta achieves this by submitting the query and retrieving the results into a temporary file, then scanning the file; if the file contains a number of rows which is an exact multiple of 1000, getMeta assumes that LDAS truncated the output. In this case, it constructs and submits a modified query which excludes the rows already received. getMeta repeats this process until all matching rows have been retrieved, then concatenates all the temporary files into a single big file. This can take a long time (the average retrieval rate is typically 100-200 rows per second), so you may wish to use the '-v' option, which causes getMeta to print a progress message each time it submits a query.

Guidelines for constructing queries

Since getMeta with the '-a' option can produce a rather large output file, it is a good idea to ask for only the columns that you are interested in, plus the event_id column (or some similar "unique ID"), which helps getMeta to modify the query to exclude rows that have already been received. Also, you must order the output based on one or more columns, typically start_time (or end_time, for the sngl_inspiral table). For example (with highlighting for emphasis):
  getMeta -a -v -d llo -q "select start_time,start_time_ns,amplitude,snr,event_id from sngl_burst where search='TFCLUSTERS' order by start_time,start_time_ns" -o llo_tfc.xml
It is also possible that you want more than 10000 rows, but not necessarily all the matching rows in the database. You can explicitly limit the number of rows, to 50000, for instance, by including the clause "fetch first 50000 rows only" at the end of the SQL query (after the "order by..." clause).

Reading a LIGO_LW file with a large number of rows

It is impractical to view a very large file (say, much larger than one megabyte) using guild. However, all other utilities for reading LIGO_LW files use the metaio parsing code to read the file sequentially, so they can handle an arbitrarily large file, typically reading it at the rate of several thousand rows per second.

The lwtprint command-line utility and the readMeta Matlab function allow you to specify a subset of rows to read from the file. For example, you can quickly print the first 10 rows of a big file using:

  lwtprint big_table.xml -r 1-10
and you can read the first 1000 rows into Matlab arrays using:
  dat = readMeta('big_table.xml',[1:1000]);