2008-12-19
Near-minutely raw data movement
A new program[0] now runs continuously, checking (currently) about once a minute for observations to process. If there are any, it feeds the database and then creates symbolic links for CADC's consumption. This should help avoid massive data transfers to CADC twice a day. Note that the previously involved programs will keep running concurrently until everybody involved is satisfied that raw data is being entered and/or moved as desired.
All this started yesterday, on a slightly wet, cloudy Hawaiian evening.
[0] enterdata-cadc-copy.pl is a wrapper around JSA::EnterData & JSA::CADC_Copy modules, which were respectively generated from jcmtenterdata.pl & cadcopy.pl.
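For the curious, the polling behaviour described above amounts to something like the following self-contained sketch. The directory names are hypothetical and the database step is only a placeholder; the real program wraps JSA::EnterData and JSA::CADC_Copy rather than doing any of this directly.

    #!/usr/bin/env perl
    # Minimal sketch of the near-minutely polling loop.  The directories are
    # hypothetical placeholders; the real program uses JSA::EnterData and
    # JSA::CADC_Copy instead of the inline steps shown here.
    use strict;
    use warnings;
    use File::Basename qw(basename);

    my $incoming = '/example/raw/incoming';   # hypothetical raw-data area
    my $cadc_dir = '/example/cadc/new';       # hypothetical CADC pick-up area
    my $interval = 60;                        # check roughly once a minute

    while (1) {
        my @new = grep { !-e "$cadc_dir/" . basename($_) }
                  glob("$incoming/*.sdf");

        for my $file (@new) {
            # Placeholder for the database-feeding step (JSA::EnterData).
            print "Would enter $file into the database\n";

            # Symbolic link for CADC's consumption (the JSA::CADC_Copy step).
            symlink $file, "$cadc_dir/" . basename($file)
                or warn "Could not link $file: $!";
        }

        sleep $interval;
    }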
2008-12-04
QA-enabled pipeline released in Hilo
This version will eventually be released to the summit, where it will give telescope operators feedback on which surveys are suitable to observe, along with enhancing the JCMT Science Archive pipeline.
2008-06-27
SCUBA-2 DR pipeline
2008-04-01
Initial results of "better" ORAC-DR reduction
- integrated intensity map, group coadd: summit / better
- integrated intensity map, single observation: summit / better
- intensity-weighted velocity map, group coadd: summit / better
A few notes:
- By "summit" pipeline I mean the pipeline currently running at the summit. This pipeline will be replaced by an "improved" pipeline pending JCMT support scientist approval. The "improved" pipeline will not be run at CADC, they will run the "better" pipeline that created the "better" images linked above.
- The "group" summit integrated intensity map is not generated by the pipeline, it's created by manually running wcsmosaic to mosaic together the individual baselined cubes (the _reduced CADC products), then collapsing over the entire frequency range. This is how the summit pipeline would create those files, though.
- Ditto for the group summit velocity map, except the pipeline wouldn't even create those in the first place, as it doesn't know which velocity ranges to collapse over to get a proper velocity map. This example is just done by naively collapsing over the entire velocity range. The "better" pipeline automatically finds these regions and creates velocity maps -- not only for the coadded group cube, but also for individual observation cubes.
- The difference between the "better" pipeline (which is what will be running at CADC) and the "improved" pipeline (which is what will be running at the summit) is very small for this given dataset.
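To make that concrete, the manual mosaic-and-collapse procedure amounts to roughly the following, assuming a Starlink environment. The file names and parameter choices are placeholders of mine, not the pipeline's settings.

    #!/usr/bin/env perl
    # Sketch of the manual mosaic-and-collapse steps.  Placeholders:
    # reduced_cubes.lis lists the per-observation _reduced cubes, and axis 3
    # is assumed to be the spectral axis.
    use strict;
    use warnings;

    my $kappa = $ENV{KAPPA_DIR} or die "KAPPA_DIR not set";

    sub run { system(@_) == 0 or die "Command failed: @_\n" }

    # Mosaic the individual baselined cubes (the _reduced CADC products).
    run("$kappa/wcsmosaic in=^reduced_cubes.lis out=group_cube method=nearest");

    # Integrated-intensity map: collapse over the entire frequency range.
    run("$kappa/collapse in=group_cube out=group_iimap axis=3 estimator=integ");

    # Naive velocity map: intensity-weighted coordinate over the whole range.
    run("$kappa/collapse in=group_cube out=group_velmap axis=3 estimator=iwc");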
2008-03-26
CUPID ClumpFind and backgrounds
However, Jenny noted that the clump sizes reported by CUPID were not the same as those reported by IDL. Both implementations use the RMS displacement of each pixel centre from the clump centroid as the clump size, where each pixel is weighted by the corresponding pixel data value, so in principle they should produce the same values. The difference turns out to be caused by the fact that CUPID removes a background level from each clump before using the pixel values to weight the displacements. IDL, on the other hand, uses the full pixel values without subtracting any background. Thus, increasing the background level under a clump will produce no change in the clump sizes reported by CUPID; IDL, however, will report larger clump sizes due to the greater relative emphasis put on the outer edges of the clump.
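In symbols (my notation, not taken from either package's documentation), the size along pixel axis i is a weighted RMS displacement from the data-weighted centroid, and the weighting is the only difference between the two implementations:

    % w_k = d_k - b  (CUPID: background level b removed from each clump first)
    % w_k = d_k      (IDL ClumpFind: full pixel value, no background subtraction)
    \[
      c_i = \frac{\sum_k w_k\, x_{k,i}}{\sum_k w_k},
      \qquad
      S_i = \sqrt{\frac{\sum_k w_k\,\bigl(x_{k,i} - c_i\bigr)^2}{\sum_k w_k}}
    \]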
So should a background be subtracted or not? Having the reported clump size depend on the background level seems an undesirable feature to me. But if you want to compare CUPID results with other systems (e.g. the IDL ClumpFind in this case) that do not subtract a background, then CUPID also needs to retain the background level to get a meaningful comparison. Consequently, I've added a parameter to CUPID:FINDCLUMPS to select whether or not to subtract the background before calculating clump sizes. The default is for the background to be subtracted unless CUPID is emulating the IDL algorithm (as indicated by the ClumpFind.IDLAlg configuration parameter).
If the background is retained in CUPID, Jenny found that the CUPID and IDL clump sizes match to within half a percent. So things look OK.
David
2008-02-29
Preliminary Wrapper script released
- retrieves data files from the supplied list
- converts them to NDF if required
- determines the correct ORAC-DR instrument name based on the data
- checks that PRODUCT information matches for all files
- determines whether to run ORAC-DR or PiCARD (a toy sketch of this decision follows the list)
- converts products back to FITS
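As an illustration of the PRODUCT check and the pipeline choice (not the script's actual code: the file names and the assumption that raw data carries a 'raw' product label are mine):

    #!/usr/bin/env perl
    # Toy version of the PRODUCT-consistency check and the ORAC-DR/PiCARD
    # decision.  Headers are faked in a hash; the real script reads them from
    # the data files, and the 'raw' label here is an assumption.
    use strict;
    use warnings;

    my %product_for = (              # hypothetical input files and products
        'a20080229_00012_01_0001.sdf' => 'cube',
        'a20080229_00012_01_0002.sdf' => 'cube',
    );

    # All files in one request must carry the same PRODUCT value.
    my %seen = map { $_ => 1 } values %product_for;
    die "PRODUCT mismatch: ", join(', ', sort keys %seen), "\n"
        if keys %seen > 1;

    my ($product) = keys %seen;

    # Raw data goes through ORAC-DR; existing products are fed to PiCARD.
    my $pipeline = $product eq 'raw' ? 'ORAC-DR' : 'PiCARD';
    print "Would run $pipeline on ", scalar keys %product_for, " file(s)\n";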
Known issues:
- Provenance is not quite correct: it is possible to refer to a parent that will not be archived.
- There is no standardised approach to logging standard output and standard error.
- dpCapture does not automatically copy products to the CADC transfer directory.
CADC data transfers now working again
2008-02-25
ORAC-DR: CADC+batch mode
- _cube files created. These are then forgotten about by ORAC-DR but are stored by CADC.
- Run the initial steps of Remo's script on the time-series data. This removes any gross time-series signal through collapsing and rudimentary linear baselining.
- Run MAKECUBE using every member observation of a Group, creating tiles.
- Run remainder of Remo's script on each tile, which uses a combination of smoothing and CUPID to create baseline region masks and to remove baselines.
- Take the baseline-region mask from the previous step, along with the original input time-series data, and run them through UNMAKECUBE. This creates time-series masks.
- Apply the time-series masks to the original time-series data.
- Run MFITTREND with a high-order polynomial (or spline, or whatever) on the masked time-series data. These cubes shouldn't have any signal and should be pure baseline.
- Subtract the baselines determined in the previous step from the original time-series data.
- Run MAKECUBE on the baselined time-series data for each observation to create the _reduced / _rimg / _rsp files.
- If necessary, WCSMOSAIC the _reduced files for each observation to create an "even better" group, which can then be used to determine a better mask, possibly iterating through the UNMAKECUBE-to-_reduced generation steps again (a rough sketch of the core of this sequence follows).
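Here is a rough sketch of the mask / baseline / subtract / regrid core of that sequence for a single observation, written as plain Starlink calls. The file names are placeholders, the parameter values are indicative only, and the real work is of course done inside ORAC-DR primitives rather than a standalone script.

    #!/usr/bin/env perl
    # Mask -> baseline -> subtract -> regrid, for one observation.
    # Assumes SMURF_DIR and KAPPA_DIR point at a Starlink installation and
    # that axis 1 of the raw time series is the spectral axis.
    use strict;
    use warnings;

    my $smurf = $ENV{SMURF_DIR} or die "SMURF_DIR not set";
    my $kappa = $ENV{KAPPA_DIR} or die "KAPPA_DIR not set";

    sub run { system(@_) == 0 or die "Command failed: @_\n" }

    # Project the group baseline-region mask back into the time-series domain.
    run("$smurf/unmakecube in=group_blmask ref=raw_ts out=raw_tsmask");

    # Apply the time-series mask to the original time-series data.
    run("$kappa/copybad in=raw_ts ref=raw_tsmask out=masked_ts");

    # Fit a high-order baseline to the (now signal-free) masked data ...
    run("$kappa/mfittrend in=masked_ts out=baseline_ts axis=1 order=5 subtract=false");

    # ... and subtract that baseline from the original time series.
    run("$kappa/sub in1=raw_ts in2=baseline_ts out=bl_ts");

    # Regrid the baselined time series into this observation's _reduced cube.
    run("$smurf/makecube in=bl_ts out=obs_reduced system=tracking autogrid");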
The Wrapper
The wrapper is on TimJ's to-do list, and so is at the mercy of his higher priority SCUBA-2 work. In an attempt to push something out to CADC before working on the SCUBA-2 translator, he is writing a prototype with the following functionality:
- has a stub dpRetrieve, emulating the system that will eventually fetch the data needed from the CADC database
- examines the data to determine whether it is raw or already a product
- converts any FITS files to NDF
- runs ORAC-DR or PiCARD as appropriate given the above information
- converts any NDF products back to CADC-compliant FITS
- calls a stub dpCapture (the real dpCapture imports any products into the CADC system)
The main problem that stops this from being more than a prototype is the provenance system. In our NDF-based systems provenance is a time series: file A turns into file B, which turns into file C, ... eventually resulting in file E, the final product. So the provenance looks like this: A, B, C, D, E. In the CADC system, provenance is the nearest parent existing in the archive. So, if only A, B, D and E exist in the database (because C happens to be an intermediate file of no lasting importance), the provenance for E is D, but the provenance for D is B. Therefore, the wrapper has to make sure that at the end of any processing the provenance is fixed up to list only parents that exist in the CADC archive.
The intended solution for this is for DavidB to commit some NDG patches to allow TimJ/the wrapper to remove C (in the previous example) from the provenance. Also, the wrapper needs to rename A, B, D and E to the CADC naming convention so they can be matched to entries in the archive.
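To make the pruning concrete, here is a toy illustration of the "nearest archived parent" rule using the A-to-E example above; the real fix will use NDG's provenance handling, not anything like this.

    #!/usr/bin/env perl
    # Toy "nearest archived parent" lookup for the A..E example above.
    use strict;
    use warnings;

    my %parent_of = (B => 'A', C => 'B', D => 'C', E => 'D');   # A has no parent
    my %archived  = map { $_ => 1 } qw(A B D E);                # C is never kept

    sub cadc_parent {
        my ($file) = @_;
        my $p = $parent_of{$file};
        # Walk up the chain until we hit something that exists in the archive.
        $p = $parent_of{$p} while defined $p && !$archived{$p};
        return $p;
    }

    printf "%s -> %s\n", $_, cadc_parent($_) // 'none' for qw(E D B);
    # Prints: E -> D, D -> B, B -> A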
So as not to hold other parts of the project up, the intent is for the prototype to be delivered to CADC in the next few days without this provenance-related functionality, and to come back and fix this when the SCUBA-2 translator work allows.
2008-02-13
OMP to CADC connection is down
2008-02-11
specx2ndf now creates variance
2008-02-04
Processing 3D cubes with FFCLEAN
FFCLEAN has been extended so that it can now:
1) process 3D cubes, either as a set of independent 1D spectra or as a set of independent 2D images (see the new parameter AXES)
2) store the calculated noise level in the output variance array (see the new parameter GENVAR)
This was motivated by my experiments with the new smurf:unmakecube command as a means of getting an estimate of the noise level in each residual spectrum.
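A possible invocation on a residual cube, treating each spectrum independently and writing the noise estimate into the variance component. The exact AXES value needed to select per-spectrum processing (axis 3 assumed spectral here) and the other parameter values are my guesses, so check the FFCLEAN documentation.

    #!/usr/bin/env perl
    # Hypothetical FFCLEAN call using the new AXES and GENVAR parameters.
    use strict;
    use warnings;

    my $kappa = $ENV{KAPPA_DIR} or die "KAPPA_DIR not set";

    system("$kappa/ffclean in=residual_cube out=residual_cln " .
           "box=25 clip='[3,3,3]' axes=3 genvar=true") == 0
        or die "ffclean failed";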
2008-02-01
Creating artificial time series from a sky cube
2008-01-31
Time accounting for JLS
Another issue for JLS observing is time accounting: how to account in the OMP for the time spent on data that has been deemed unacceptable. These data do not get charged to the surveys, so if the data were initially QUESTIONABLE, retroactive action will need to be taken to account for that time correctly (notwithstanding the issue of shared calibrations; see later). So that we can track how much time has been REJECTed by each survey, there will be a special project code (e.g. MJLSG00) for each survey, to which REJECT observations will be charged.
The idea is that when the obslog flag changes:
1. an email is triggered to ACC and PIs notifying the change
2. if the change is to REJECT then the release date is automatically changed to TODAY (or equivalent)
3. ACC runs up nightrep for the night in question and changes the time accounting accordingly.
4. changes to flags are propagated to CADC
In a situation where calibrations are provided by the observatory this system should work flawlessly (and a tool which takes care of the time accounting automatically would also be feasible). However, in the current situation where calibrations are shared amongst the projects, it is difficult to do the time accounting properly in this scheme as it is not immediately obvious how much calibration time should be taken with the REJECTed observation. It would have to be recalculated.
How to properly deal with BAD/QUESTIONABLE data within the JLS
The problem stems from what happens to questionable data which remains in some form of limbo until its status is deemed to be GOOD or BAD. Setting data to BAD is an issue in itself as such data are usually BAD because of a fault. However, the Legacy nature of JLS means that some data will be deemed to be unacceptable for the surveys and so should not be processed into Advanced Data Products. These data are not intrinsically BAD and so the plan is for them to be immediately released to the public.
We resolve this issue by having a new quality parameter in obslog - REJECT. Definitions:
GOOD: data is good and is processed by the pipeline
QUESTIONABLE: data may have problems with it - a human needs to look and make a decision on its quality
BAD: data is not good, do not process
REJECT: data does not meet agreed standards for survey
We don’t expect to be using the REJECT flag during normal PI observations. Furthermore, it is expected that with working QA in the survey pipelines the number of REJECTs and QUESTIONABLEs will be small (but there will be enough, especially at the beginning as we're bedding in the system, that we need to deal with them appropriately).
The following tables summarise what happens to data with these flags:
            _cube   _reduced   group   proprietary
GOOD          Y        Y         Y         Y
BAD           Y        N         N         N
QUEST.        Y        Y         N         Y
REJECT        Y        Y         N         N

            ADP    charged   VO/master product
GOOD          Y        Y          Y
BAD           N        N          N
QUEST.        N        Y          N
REJECT        N        N          Y
(N.B. QUESTIONABLE data should not be combined into the public VO product: those data are in an undefined quantum state, and until their wave functions have collapsed into GOOD, BAD or REJECT you don’t know what to do with them.)
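For scripts that need to act on these flags, one purely illustrative way to encode the two tables above is a simple lookup structure; obslog itself remains the authoritative source of the flags.

    #!/usr/bin/env perl
    # Illustrative only: the dispositions from the tables above as a lookup
    # table a processing script might consult.
    use strict;
    use warnings;

    my %disposition = (
        GOOD         => { _cube => 1, _reduced => 1, group => 1, proprietary => 1,
                          adp => 1, charged => 1, vo_master => 1 },
        BAD          => { _cube => 1, _reduced => 0, group => 0, proprietary => 0,
                          adp => 0, charged => 0, vo_master => 0 },
        QUESTIONABLE => { _cube => 1, _reduced => 1, group => 0, proprietary => 1,
                          adp => 0, charged => 1, vo_master => 0 },
        REJECT       => { _cube => 1, _reduced => 1, group => 0, proprietary => 0,
                          adp => 0, charged => 0, vo_master => 1 },
    );

    my $flag = 'REJECT';
    print "Include in group coadd? ",
          ($disposition{$flag}{group} ? "yes" : "no"), "\n";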