This page serves as a troubleshooting reference for common problems encountered in the SDPS. If a problem is not listed, it may be new and therefore undocumented. If the problem is listed, try to identify which of the possible causes applies. Attempt to resolve the problem only if you feel comfortable doing so; otherwise, notify one of the individuals responsible for the function experiencing the problem.
Click on a category that best suits the problem
SeaWiFS Scheduler
Visual Database Cookbook (VDC)
Utility Programs (GUIs)
Miscellaneous Conditions
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| no confirmation mail sent to remote user | notification message was filtered into holding directory (see next problem) | If we have the login information for the node, the message can be moved to the incoming directory. Otherwise, the sender must be contacted, and login information must be provided. |
| | scan_mail not running | Enable Scan_Mail task; turn on the scheduler |
| | remote_address is missing or invalid | Inform sender that address is incorrect |
| mail does not get filtered into incoming-mail directory | subject line does not contain the string, "Data file notification" | Sender needs to ensure that the subject is correct |
| | remote node not in resources/operational_hrpt.dat | If notification messages from the site are consistently correct, the node can be added to the operational_hrpt file |
| | file name conflicts with data type | Inform sender of the problem |
| record for mail message does not get inserted into external_file table | message missing required information or contained invalid information | Inform sender of the problem |
| | message for a file that was already ingested | No action necessary; the system will send a reply noting the duplicate notification message |
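
The subject-line condition in the table above can be checked before a sender is contacted. The following is a minimal sketch, not the filter that scan_mail actually uses: the function name `subject_qualifies` is hypothetical, and the comparison is assumed to be a case-sensitive substring match on the string quoted in the table.

```python
from email import message_from_string

# String the table above says must appear in the subject line.
REQUIRED_SUBJECT = "Data file notification"


def subject_qualifies(raw_message: str) -> bool:
    """Return True if the message's Subject header contains the required string.

    Assumption: a case-sensitive substring match. The real mail filter
    may apply different rules.
    """
    message = message_from_string(raw_message)
    subject = message.get("Subject", "")  # empty string when the header is absent
    return REQUIRED_SUBJECT in subject
```

Passing the raw text of a held message to `subject_qualifies` shows whether the subject line alone explains why the message was not filtered into the incoming-mail directory.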
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| fetcher does not create ingest task for a file | incorrect permissions on create_master_ftp.csh or script does not exist | Ouch! Reset the permissions if the script exists; otherwise, attempt to retrieve the file from backup |
| | incorrect permissions on ~resources/scripts_template/fetcher or the file does not exist | Ouch again! Reset the permissions if the file exists; otherwise, attempt to retrieve it from backup |
| | if file is from an HRPT station, there could be time constraints defined that defer the ingest. Check the HRPT-time GUI and the fetcher log file. | No action necessary for this cause. Fetcher will schedule a task for the file when the time window opens |
| | datatype value is unknown to fetcher. Check the external_file record and/or fetcher's log file. | Either the datatype value must be changed or the fetcher program must be modified to deal with the new datatype |
| | missing value for one of the required external_file fields | Update the external_file record manually or remove the errant record from the database |
| not enough exclusive time allowed for each ingest task to complete | the value for the INGEST_TASK_TIME_MINUTES variable may be set too low. | Use the environment editor to modify the variable |
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| ingest task errors out before transfer begins | entry for remote node missing from $HOME/.netrc file | Add the login information for the remote node to the $HOME/.netrc file |
| transfer fails | remote machine is not accessible | Use ingest monitor GUI to set the status for the file(s) to REMOTE_HOLD until the problem is resolved. Send a message to the station representative if the problem is not resolved within a day. |
| | incorrect permissions on remote directory and/or file | Use ingest monitor GUI to set the status for the file(s) to REMOTE_HOLD, and send a message to the station representative detailing the problem. |
| | remote directory and/or file does not exist | None. The ingest script will send a message to the originator of the notification message stating that the file was not found in the specified location. |
| | login information for remote machine is incorrect | Use ingest monitor GUI to set the status for the file(s) to REMOTE_HOLD, and send a message to the station representative detailing the problem. |
| | local destination directory does not exist or has incorrect permissions set | Create the local destination directory and/or set the permissions correctly. |
| local file fails to uncompress | uncompressed version of local file already exists | Remove the uncompressed version of the file. |
| | file is corrupt | Retransfer. If the problem occurs on the second try, inform the station representative of the problem. |
| database update fails | duplicate-record violation | Investigate why there is a previously entered record. If it is errant, remove it, and set up the external_file record so that the file is re-ingested. |
| | database server down | Reset external_file record and retry the transfer when the server is available. |
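
The first row of the table above (missing `.netrc` entry) can be checked without opening the file by hand. This is a minimal sketch using Python's standard `netrc` module; the function name `has_netrc_entry` is hypothetical and is not part of the ingest software.

```python
import netrc


def has_netrc_entry(host, netrc_path=None):
    """Return True if the netrc file has an explicit entry for `host`.

    With netrc_path=None the standard module reads ~/.netrc. A missing or
    unparsable file is treated as "no entry".
    """
    try:
        entries = netrc.netrc(netrc_path)
    except (OSError, netrc.NetrcParseError):
        return False
    # `hosts` maps machine names to (login, account, password) tuples;
    # looking the host up here ignores any catch-all "default" entry.
    return host in entries.hosts
```

A call such as `has_netrc_entry("remote.node.name")`, with the real node name substituted, returns `False` when the login information still needs to be added to `$HOME/.netrc`.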
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| failure during SWl01 | SWl01 terminates abnormally while processing a scene | check log file for these failures, and refer the problem to the person responsible for the SWl01 program |
| | elements.dat file is not up to date (> 30 days old) | Make sure that the $ELEMENTS environment variable is pointing to the correct version of elements.dat. |
| | missing or corrupt navigational parameter files | Make sure that the following environment variables are defined within the SWl01 environment and that they point to valid, existing files. |
| failure loading database with metadata files | metadata file is corrupt or missing | Look for the metadata file in the $L1A_META_DIR directory, and make sure it contains all of the required fields. |
| | $L1A_META_DIR variable points to an invalid location | Check for the existence of the directory defined, and verify correct permission settings. |
| | Database server down | Wait for server to become available, and re-run SWl01process. |
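
The 30-day limit on elements.dat mentioned in the table above can be tested with a short script. This is a minimal sketch that assumes the file's modification time is a reasonable proxy for the age of the elements; the function names `file_age_days` and `is_stale` are hypothetical.

```python
import os
import time

SECONDS_PER_DAY = 86400.0


def file_age_days(path, now=None):
    """Return the age of `path` in days, based on its modification time."""
    now = time.time() if now is None else now
    return (now - os.path.getmtime(path)) / SECONDS_PER_DAY


def is_stale(path, max_age_days=30, now=None):
    """Return True if `path` was last modified more than `max_age_days` ago."""
    return file_age_days(path, now) > max_age_days
```

For example, `is_stale(os.environ["ELEMENTS"])` reports whether the file that `$ELEMENTS` points to is past the 30-day limit. It raises `KeyError` if the variable is not set and `OSError` if the file does not exist, which are themselves worth investigating.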
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| no VDC scripts are created for a scene file | entrance directory already contains the maximum number of VDC scripts | Increase maximum number of scripts that can reside in the entrance directory ($MAKEVDC_MAX_FILES) |
| | QC-status code is not a passed value | File must be passed by Cal/Val before it is allowed to be processed. Make sure that the catalog record reflects the failed status. |
| | priority code is not in the active-priority list | Use the VDC Priorities Editor to add the priority for the scene to the active priority list. |
| | assigned recipe is not in the recipes table | Use the recipe editor to add the recipe to the recipes table. |
| | template for recipe is missing | Create a template for the recipe. |
| | ancillary data could not be staged | Investigate why the ancillary data cannot be staged, for example an error in the staging procedure or missing ancillary data records in the database. |
| | wait-for-ancillary-data period has not expired | If an expiration time has been defined, a script will be created for the scene when that time arrives. Otherwise, no script for the file will be created until the ancillary data is made available. |
| | MakeVDC task is disabled | Enable the task using PROSTAT. |
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| copy to location fails | incorrect permission on destination directory | Set correct permissions on the destination directory. |
| | no RCP privileges on destination host | Add an entry to the .rhosts file on the destination host. |
| | insufficient space in destination directory | Allocate more disk resources. Reset the hurl_status or daac_transfer_flag field for the file in the corresponding table. |
| database update fails | update would violate primary index constraint on duplicate rows | Eliminate the offending database record. |
| | database server unreachable | Update the database records manually when the server is available. |
| | incorrect database-login information in configuration files | Verify the correctness of the following environment variables: |
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| no files are transferred and DAAC task fails | DADS manager is unavailable | Contact the DAAC representative responsible for the DADS manager. Restart DAAC task when the DADS manager is available. |
| error reported for individual files | symbolic link is stale (does not point to an actual file) | Investigate the cause of the stale link and either remove it or re-create the link so it points to the file to be transferred. |
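
Stale symbolic links like the ones described in the table above can be listed before the DAAC task is restarted. This is a minimal sketch, assuming the links to be transferred all sit directly in one directory; the function name `find_stale_links` is hypothetical.

```python
import os


def find_stale_links(directory):
    """Return the sorted paths of symbolic links in `directory` whose
    targets do not exist."""
    stale = []
    for entry in os.scandir(directory):
        # os.path.exists() follows the link, so it is False when the
        # target is missing even though the link itself is present.
        if entry.is_symlink() and not os.path.exists(entry.path):
            stale.append(entry.path)
    return sorted(stale)
```

Each path returned is a link that should be either removed or re-created so that it points to the file to be transferred.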
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| Log file indicates 'file not within binning period' for a file on the edge of a dataday | Spacebinner and Timebinner use different criteria to establish binning period. The spacebinner uses dates and time to delimit datadays, while the timebinner uses an orbit range. This sometimes results in the spacebinner splitting dateline-crossing files that fall outside of the orbit range for a dataday. It happens most often during the extreme winter and summer months. | Reschedule the timebin job with the Timebin GUI but exclude the offending file. |
| Log file indicates 'file not within binning period' for a file in the middle of a day | Two-line elements are out of date. The spacebinner and timebinner use a program called getorb (aka bobdays) to derive the start and stop date and orbit ranges for a dataday. If the two-line elements are out of date, meaning that the most-recent element file is more than 2 weeks old, the results from getorb become increasingly inaccurate, and the spacebinner may tag scenes for the wrong dataday. When the timebinner attempts to bin these files, it fails because the file(s) fall outside of the orbit range for the dataday. | Update the two-line element files using the get_orbital_elements.csh script. Make sure that the NORAD_SOURCE_NODE and NORAD_SOURCE_DIR environment variables are defined correctly. New two-line elements must be propagated to secondary processing machines before processing resumes. |
| Scenes missing from timebin product | Two-line elements are out of date. This is the opposite of the case above: instead of incorrectly tagging scenes for a dataday, the spacebinner omits files that should be included for a dataday. | Update the two-line element files using the get_orbital_elements.csh script. Make sure that the NORAD_SOURCE_NODE and NORAD_SOURCE_DIR environment variables are defined correctly. New two-line elements must be propagated to secondary processing machines before processing resumes. |
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| VDC does not allocate an available CPU | day-of-week and/or time constraint on the CPU | Use the Resources GUI to adjust CPU availability |
| | host name in resources table does not match host name in hosts table | Manually update the resources and/or hosts table so that the hostname fields are consistent. |
| | entrance program is not running | Start entrance. |
| VDC starts more than one stream on the same CPU | CPU status was marked available while a stream was assigned to the CPU. This can be caused by killing streams while the current step is copying a file from the tape library. | Shut down VDC and kill/fail all streams on the CPU in question. Restart each of the failed streams. |
| Problem | Possible Causes | Possible Actions |
|---|---|---|
| VDC kills and restarts a stream step before the step has completed | allotted run time for step is insufficient | Use the recipe editor to increase the allotted run time. |
| step does not run | step is disabled | Use the recipe editor to enable the step. |
| step is marked 'BUSY', but no log file can be displayed | script associated with step encountered a fatal error: does not exist or has improper permissions | Check script for syntax errors. |
| | insufficient rsh privileges on VDC host | Verify/add rsh permissions on VDC host. |
| current step marked 'READY' but next step does not start | master program not running or is paused | Start/resume master. |
| | QC code in activeproc table is not a PASSED_AUTO status | Investigate what caused QC status code to change. Either reset the status manually or kill/fail the stream. |
| step marked 'ERROR' but stream is not failed automatically | insufficient commands in on_error.csh or on_error.csh does not exist in VDC working directory | Verify that the on-error script portion of the recipe template performs all required steps to fail a stream. If the on-error script is not present in the VDC working directory, fail the stream and restart it. |
| step errors if restarted | missing vdc.csh | Fail and restart the stream. |
| Possible Causes | Possible Actions |
|---|---|
| turned off manually | Check global log for shutdown messages. This is generally not a problem. |
| turned off by midnight changeover script | Check system processes to see if midnight_changeover.csh script is still running. If so, you need to determine what step within the script is hung up. That can be done with the UNIX ps command and noting the process ID (pid) of the midnight_changeover.csh process. Then look for other processes that are parented by midnight_changeover.csh. Apply that technique to each process until you find out exactly what is causing the problem. It may be necessary to kill one of the child processes to get the midnight_changeover.csh script to continue. |
| turned off by packup and distribute script | Check system processes to see if pkup_n_dist.csh is still running. If so, follow a procedure similar to that described for the midnight changeover script above, but use pkup_n_dist.csh in place of midnight_changeover.csh. Note that pkup_n_dist.csh waits for busy streams to complete. If there are any streams marked busy (stream_status = b), the script will sleep until they have completed. Query the streams table in isql to determine whether there are any streams marked busy. VDCMON cannot necessarily see all streams that may be marked busy. The pkup_n_dist.csh script may have encountered a fatal error that caused the script to exit abnormally before it could turn the processing system back on. The log file for pkup_n_dist.csh should be available in the mail for the processing user: i.e. seawifsp, seawifst. If you use 'elm', the most-recent messages will be displayed first, which makes finding the specific message a little easier. You must determine what the cause of the fatal error was and what state the secondary machines are in before you restart the system. Directories could be missing from the secondary machines, which will cause errors within the processing system. |
| control table values updated with isql | Check contents of control table. Someone might have inadvertently reloaded the table with its initialization script. |
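
The parent-to-child tracing described above for midnight_changeover.csh and pkup_n_dist.csh can be done on a single `ps` listing. This is a minimal sketch, assuming a `ps` that accepts the POSIX `-e` and `-o` options; the function names `snapshot`, `parse_ps`, `find_pids` and `descendants` are hypothetical.

```python
import subprocess
from collections import defaultdict


def snapshot():
    """Return the output of ps as 'pid ppid command' lines without headers."""
    result = subprocess.run(
        ["ps", "-e", "-o", "pid=", "-o", "ppid=", "-o", "args="],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def parse_ps(ps_output):
    """Map each pid to a (ppid, command) tuple."""
    procs = {}
    for line in ps_output.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 2 or not (parts[0].isdigit() and parts[1].isdigit()):
            continue  # skip blank or malformed lines
        command = parts[2] if len(parts) > 2 else ""
        procs[int(parts[0])] = (int(parts[1]), command)
    return procs


def find_pids(procs, name):
    """Return the pids whose command line contains `name`."""
    return sorted(pid for pid, (_ppid, cmd) in procs.items() if name in cmd)


def descendants(procs, root_pid):
    """Return (pid, command) for every process parented, directly or
    indirectly, by `root_pid`."""
    children = defaultdict(list)
    for pid, (ppid, _cmd) in procs.items():
        children[ppid].append(pid)
    found, seen, stack = [], {root_pid}, [root_pid]
    while stack:
        for child in sorted(children.get(stack.pop(), [])):
            if child not in seen:  # guard against self-parented entries
                seen.add(child)
                found.append((child, procs[child][1]))
                stack.append(child)
    return found
```

A typical use is `procs = parse_ps(snapshot())`, then `find_pids(procs, "midnight_changeover.csh")` to get the script's pid, then `descendants(procs, pid)` to list the processes it is waiting on. The last entry in a chain is usually the step that is hung.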
| Possible Causes | Possible Actions |
|---|---|
| System crash | If the VDC_PRIMARY_HOST machine crashes or goes down while the Scheduler and/or VDC are on, the Utility GUI will still indicate that the programs are running. This is because the GUI reads the values out of the control table, which is not updated if the system goes down unexpectedly. You can use the UNIX ps command to verify whether the programs are running. If not, simply start them with the Utility GUI. If they are running, examine their log files with the Utility GUI. The network may be hung up or the transaction log in the processing database may be full. |
| Program crash | One or more of the programs may have died due to an illegal operation (bus error, segmentation fault). Look for a core file in the src/corefiles directory. If the program continues to die each time it is started, further debugging is required to uncover the problem. |