Message boards : RALPH@home bug list : Bug reports for Ralph 5.37 through 5.40
Author | Message |
---|---|
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
Not sure what to make of this. Two 5.37 WUs ran over the weekend. PC was up and running all weekend. I've got 24hr runtime preference on Ralph. Both seem to have failed after running for 2 days, but CPU time shows as 00:01:23 and :25 for these.
312611
312612
Here's the messages, which shows you the entire timeline:
11/4/2006 12:09:31 AM|ralph@home|Sending scheduler request to https://ralph.bakerlab.org/ralph_cgi/cgi
11/4/2006 12:09:31 AM|ralph@home|Reason: To fetch work
11/4/2006 12:09:31 AM|ralph@home|Requesting 207360 seconds of new work
11/4/2006 12:09:36 AM|ralph@home|Scheduler request succeeded
11/4/2006 12:09:38 AM|ralph@home|Started download of file rosetta_beta_5.37_windows_intelx86.exe
11/4/2006 12:09:38 AM|ralph@home|Started download of file hom001_aat364_03_05.200_v1_3.gz
11/4/2006 12:09:41 AM|ralph@home|Finished download of file hom001_aat364_03_05.200_v1_3.gz
11/4/2006 12:09:41 AM|ralph@home|Throughput 728542 bytes/sec
11/4/2006 12:09:41 AM|ralph@home|Started download of file hom001_aat364_09_05.200_v1_3.gz
11/4/2006 12:09:46 AM|ralph@home|Finished download of file hom001_aat364_09_05.200_v1_3.gz
11/4/2006 12:09:46 AM|ralph@home|Throughput 904946 bytes/sec
11/4/2006 12:09:46 AM|ralph@home|Started download of file hom001_t364_.fasta.gz
11/4/2006 12:09:47 AM|ralph@home|Finished download of file hom001_t364_.fasta.gz
11/4/2006 12:09:47 AM|ralph@home|Throughput 1127 bytes/sec
11/4/2006 12:09:47 AM|ralph@home|Started download of file t364_1_S_00001_0000007_0.pdb.gz
11/4/2006 12:09:48 AM|ralph@home|Finished download of file rosetta_beta_5.37_windows_intelx86.exe
11/4/2006 12:09:48 AM|ralph@home|Throughput 925633 bytes/sec
11/4/2006 12:09:48 AM|ralph@home|Finished download of file t364_1_S_00001_0000007_0.pdb.gz
11/4/2006 12:09:48 AM|ralph@home|Throughput 98559 bytes/sec
11/4/2006 12:09:48 AM|ralph@home|Started download of file t364_loopfile.gz
11/4/2006 12:09:48 AM|ralph@home|Started download of file paths_200_t364.txt
11/4/2006 12:09:50 AM|ralph@home|Finished download of file t364_loopfile.gz
11/4/2006 12:09:50 AM|ralph@home|Throughput 413 bytes/sec
11/4/2006 12:09:50 AM|ralph@home|Finished download of file paths_200_t364.txt
11/4/2006 12:09:50 AM|ralph@home|Throughput 9097 bytes/sec
11/4/2006 12:09:50 AM|ralph@home|Started download of file t364_1_S_00001_0000008_0.pdb.gz
11/4/2006 12:09:51 AM||Rescheduling CPU: files downloaded
11/4/2006 12:09:51 AM|ralph@home|Finished download of file t364_1_S_00001_0000008_0.pdb.gz
11/4/2006 12:09:51 AM|ralph@home|Throughput 99307 bytes/sec
11/4/2006 12:09:51 AM||Using earliest-deadline-first scheduling because computer is overcommitted.
11/4/2006 12:09:51 AM|rosetta@home|Pausing task FRA_t368_HOMOENVLOOPRLX_hom001_11_t368_11_dec01IGNORE_THE_REST_3_1329_45_0 (left in memory)
11/4/2006 12:09:51 AM|ralph@home|Starting task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_3_2 using rosetta_beta version 537
11/4/2006 12:09:51 AM||Suspending work fetch because computer is overcommitted.
11/4/2006 12:09:52 AM||Rescheduling CPU: files downloaded
11/4/2006 12:09:52 AM|rosetta@home|Pausing task BENCH_ABRELAX_SAVE_ALL_OUT_1iibA_BARCODE_R85_filters_1328_358_0 (left in memory)
11/4/2006 12:09:52 AM|ralph@home|Starting task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000008_0.pdb_1447_3_2 using rosetta_beta version 537
11/4/2006 7:00:00 AM||Suspending network activity - time of day
11/4/2006 7:00:00 PM||Resuming network activity
11/4/2006 11:48:10 PM|rosetta@home|Sending scheduler request to https://boinc.bakerlab.org/rosetta_cgi/cgi
11/4/2006 11:48:10 PM|rosetta@home|Reason: To report completed tasks
11/4/2006 11:48:10 PM|rosetta@home|Reporting 2 tasks
11/4/2006 11:48:15 PM|rosetta@home|Scheduler request succeeded
11/5/2006 7:00:00 AM||Suspending network activity - time of day
11/5/2006 7:00:00 PM||Resuming network activity
11/6/2006 7:00:00 AM||Suspending network activity - time of day
11/6/2006 9:00:58 AM|ralph@home|Unrecoverable error for result FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_3_2 ( - exit code -529697949 (0xe06d7363))
11/6/2006 9:00:58 AM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
11/6/2006 9:00:58 AM||Rescheduling CPU: application exited
11/6/2006 9:00:58 AM|ralph@home|Computation for task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_3_2 finished
11/6/2006 9:00:58 AM|rosetta@home|Resuming task BENCH_ABRELAX_SAVE_ALL_OUT_1iibA_BARCODE_R85_filters_1328_358_0 using rosetta version 536
11/6/2006 9:01:12 AM|ralph@home|Unrecoverable error for result FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000008_0.pdb_1447_3_2 ( - exit code -529697949 (0xe06d7363))
11/6/2006 9:01:12 AM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
11/6/2006 9:01:14 AM||Rescheduling CPU: application exited
11/6/2006 9:01:14 AM|ralph@home|Computation for task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000008_0.pdb_1447_3_2 finished
11/6/2006 9:01:14 AM|rosetta@home|Resuming task FRA_t368_HOMOENVLOOPRLX_hom001_11_t368_11_dec01IGNORE_THE_REST_3_1329_45_0 using rosetta version 536 |
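For anyone who wants to check how long a task sat between "Starting task" and "Unrecoverable error" in a message log like the one above, here is a minimal Python sketch. It assumes each entry has the form `timestamp|project|message` and that the timestamp is month/day/year with AM/PM, as in this particular log; clients that log day/month/year in 24-hour time would need a different format string. The helper name `parse_line` is made up for illustration and is not part of BOINC.

```python
from datetime import datetime

def parse_line(line):
    """Split one 'timestamp|project|message' entry into its three fields."""
    # Split on the first two pipes only, so pipes inside the message survive.
    stamp, project, message = line.split("|", 2)
    # Assumption: US-style date with a 12-hour clock, as in the log above.
    when = datetime.strptime(stamp.strip(), "%m/%d/%Y %I:%M:%S %p")
    return when, project, message

sample = [
    "11/4/2006 12:09:51 AM|ralph@home|Starting task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_3_2 using rosetta_beta version 537",
    "11/6/2006 9:00:58 AM|ralph@home|Unrecoverable error for result FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_3_2 ( - exit code -529697949 (0xe06d7363))",
]

started, _, _ = parse_line(sample[0])
failed, _, _ = parse_line(sample[1])
elapsed = failed - started
print(elapsed)  # 2 days, 8:51:07
```

That wall-clock gap of a little over two days is what makes the reported CPU time of about a minute and a half look so odd.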
anders n Joined: 16 Feb 06 Posts: 166 Credit: 131,419 RAC: 0 |
file_xfer_error One more. https://ralph.bakerlab.org/result.php?resultid=314848 Anders n |
wraith Joined: 31 Oct 06 Posts: 4 Credit: 382 RAC: 0 |
This one ran for 4 hours and hung at 100%, but could not be preempted (it stayed in the run state with no activity). I stopped and restarted BOINC and the WU restarted from 0.0, with no checkpointing at all. Unfortunately, this restart lost the stdout/stderr files. I am letting the unit rerun in hopes of catching the stdout/stderr files on the next glitch. https://ralph.bakerlab.org/result.php?resultid=314941 |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I should also point out that when I returned to my PC, I had a prompt from my ZoneAlarm firewall telling me that Rosetta Beta v5.37 was trying to use the internet and asking whether I should allow it. I wasn't here to respond, so I presume it timed out. Is there any way to force this debug communication attempt, so I could get a given version of the application set up in the firewall BEFORE an error causes it to try to use the net? |
Chu Volunteer moderator Project developer Project scientist Joined: 26 Sep 06 Posts: 61 Credit: 12,545 RAC: 0 |
Thank you all for the help. We have already noticed that there are a lot of failures with error code -161 for the newly updated application 5.38. We are investigating the cause for it now... |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
V5.38 WU 315376 just crashed on my other machine. I just HAPPENED to be enlarging the native structure shown at the time of failure, so that was the first I'd brought up the graphic for this WU, rotated the lowest energy, then enlarged native and then crash and burn. exit code -1073741819 24hr RT preference. |
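The negative exit codes in this thread are easier to look up once they are rewritten as unsigned 32-bit hex values, which is how Windows status codes are normally listed. A small Python sketch, with a hypothetical helper name `to_unsigned_hex`; the two meanings in the table are the standard Windows ones for those values, not anything specific to Rosetta or BOINC:

```python
def to_unsigned_hex(exit_code):
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    # Masking with 0xFFFFFFFF keeps the low 32 bits (two's complement view).
    return f"0x{exit_code & 0xFFFFFFFF:08X}"

# Standard Windows meanings of these two values.
KNOWN = {
    "0xC0000005": "access violation",
    "0xE06D7363": "unhandled Microsoft C++ exception",
}

for code in (-1073741819, -529697949):
    hex_code = to_unsigned_hex(code)
    print(code, hex_code, KNOWN.get(hex_code, "unknown"))
```

The second value matches the `(0xe06d7363)` that the client already prints next to `-529697949` in the 5.37 log earlier in the thread.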
Chu Volunteer moderator Project developer Project scientist Joined: 26 Sep 06 Posts: 61 Credit: 12,545 RAC: 0 |
Hi feet1st. That is really helpful. We have seen that error message (in the debugging output) quite a few times, but have had no luck finding a clue to its cause. Thanks to your report, we at least know it is somehow related to the graphics, and I think that will help us a lot in investigating the real cause. Quoted post from feet1st: "V5.38 WU 315376 just crashed on my other machine. I just HAPPENED to be enlarging the native structure shown at the time of failure, so that was the first I'd brought up the graphic for this WU, rotated the lowest energy, then enlarged native and then crash and burn." |
Leffe Joined: 19 Feb 06 Posts: 10 Credit: 3,683 RAC: 0 |
got 1 error:
06/11/2006 18:37:32|ralph@home|Computation for task fibril_abeta40_test1_1463_63_0 finished
06/11/2006 18:37:32|Spinhenge@home|Starting task fullerene2_4353_0 using metropolis version 242
06/11/2006 18:37:33|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_63_0 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_63_0_0</file_name> <error_code>-161</error_code></file_xfer_error>) |
sslickerson Joined: 15 Feb 06 Posts: 17 Credit: 4,006 RAC: 0 |
Error: Result
11/5/2006 9:10:01 PM|NanoHive@Home|Project is down
11/6/2006 12:47:28 PM|ralph@home|Sending scheduler request: To fetch work
11/6/2006 12:47:28 PM|ralph@home|Requesting 591 seconds of new work
11/6/2006 12:47:33 PM|ralph@home|Scheduler RPC succeeded [server version 505]
11/6/2006 12:47:33 PM|ralph@home|No work from project
11/6/2006 12:57:34 PM|ralph@home|Sending scheduler request: To fetch work
11/6/2006 12:57:34 PM|ralph@home|Requesting 296 seconds of new work
11/6/2006 12:57:39 PM|ralph@home|Scheduler RPC succeeded [server version 505]
11/6/2006 1:00:22 PM|ralph@home|Computation for task fibril_abeta40_test1_1463_20_1 finished
11/6/2006 1:00:22 PM||Starting fibril_abeta40_test1_1463_134_1
11/6/2006 1:00:22 PM|ralph@home|Starting task fibril_abeta40_test1_1463_134_1 using rosetta_beta version 538
11/6/2006 1:00:24 PM|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_20_1 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_20_1_0</file_name> <error_code>-161</error_code></file_xfer_error>)
11/6/2006 1:17:12 PM|rosetta@home|Computation for task DOC_1CGI_R061030_st_model_06_1352_838_0 finished
11/6/2006 1:17:12 PM||Starting DOC_1CGI_R061030_st_model_07_1352_963_0
11/6/2006 1:17:12 PM|rosetta@home|Starting task DOC_1CGI_R061030_st_model_07_1352_963_0 using rosetta version 536
11/6/2006 1:17:14 PM|rosetta@home|Started upload of file DOC_1CGI_R061030_st_model_06_1352_838_0_0
11/6/2006 1:17:22 PM|rosetta@home|Finished upload of file DOC_1CGI_R061030_st_model_06_1352_838_0_0 |
Bruno Ramone Joined: 29 Oct 06 Posts: 1 Credit: 78 RAC: 0 |
2006-11-06 21:03:29|ralph@home|Unrecoverable error for result FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000005_0.pdb_1447_1_2 (System nie może odnaleźć określonej ścieżki. [The system cannot find the path specified.] (0x3) - exit code 3 (0x3)) |
keltoi Joined: 11 Aug 06 Posts: 1 Credit: 1,369 RAC: 0 |
11/6/2006 2:35:31 PM|ralph@home|Unrecoverable error for result FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_10_2 (The system cannot find the path specified. (0x3) - exit code 3 (0x3))
11/6/2006 2:35:31 PM|ralph@home|Computation for task FRA_t364_CASP7_hom001_1_IGNORE_THE_RESTt364_1_S_00001_0000007_0.pdb_1447_10_2 finished |
sslickerson Joined: 15 Feb 06 Posts: 17 Credit: 4,006 RAC: 0 |
Error: Result
11/6/2006 1:47:17 PM|ralph@home|Computation for task fibril_abeta40_test1_1463_134_1 finished
11/6/2006 1:47:18 PM|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_134_1 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_134_1_0</file_name> <error_code>-161</error_code></file_xfer_error>)
11/6/2006 1:47:18 PM|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
11/6/2006 1:48:22 PM|ralph@home|Sending scheduler request: Requested by user
11/6/2006 1:48:22 PM|ralph@home|Reporting 1 tasks
11/6/2006 1:48:27 PM|ralph@home|Scheduler RPC succeeded [server version 505] |
feet1st Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
May I ask? I hope a good description can be written for when 5.38 comes to Rosetta... WHY is "...outputting structures with non-ideal backbone and sidechain geometries" an improvement? I know, it's useful to the science... but please explain more; on the surface it sounds to a layperson like a step backwards. Also, what impact will this have on the user experience? Will it mean we'll see larger upload sizes on results? |
Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
>>> The following workunits, whilst running till completion and finished, only produced 2 Decoys in over 5 hours each. In both cases they stayed at 1.00% for about 1 1/2 hours then went to 1.01% for possibly another 2 hours or more before finishing:- https://ralph.bakerlab.org/result.php?resultid=314543 https://ralph.bakerlab.org/result.php?resultid=314542 Ralph 5.38. |
JKeck {pirate} Joined: 16 Feb 06 Posts: 14 Credit: 153,095 RAC: 0 |
It looks like all of the 5.38 tasks I have gotten have crashed. The output on the web page for mine is similar to what is already posted in this thread. One thing I have to add is that on a dual-core machine it was using ~98% of the CPU instead of the ~49% that it should have been. BOINC WIKI BOINCing since 2002/12/8 |
Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
>>> The following workunits, whilst running till completion and finished, only produced 2 Decoys in over 5 hours each. In both cases they stayed at 1.00% for about 1 1/2 hours then went to 1.01% for possibly another 2 hours or more before finishing:- Another 3 have failed https://ralph.bakerlab.org/result.php?resultid=314545 https://ralph.bakerlab.org/result.php?resultid=314601 https://ralph.bakerlab.org/result.php?resultid=314602 All 3 were minutes from completion with all doing more than 21100 seconds (runtime preference is 21600 seconds or 6 hours), when they errored out. Workunit 314602 did 184 Decoys yet I am told it is invalid. All had "error code -161" and were "file_xfer_error". |
zombie67 [MM] Joined: 8 Aug 06 Posts: 75 Credit: 2,396,363 RAC: 6,299 |
Here are 4 (1x 5.38 & 3x 5.37)
https://ralph.bakerlab.org/result.php?resultid=315272 5.38
</stderr_txt>
<message>
<file_xfer_error>
<file_name>DOC_1STF_p2_fa_relax_from_native_1462_4_1_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
https://ralph.bakerlab.org/result.php?resultid=313951 5.37
<core_client_version>5.4.11</core_client_version>
<message>
The system cannot find the path specified. (0x3) - exit code 3 (0x3)
</message>
<stderr_txt>
# random seed: 2872068
# cpu_run_time_pref: 21600
</stderr_txt>
https://ralph.bakerlab.org/result.php?resultid=313010 5.37
<core_client_version>5.4.11</core_client_version>
<stderr_txt>
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 0 starting structures built 29 (nstruct) times
This process generated 0 decoys from 0 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
</stderr_txt>
<message>
<file_xfer_error>
<file_name>1ogw__ETABLE_TEST_ABRELAX_rhh13sm6__1452_10_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
https://ralph.bakerlab.org/result.php?resultid=313009 5.37
<core_client_version>5.4.11</core_client_version>
<stderr_txt>
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
input_etable: reading etable... dsolv
input_etable: WARNING etable types don't match! expected dsolv,606 got dsolv,721
Too many restarts with no progress. Keep application in memory while preempted.
======================================================
DONE :: 0 starting structures built 29 (nstruct) times
This process generated 0 decoys from 0 attempts
======================================================
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
</stderr_txt>
<message>
<file_xfer_error>
<file_name>1n0u__ETABLE_TEST_ABRELAX_rhh13sm6__1452_10_0_0</file_name>
<error_code>-161</error_code>
</file_xfer_error>
</message>
Reno, NV Team: SETI.USA |
Snorre Joined: 2 Nov 06 Posts: 2 Credit: 4,329 RAC: 0 |
Last three WUs returned crashed with the following errors:
07/11/2006 02:02:28|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_152_0 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_152_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
https://ralph.bakerlab.org/result.php?resultid=315577
07/11/2006 02:05:54|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_133_1 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_133_1_0</file_name> <error_code>-161</error_code></file_xfer_error>)
https://ralph.bakerlab.org/result.php?resultid=315723
07/11/2006 03:02:54|ralph@home|Unrecoverable error for result DOC_1BTH_p2_fa_relax_from_native_1462_1_2 (<file_xfer_error> <file_name>DOC_1BTH_p2_fa_relax_from_native_1462_1_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)
https://ralph.bakerlab.org/result.php?resultid=315833 |
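With this many `file_xfer_error` reports piling up, it can help to pull the file names and error codes out of a pasted log mechanically rather than by eye. The following is only a sketch: it assumes the `<file_name>` and `<error_code>` elements always appear next to each other in that order, as they do in the reports in this thread, and the function name `xfer_errors` is invented for illustration.

```python
import re
from collections import Counter

# Matches a <file_name> element followed directly by its <error_code>.
PATTERN = re.compile(
    r"<file_name>(?P<name>[^<]+)</file_name>\s*"
    r"<error_code>(?P<code>-?\d+)</error_code>"
)

def xfer_errors(text):
    """Return (file_name, error_code) pairs found in a pasted log."""
    return [(m.group("name"), int(m.group("code"))) for m in PATTERN.finditer(text)]

log = """07/11/2006 02:02:28|ralph@home|Unrecoverable error for result fibril_abeta40_test1_1463_152_0 (<file_xfer_error> <file_name>fibril_abeta40_test1_1463_152_0_0</file_name> <error_code>-161</error_code></file_xfer_error>)
07/11/2006 03:02:54|ralph@home|Unrecoverable error for result DOC_1BTH_p2_fa_relax_from_native_1462_1_2 (<file_xfer_error> <file_name>DOC_1BTH_p2_fa_relax_from_native_1462_1_2_0</file_name> <error_code>-161</error_code></file_xfer_error>)"""

errors = xfer_errors(log)
counts = Counter(code for _, code in errors)
for name, code in errors:
    print(code, name)
```

A regular expression is used instead of an XML parser because the fragments are embedded in plain log lines and are not a complete document.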
wraith Joined: 31 Oct 06 Posts: 4 Credit: 382 RAC: 0 |
This one ran for 4 hours and hung at 100%, but could not be preempted (it stayed in the run state with no activity). Here is a capture of the second STDOUT: http://web.hotiron.net/boinc/ralph/314941/stdout.txt |
genes Joined: 16 Feb 06 Posts: 45 Credit: 43,706 RAC: 20 |
Had some -161's: resultid=317218 resultid=317217 resultid=315313
a couple of "incorrect function" results: resultid=315015 resultid=315012
Here's a couple with a big dump: resultid=315014 resultid=315010
and a "downloading" one: resultid=314488
At least a few of these died while screensaver graphics were running. In general, I come back to the machine and see a different project's screensaver frozen on the screen, but it isn't running. The screensaver cannot be exited, but I can get the machine back with ctrl-alt-del. I look at task manager and see a Ralph WU's process listed, but using no CPU time. If I kill it, the screensaver graphics disappear and I get control of the machine back. The Ralph WU's status changes to "Computation error" and that's that.
That being said, the very latest one (resultid=317218, a -161 error) happened while I was using the machine and no graphics were being displayed. Still have a couple of 5.38's left, no 5.39's yet. |
©2024 University of Washington
http://www.bakerlab.org