| Author | Message |
|
|
|
We fixed something in the interaction of Rosetta with BOINC to trigger more informative debugging messages upon crashes. Please continue to post what goes wrong!
____________
|
|
|
|
|
We will, as soon as we're able to get some. :-/
____________
"I'm trying to maintain a shred of dignity in this world." - Me
 |
|
|
|
|
|
How long has the site been "Down for maintenance"? Neal |
|
|
|
|
|
At least as long as this thread has been going, so at least 24 hours. Going to blow past workunit deadlines soon :( |
|
|
|
|
|
Looks like it "went down" sometime late Friday. Maybe when they restarted the Rosetta server after the BOINC upgrade they forgot to check on poor old second-cousin Ralphie! |
|
|
|
|
|
It was a database problem, and the database guy on our team was unavailable. Ralph's back!
____________
|
|
|
|
|
|
Not sure if this is a bug or not, but in the graphics for this WU, the protein simply isn't there, though everything else appears to be running properly. Also, I'm sure this has been mentioned (haven't been around much lately), but with most of the WU graphics, when opened a second (and then third, etc.) time, the information at the bottom is shifted down so that the bottom line (the project URL & Accepted Energy) is not visible. Other than that, everything looks good so far.
____________
 |
|
|
|
|
|
1) I don't know if this was mentioned earlier, but the progress counting doesn't work well.
6 min ... 1.020%
24 min ... 1.041% (8:21:56 to completion)
40 min ... 1.042% (8:38:22 to completion)
1:09:30 ... completion
(I saw this also with 5.22)
2) More memory needed?
12/06/2006 09:42:53|ralph@home|Message from server: Your computer has only 402116608 bytes of memory; workunit requires 97883392 more bytes
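The two figures in that server message add up to the amount of memory the workunit asks for. A quick check in Python, with the numbers copied from the message (the variable names are only illustrative):

```python
# Values copied from the server message above.
available_bytes = 402_116_608  # "Your computer has only ... bytes of memory"
shortfall_bytes = 97_883_392   # "workunit requires ... more bytes"

# Total the workunit is asking for, in bytes and in MiB.
required_bytes = available_bytes + shortfall_bytes
required_mib = required_bytes / (1024 * 1024)

print(required_bytes)          # 500000000
print(round(required_mib, 1))  # 476.8
```

So the request appears to be a round 500,000,000 bytes, which a machine reporting about 383 MiB of memory falls short of.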
____________
|
|
|
|
|
|
Sadir: the percentage complete thing is perfectly normal, the first model just took that long.
No errors with 5.23 so far here; a couple of successful WUs finished. |
|
|
|
|
|
I have a funny (interesting) one. On my laptop (which has been pretty much flawless at Ralph, as opposed to my AMD64 3700 San Diego, which experiences the "fatal windows" error) I've seen something happen twice in 24 hours. Either the Rosetta 5.22 screensaver or the Ralph 5.23 screensaver is showing on my screen when I return from some personal task. The graphic will NOT go away when I move the mouse or press a key. I had another window open but couldn't see it. The mouse would still work behind the graphic: if I clicked all over, I could hear it interacting, but the Rosetta graphic would not release my screen. I ended up pressing the power button on both occasions, only to see the HD activity light blink and hear the Windows log-off WAV, while the Rosetta graphic stayed on the screen all the way to shutdown, when the screen went dead.
Since mine is the only report of this, it happened on both Rosetta and Ralph, and it hasn't happened with the laptop before, I will be running some adware/malware/virus scans to see if the problem is on my end.
tony |
|
|
|
|
|
Strange error in FRA_t316_CASP7_hom001_1_IGNORE_THE_RESTt316_1_PROTINFO-AB_TS1.pdb_666_2_0 .
At first it was running normally, but several Simap WUs had errors. Later a strange error message appeared, something I've never seen before:
"Runtime error!
Program: ...alph.bakerlab.org\rosetta_beta_5.23_windows_intelx86.exe
This application has requested the Runtime to terminate it in an unusual way. Please contact the application's support team for more information."
Screenshot is here: Ralph_error.gif (7.76KB)
My messages:
14. 6. 2006 10:50:47|ralph@home|Unrecoverable error for result FRA_t316_CASP7_hom001_1_IGNORE_THE_RESTt316_1_PROTINFO-AB_TS1.pdb_666_2_0 (The system cannot find the path specified. (0x3) - exit code 3 (0x3))
14. 6. 2006 10:50:47|ralph@home|Deferring scheduler requests for 1 minutes and 0 seconds
14. 6. 2006 10:50:47||Rescheduling CPU: application exited
14. 6. 2006 10:50:47|ralph@home|Computation for task FRA_t316_CASP7_hom001_1_IGNORE_THE_RESTt316_1_PROTINFO-AB_TS1.pdb_666_2_0 finished
Full message log: Ralph_error_log_14june2006.txt (40KB)
The computer where this error happened is a PIII 500MHz, 160MB RAM, WinXP Home SP2, running only antivirus and BOINC with Simap and Ralph. (It has 512MB virtual memory, which is probably not enough for some bigger WUs.)
After this WU finished (with an error), Simap stopped producing errors and finished the next WU successfully. |
|
|
|
|
|
Rom had mentioned there might be a fix to the fatal windows errors in 5.23. When it was released, I set the box I usually got these errors with to NNW/NNT for all other projects and suspended them, so I'd run nothing but 5.23. I'm not ready to say "it's fixed", but so far it sure looks good.
177824 158140 14 Jun 2006 19:37:15 UTC 15 Jun 2006 10:43:14 UTC Over Success Done 13,545.38 53.30 53.30
177438 157770 14 Jun 2006 15:16:32 UTC 15 Jun 2006 7:36:07 UTC Over Success Done 14,102.19 55.50 55.50
176725 156687 14 Jun 2006 7:37:56 UTC 15 Jun 2006 1:10:40 UTC Over Success Done 13,275.25 52.24 52.24
176356 156712 14 Jun 2006 3:52:59 UTC 14 Jun 2006 20:28:26 UTC Over Success Done 13,985.28 53.74 53.74
175612 151093 13 Jun 2006 20:01:42 UTC 14 Jun 2006 16:47:14 UTC Over Success Done 14,084.34 54.12 54.12
174950 155410 13 Jun 2006 13:25:22 UTC 14 Jun 2006 15:16:32 UTC Over Success Done 14,111.06 54.22 54.22
174529 155065 13 Jun 2006 9:17:55 UTC 14 Jun 2006 7:37:56 UTC Over Success Done 14,101.56 54.18 54.18
174341 154879 13 Jun 2006 6:09:29 UTC 14 Jun 2006 3:52:59 UTC Over Success Done 14,346.30 55.12 55.12
173772 154359 12 Jun 2006 22:33:06 UTC 13 Jun 2006 20:01:42 UTC Over Success Done 14,103.70 54.19 54.19
173541 154155 12 Jun 2006 19:11:15 UTC 13 Jun 2006 10:39:56 UTC Over Success Done 14,441.13 55.49 55.49
170677 146450 11 Jun 2006 22:52:42 UTC 13 Jun 2006 9:17:55 UTC Over Success Done 13,161.97 50.57 50.57
170660 146482 11 Jun 2006 22:52:42 UTC 13 Jun 2006 3:03:42 UTC Over Success Done 14,275.66 54.85 54.85
170659 146481 11 Jun 2006 22:52:42 UTC 12 Jun 2006 8:30:20 UTC Over Success Done 13,858.98 53.25 53.25
|
|
|
|
|
|
Is this WU a troubled WU? FRA_t301_hom001_1_LOOPRLX_IGNORE_THE_REST__hom001_1_1bwzA__100_701_23_0
I have left it running for 1:34:10 now and it is only at 1.623%; the finish time is up to 3333:00:01 and still climbing.
____________
|
|
|
|
|
Never mind, it just hit 100% while I was typing this into the forums, so I am not sure what happened yet. Total time was 1:37:27.
____________
|
|
|
|
|
The progress indicator isn't linear. What you'll see are jumps in percentage. All WUs start at 1% and slowly proceed higher until one model is done. Then it jumps to another percentage, and the digits to the right of the decimal point slowly climb again until the next model is done. For example, you might see this if you checked the status every 10 min: 1.000, 1.001, 1.002, 1.003, 12.000, 12.001, 12.002, 12.003, 24.000, 24.001, and so on until the time runs out, at which point it jumps to 100%, uploads and reports.
The number of models depends on protein size and computer speed (for the most part). Every WU will run at least ONE model regardless of time (except where terminated by the "watchdog timer").
Does this help?
tony
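The jumps described above can be sketched as a toy model. This is only an illustration consistent with the description, not Rosetta's actual progress code: the function name, the 432-second model time and the 3600-second run time preference are all assumptions, and the slow creep in the decimals between jumps is left out.

```python
def reported_percent(models_done: int, seconds_per_model: float,
                     runtime_pref_seconds: float) -> float:
    """Toy model of the percentage jumps (not Rosetta's real code)."""
    if models_done == 0:
        return 1.0  # every WU starts at about 1% until the first model is done
    # After each finished model, jump to the share of the preferred run time used.
    done = 100.0 * models_done * seconds_per_model / runtime_pref_seconds
    return min(done, 100.0)

# Small protein: each model takes 432 s of a 3600 s run time preference.
print([reported_percent(n, 432, 3600) for n in range(4)])  # [1.0, 12.0, 24.0, 36.0]

# Slow model: one model takes 5847 s (1:37:27), longer than the preference.
print(reported_percent(1, 5847, 3600))  # 100.0
```

Under these assumptions, a WU whose first model takes longer than the preferred run time sits near 1% the whole time and then goes straight to 100%, which matches what was reported.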
|
|
|
|
|
Yes, and the watchdog did terminate that WU on this computer:
<core_client_version>5.4.9</core_client_version>
<stderr_txt>
# random seed: 2998884
# cpu_run_time_pref: 3600
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
# 0 starting pdbs were skipped
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
</stderr_txt>
Those are the results of it.
And this is the computer; I do not know if this will help the project:
CPU type GenuineIntel
Intel(R) Celeron(R) CPU 2.60GHz
Number of CPUs 1
Operating System Microsoft Windows XP
Home Edition, Service Pack 2, (05.01.2600.00)
Memory 510.98 MB
Cache 976.56 KB
Swap space 1248.2 MB
Total disk space 37.26 GB
Free Disk Space 22.39 GB
Measured floating point speed 1329.23 million ops/sec
Measured integer speed 2718.87 million ops/sec
Average upload rate 1.41 KB/sec
Average download rate 132.77 KB/sec
____________
|
|
|
|
|
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
This is normal; it just means the WU completed successfully and didn't need the watchdog, so it shut it down.
If you're talking about wuid=160649, then it completed successfully and has been credited. See the "result ID" for that WU below:
Result ID 180410
Name FRA_t301_hom001_1_LOOPRLX_IGNORE_THE_REST__hom001_1_1bwzA__100_701_23_0
Workunit 160649
Created 15 Jun 2006 22:54:54 UTC
Sent 15 Jun 2006 23:51:21 UTC
Received 16 Jun 2006 14:42:46 UTC
Server state Over
Outcome Success
Client state Done
Exit status 0 (0x0)
Computer ID 2910
Report deadline 19 Jun 2006 23:51:21 UTC
CPU time 5847.390625
stderr out <core_client_version>5.4.9</core_client_version>
<stderr_txt>
# random seed: 2998884
# cpu_run_time_pref: 3600
# DONE :: 1 starting structures built 0 (nstruct) times
# This process generated 1 decoys from 1 attempts
# 0 starting pdbs were skipped
BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
</stderr_txt>
Validate state Valid
Claimed credit 13.6983915054668
Granted credit 13.6983915054668
application version 5.23
If the protein is huge, your computer is old, or your runtime is set low, then this is what you should be seeing with your future WUs. It's helping.
tony |
|
|
|
|
OK, OK, I'm convinced. I haven't had any errors of any kind with 5.23. The following WUs can be added to the list of successes for my error-prone computer:
181942 162058 16 Jun 2006 18:44:02 UTC 17 Jun 2006 10:47:32 UTC Over Success Done 14,364.88 56.53 56.53
181682 161805 16 Jun 2006 14:44:34 UTC 17 Jun 2006 4:16:03 UTC Over Success Done 13,954.83 54.92 54.92
181002 161170 16 Jun 2006 8:41:29 UTC 17 Jun 2006 0:34:09 UTC Over Success Done 13,478.86 53.04 53.04
180751 160956 16 Jun 2006 4:39:34 UTC 16 Jun 2006 20:44:26 UTC Over Success Done 13,756.77 54.14 54.14
180432 151974 15 Jun 2006 23:29:54 UTC 16 Jun 2006 15:04:45 UTC Over Success Done 14,181.53 55.81 55.81
179254 159529 15 Jun 2006 11:09:29 UTC 16 Jun 2006 11:14:50 UTC Over Success Done 14,375.77 56.57 56.57
178840 159127 15 Jun 2006 7:36:07 UTC 16 Jun 2006 8:41:29 UTC Over Success Done 14,242.95 56.05 56.05
178558 158860 15 Jun 2006 4:15:02 UTC 15 Jun 2006 23:29:54 UTC Over Success Done 13,769.56 54.19 54.19
178127 158438 14 Jun 2006 23:29:32 UTC 15 Jun 2006 20:00:12 UTC Over Success Done 14,343.39 56.44 56.44
I've set the other projects back to "allow new work" and "resumed" them. Thanks for fixing this. |
|
|
|
|
1) I don't know if this was mentioned earlier, but the progress counting doesn't work well.
6 min ... 1.020%
24 min ... 1.041% (8:21:56 to completion)
40 min ... 1.042% (8:38:22 to completion)
1:09:30 ... completion
(I saw this also with 5.22)
2) More memory needed?
12/06/2006 09:42:53|ralph@home|Message from server: Your computer has only 402116608 bytes of memory; workunit requires 97883392 more bytes
I think there is no need for more memory -:)
Jobs are finishing OK, w/o any errors, in normal run time!
What is needed is to "fix" this misleading message on both projects (Ralph and Rosetta).
ps: I have successfully run Rosetta with 64 MB RAM
*With that little RAM I get only about 70 page faults per second,
but I am getting the above message on PCs with 256 MB physical RAM -:(
Yeah!
Really, this %done indicator is not working smoothly
ps: on Rosetta too,
I had already aborted a WU by hand because of this; I believed that it was stuck -:(
It had about 3 hours of CPU time, at 1.00n%
My preferred run time is 2 hours, and BoincView was predicting 190 hours
of CPU time to completion -;
ps: I really don't want to run a single job for 190 hours :!:
Thanks
____________
  |
|
|
|
|
I think Ralph needs to work on the progress bar area of the program. It went through with no problems; it was just the first time I saw that from this project, so I thought I would say something. But if that is normal, then there is nothing to worry about.
____________
|
|
|
|
|
|
What happened here?
http://ralph.bakerlab.org/result.php?resultid=179470
Runtime of 18600 sec on a 7200 sec preference.
Anders n
____________
|
|
|
|
|
|
An error happened in this result.
http://ralph.bakerlab.org/result.php?resultid=182521
...
<message>
- exit code -1073741819 (0xc0000005)
</message>
...
I fear this happened after the PC came back from suspend mode, but I'm not very sure.
{edit}
Unhandled Exception Detected...
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x0061E868 read attempt to address 0xF8F3E75E
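The negative exit code and the hexadecimal value in that report are the same 32-bit number read two ways, signed and unsigned. A small sketch of the conversion (the helper name is made up for illustration):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF

exit_code = -1073741819  # value copied from the <message> block above
print(hex(to_unsigned32(exit_code)))  # 0xc0000005
```

0xc0000005 is the Windows status code for an access violation, which agrees with the "Access Violation" line in the exception record.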
____________
|
|
|
|
|
|
[deleted]
|
|
|
|
|
|
another one:
http://ralph.bakerlab.org/result.php?resultid=182583
<stderr_txt>
# random seed: 2996939
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
# cpu_run_time_pref: 3600
ERROR:: Exit at: .\dock_structure.cc line:401
</stderr_txt>
____________
|
|
|
|
|
|
Thanks, that's actually a very useful result for us, because the error triggered the display of a detailed stacktrace!
____________
|
|
|
|
|
|
I got an error here.
http://ralph.bakerlab.org/result.php?resultid=184229
Looks like the watchdog did its job.
<core_client_version>5.4.9</core_client_version>
<message>
One or more arguments are invalid (0x80000003) - exit code -2147483645 (0x80000003)
</message>
<stderr_txt>
# random seed: 3013987
# cpu_run_time_pref: 7200
**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 30437.5 seconds. Greater than 4X preferred time: 7200 seconds
**********************************************************************
Anders n
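The "4X preferred time" rule in that watchdog message can be checked with a few lines of Python. This sketch covers only the time limit quoted in the log; the function name is hypothetical, and the real watchdog may apply other conditions as well (the message also mentions a stuck score).

```python
def exceeds_watchdog_limit(cpu_seconds: float, preferred_seconds: float,
                           factor: float = 4.0) -> bool:
    """True if CPU time is past factor x the preferred run time (4X per the log)."""
    return cpu_seconds > factor * preferred_seconds

print(4 * 7200)                                # 28800 seconds allowed
print(exceeds_watchdog_limit(30437.5, 7200))   # True: this result was ended
print(exceeds_watchdog_limit(18600, 7200))     # False: the earlier 18600 sec result
```

By this rule alone, the earlier 18600-second run on a 7200-second preference was long but still under the limit, while this one went past it.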
____________
|
|
|