Bug reports for 5.90/5.91

Message boards : RALPH@home bug list : Bug reports for 5.90/5.91

Rhiju
Volunteer moderator
Project developer
Project scientist

Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 3554 - Posted: 20 Dec 2007, 1:24:24 UTC

Update to 5.90! Keep us posted on weirdness.

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3556 - Posted: 20 Dec 2007, 11:04:35 UTC

I am currently running 3 work units of the "1px8" variety.
They are behaving very weirdly.
I checked my computer and saw that only one WU was progressing in time and percentage done (a CPDN WU) and no other counters were moving.
I have 4 cores and Boinc Manager said that my 3 other cores were all running at High Priority on my 3 Ralph work units.
The CPU Time, Percent Done, and Time Left counters were not moving.
In fact, the CPU Time was at 0.00%, as was Percent Done.
Suspending and restarting made no difference.
I stopped Boinc and restarted Boinc Manager, and the RALPH WUs now showed that over an hour had been completed, with between 18% and 30% done.
But the counters still are not moving; only when you stop and restart the manager do you see what progress has been made.

The WUs are still running at the moment:
715097
715153
715237

I checked my system (Linux) and found all 4 cores are running, so the WUs have not hung and are actually processing; it is just that I can't see what is happening.
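
For anyone who wants to confirm from the Linux side that a task really is accumulating CPU time while the Manager counters stay frozen, here is a minimal sketch, not something from the original report. It assumes the standard Linux `/proc/<pid>/stat` layout, where utime and stime are fields 14 and 15 and are counted in clock ticks. The helper names and the sample line are hypothetical.

```python
import os


def cpu_seconds_from_stat(stat_line, ticks_per_second):
    """Return user + system CPU seconds from one /proc/<pid>/stat line."""
    # The command name (field 2) is wrapped in parentheses and may contain
    # spaces, so split on the last ')' before splitting the remaining fields.
    after_comm = stat_line.rsplit(")", 1)[1].split()
    # after_comm[0] is field 3 (state), so fields 14 and 15 (utime, stime)
    # sit at indexes 11 and 12.
    utime_ticks = int(after_comm[11])
    stime_ticks = int(after_comm[12])
    return (utime_ticks + stime_ticks) / ticks_per_second


def read_cpu_seconds(pid):
    """Read the current CPU time of a live process (Linux only)."""
    with open(f"/proc/{pid}/stat") as handle:
        line = handle.read()
    return cpu_seconds_from_stat(line, os.sysconf("SC_CLK_TCK"))


# Hypothetical stat line for illustration: 360000 user ticks, 1200 system ticks.
SAMPLE = "4242 (rosetta beta) R 1 4242 4242 0 -1 0 0 0 0 0 360000 1200 0 0 20 0 1 0 100 0 0"
```

Calling `read_cpu_seconds` twice, a few seconds apart, with the process ID of a Rosetta task and seeing the value grow would show that the task is working even though the display is not.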

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 3558 - Posted: 20 Dec 2007, 18:38:44 UTC - in response to Message 3556.  

Hi Conan: Thanks, this is interesting and unexpected! I forgot to mention on the home page that we also updated the version of the BOINC API that is used to compile Rosetta@home, in anticipation of a new release of the BOINC client software (6.0). We'll look into whether this causes a problem with the reporting of % complete, etc.


Thomas Leibold
Joined: 25 Feb 07
Posts: 27
Credit: 77,464
RAC: 0
Message 3559 - Posted: 21 Dec 2007, 6:04:57 UTC

Rosetta Thread

I have problems with 5.90 in both Ralph and Rosetta (details posted in the Rosetta thread above). Besides the task display, it also appears that the Ralph task is running indefinitely (with a 4-hour runtime configuration, the workunit has already been running for over 24 hours).

BigMike
Joined: 23 Feb 06
Posts: 63
Credit: 58,730
RAC: 0
Message 3560 - Posted: 21 Dec 2007, 6:04:58 UTC

This isn't really a bug, but more of an item that Rosetta@Home users will probably go nuts about. I suspect it has to do with the new BOINC API.

Recent WU's have an interesting line that says ERRORS: CANCELLED.

Any idea what it means? Might be useful to put it in the FAQs if it's always going to say that.

==Mike
Don't believe everything you think.

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3562 - Posted: 21 Dec 2007, 13:38:59 UTC - in response to Message 3558.  

Thanks Rhiju,
My next 4 WUs of the "s099" type are also doing the same thing.
They are running at "High Priority" and showing nothing in Boinc Manager about how they are progressing. This is happening on a different computer.

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 3563 - Posted: 21 Dec 2007, 15:30:51 UTC
Last modified: 21 Dec 2007, 15:32:40 UTC

There appears to be a thread on the Linux CPU time issue in the BOINC Projects list:

(user ID required)
[boinc_projects] CPU Time not updated under linux

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3567 - Posted: 21 Dec 2007, 21:38:38 UTC

It would appear that no matter what time you have in your preferences, it will not be adhered to.
I had 4 Ralph WUs running when I went to bed (no idea how long they had been running, as nothing was showing); all were at high priority.
They were still running this morning, so I stopped BM and restarted it; this made the WUs immediately report and give their stats.
They had run for 19 to 20 hours and probably would have kept running until they were stopped. They produced 11 to 13 decoys.

718960
719023
719030
719087

Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 3568 - Posted: 21 Dec 2007, 21:50:53 UTC - in response to Message 3567.  

The Linux executable should behave better now. Thanks to Conan and others for the "front-lines" warning from Ralph! Let us know how it's working.


Rhiju
Volunteer moderator
Project developer
Project scientist
Joined: 14 Feb 06
Posts: 161
Credit: 3,725
RAC: 0
Message 3569 - Posted: 21 Dec 2007, 21:51:53 UTC - in response to Message 3560.  

I didn't see the message in, e.g., the stderr -- where is it showing up?

Thomas Leibold
Joined: 25 Feb 07
Posts: 27
Credit: 77,464
RAC: 0
Message 3571 - Posted: 22 Dec 2007, 6:16:26 UTC - in response to Message 3559.  
Last modified: 22 Dec 2007, 6:17:08 UTC

Update: the Ralph 5.90 task has now been running for 48.5 hours (over 2 days). Is there anything that will stop those 5.90 tasks from running indefinitely?

What happens when the task/workunit reaches the Report Deadline (other than no longer getting credit for it)? Will that stop those 5.90 workunits, or will they still continue on?

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 3572 - Posted: 22 Dec 2007, 13:07:20 UTC - in response to Message 3571.  

Hello Thomas,
The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats.
None of the ones I had errored out when I did this.

Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time; I have yet to verify this rule or prove it works.



Thomas Leibold
Joined: 25 Feb 07
Posts: 27
Credit: 77,464
RAC: 0
Message 3573 - Posted: 22 Dec 2007, 20:45:20 UTC - in response to Message 3572.  


Hello Thomas,
The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats.

Hi Conan, it appears that restarting Boinc is indeed the only way to dislodge one of those stuck 5.90 workunits. Left unattended, those 5.90 workunits will continue indefinitely! Unfortunately, restarting Boinc is not a complete fix either: as soon as Boinc starts up again, it will process the next 5.90 workunit that it still has in its queue.

I did not have any of the ones I had error by doing this.

Out of around 250 stuck 5.90 workunits, fewer than 10 have made it back to the Rosetta servers on their own (some with 299 or more decoys completed). I don't know why those finished when most others don't, but most of those that did finish failed in the Rosetta validator (presumably the reported cpu time, such as 0.0004 seconds, was too low to be credible?). On the other hand, two of them did get a large amount of credit for the high number of completed decoys.

The two 5.90 workunits on my home server (one Ralph, one Rosetta) both passed the Validator and did get huge amounts of credit when I restarted Boinc on that machine. More importantly, the amount of cpu time reported back to Rosetta was correct for those two workunits! At least part of the Boinc/Rosetta system must be aware of the correct amount of time spent on the workunit and record it properly, so that this amount of cpu time is available on restart.


Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.

I wish that was true, but I have proof to the contrary. That Ralph 5.90 task was running 50 hours which is a lot more than 6 times the preference time of 4 hours.

I believe the only options available to clean up this mess for Linux users are:
Option A:
1.) Where possible, use Boinc Manager to abort all 5.90 workunits that are in "Ready to Start" state. This is necessary so that they will not become stuck too!
2.) Stop and restart the Boinc client. This should immediately send home all those 5.90 workunits that have been in progress and that continued to run beyond the allocated preference time. Those workunits should validate fine and will contribute to the Rosetta Science (the time was not wasted).
3.) Use Boinc Manager to verify that all remaining 5.90 tasks in the system are now in the "Ready to Report" state. Note: if you have a large Boinc cache and/or run multiple Boinc projects there is a possibility of a currently running 5.90 workunit not yet having exceeded the preferred runtime. In this case you will need to restart Boinc again when that workunit has exceeded its preferred run time.
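
The decision behind the three steps of Option A can be written down as a small rule table. This is only an illustrative sketch of the procedure as described, not a tool that talks to Boinc. The function name, the task fields, and the "Running" state label are assumptions made for the example; "Ready to Start" and "Ready to Report" are the states named in the steps.

```python
def triage_task(app_version, state, cpu_hours, preferred_hours):
    """Suggest an Option A action for one task (illustrative only)."""
    if app_version != "5.90":
        # Only the 5.90 application is affected in this sketch.
        return "leave alone"
    if state == "Ready to Start":
        # Step 1: abort queued 5.90 work so it cannot become stuck too.
        return "abort before it starts"
    if state == "Running" and cpu_hours >= preferred_hours:
        # Step 2: a restart should send the overrunning task home.
        return "restart Boinc to report it"
    if state == "Running":
        # Note in step 3: not yet past the preferred runtime.
        return "wait, then restart Boinc again"
    # For example "Ready to Report": already on its way back.
    return "nothing to do"
```

For the Ralph task mentioned earlier (50 hours against a 4-hour preference), `triage_task("5.90", "Running", 50, 4)` falls into the step 2 branch.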

Option B:
If you are unable to use the Boinc Manager gui (perhaps on headless servers in remote locations) you need to reset the Rosetta project. Within the BOINC directory run "./boinc_cmd --project https://boinc.bakerlab.org/rosetta reset". Note: this will remove all workunits and clients for the Rosetta project. This means all completed, but not yet reported work and all work in progress for this project will be lost! The boinc client will perform a fresh load of 5.91 workunits along with the 5.91 client as a result of the reset operation.

Hans Sveen
Joined: 17 Feb 06
Posts: 11
Credit: 386,241
RAC: 51
Message 3579 - Posted: 25 Dec 2007, 14:03:09 UTC

Hello and Merry Christmas!

When I tried Show graphics during a run, I saw something "funny": it showed only the native folding window. The others were empty! Another peculiar behavior is that the accepted Energy line is linear; I have never seen that before!
It seems to be well behaved so far, so this is just for your information.

Screenshot

Thank You!


Hans Sveen
Oslo, Norway

ramostol
Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 3580 - Posted: 27 Dec 2007, 9:30:17 UTC - in response to Message 3572.  

Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.


Actually 4 times its preference time. Observed and proven.

Thomas Leibold
Joined: 25 Feb 07
Posts: 27
Credit: 77,464
RAC: 0
Message 3581 - Posted: 28 Dec 2007, 5:32:47 UTC - in response to Message 3580.  



Actually 4 times its preference time. Observed and proven.


Correct, but probably not in the sense you meant it: it has been proven that the 5.90 workunits (with the defective Linux client) will NOT terminate by themselves, regardless of how much they exceed the preferred run time!

While cleaning up the 5.90 mess I did find that this is apparently not entirely new either. On one server I found a stuck 5.81 workunit that had not only exceeded its preferred run time over ONE HUNDRED TIMES, but also exceeded the workunit deadline by a MONTH!

It would not surprise me if the failure to terminate those kinds of rogue workunits is in some way related to the longstanding issues regarding Ralph/Rosetta watchdog on Linux (e.g.: timing dependencies during watchdog initiated client shutdown, corrupting the memory allocation chain by freeing the same block of memory twice resulting in subsequent segmentation faults which sometimes crashes the watchdog and fails to terminate the client IIRC).

Dr Who Fan
Joined: 2 Sep 06
Posts: 76
Credit: 107,857
RAC: 0
Message 3582 - Posted: 29 Dec 2007, 3:37:04 UTC
Last modified: 29 Dec 2007, 3:38:10 UTC

When this one crashed it caused BOINC to stop working and also triggered the Windows runtime debugger/crash reporter.

Name: 1zpy__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1zpy_-native__2756_503_1

It had a CPU time of 10077.23 seconds


Here is part of the dump info from BOINC:
<core_client_version>6.1.0</core_client_version>
<![CDATA[
<message>
- exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# cpu_run_time_pref: 7200
# random seed: 1564821

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00BA900B read attempt to address 0x12443000

Engaging BOINC Windows Runtime Debugger...
********************

*** Dump of the Process Statistics: ***

- I/O Operations Counters -
Read: 5007, Write: 0, Other 10312

- I/O Transfers Counters -
Read: 0, Write: 186417, Other 0

- Paged Pool Usage -
QuotaPagedPoolUsage: 28848, QuotaPeakPagedPoolUsage: 35360
QuotaNonPagedPoolUsage: 4328, QuotaPeakNonPagedPoolUsage: 4680

- Virtual Memory Usage -
VirtualSize: 351391744, PeakVirtualSize: 356331520

- Pagefile Usage -
PagefileUsage: 256843776, PeakPagefileUsage: 304713728

- Working Set Size -
WorkingSetSize: 42737664, PeakWorkingSetSize: 195563520, PageFaultCount: 944412

*** Dump of thread ID 2892 (state: Waiting): ***

- Information -
Status: Wait Reason: UserRequest, ,

Unhandled Exception Detected...

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x00C08C14 read attempt to address 0x80000000

Engaging BOINC Windows Runtime Debugger...

</stderr_txt>
]]>
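
As a side note on reading the dump above: the negative exit code and the hex value in parentheses are the same 32-bit number, once seen as signed and once as unsigned. A minimal sketch of the conversion follows; the helper name `to_ntstatus` is made up for illustration.

```python
def to_ntstatus(exit_code):
    """Show a signed 32-bit exit code in unsigned hexadecimal form."""
    # Masking with 0xFFFFFFFF reinterprets the two's-complement value
    # as an unsigned 32-bit integer.
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


print(to_ntstatus(-1073741819))  # prints 0xc0000005
```

The result matches the Access Violation code shown in the exception record.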



Here is the dump info from the Windows Crash Report:
<?xml version="1.0" encoding="UTF-16"?>
<DATABASE>
<EXE NAME="rosetta_beta_5.90_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY">
<MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" CHECKSUM="0x57279008" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="08/20/2007 21:12:18" UPTO_LINK_DATE="08/20/2007 21:12:18" />
<MATCHING_FILE NAME="rosetta_beta_5.90_windows_intelx86.exe" SIZE="2615808" CHECKSUM="0x3C8A6BC" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="12/20/2007 00:59:48" UPTO_LINK_DATE="12/20/2007 00:59:48" />
</EXE>
<EXE NAME="kernel32.dll" FILTER="GRABMI_FILTER_THISFILEONLY">
<MATCHING_FILE NAME="kernel32.dll" SIZE="984576" CHECKSUM="0xF0B331F6" BIN_FILE_VERSION="5.1.2600.3119" BIN_PRODUCT_VERSION="5.1.2600.3119" PRODUCT_VERSION="5.1.2600.3119" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.3119 (xpsp_sp2_gdr.070416-1301)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xF9293" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.3119" UPTO_BIN_PRODUCT_VERSION="5.1.2600.3119" LINK_DATE="04/16/2007 15:52:53" UPTO_LINK_DATE="04/16/2007 15:52:53" VER_LANGUAGE="English (United States) [0x409]" />
</EXE>
</DATABASE>

BigMike
Joined: 23 Feb 06
Posts: 63
Credit: 58,730
RAC: 0
Message 3583 - Posted: 1 Jan 2008, 8:00:17 UTC

I don't know if this is a bug or not:

<core_client_version>5.10.30</core_client_version>
<![CDATA[
<stderr_txt>
# cpu_run_time_pref: 3600
# random seed: 1559807
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
======================================================
DONE :: 1 starting structures 4759.66 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================


BOINC :: Watchdog shutting down...
BOINC :: BOINC support services shutting down...
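
One way to read a log like this is to count how often the header lines repeat. The sketch below assumes, without confirmation from the project, that each `# random seed:` line marks one start or restart of the task; the function name and the trimmed excerpt are for illustration only.

```python
def summarize_stderr(text):
    """Count apparent task starts and heartbeat exits in a stderr excerpt."""
    lines = [line.strip() for line in text.splitlines()]
    # Assumption: the application prints its seed once per (re)start.
    starts = sum(1 for line in lines if line.startswith("# random seed:"))
    heartbeat_exits = sum(
        1 for line in lines if line.startswith("No heartbeat from core client")
    )
    return {"starts": starts, "heartbeat_exits": heartbeat_exits}


# Header and heartbeat lines copied from the log above (warnings omitted).
EXCERPT = """\
# cpu_run_time_pref: 3600
# random seed: 1559807
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 3600
# random seed: 1559807
# cpu_run_time_pref: 3600
# random seed: 1559807
# cpu_run_time_pref: 3600
# random seed: 1559807
"""
```

On this excerpt, `summarize_stderr(EXCERPT)` counts four seed lines and one heartbeat exit.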


==Mike
Don't believe everything you think.

Evan
Joined: 23 Dec 07
Posts: 75
Credit: 69,584
RAC: 0
Message 3595 - Posted: 4 Jan 2008, 15:56:26 UTC

I think that there were problems with this one. Looking at the graphics I did notice that it was stuck/calculating for a long period around the 86% mark.



Z094__BOINC_SYMM_FOLD_AND_DOCK_RELAX_ONLY-Z094_-lowres_dock_-dock_3609__2788_1_0
W


**********************************************************************
Rosetta score is stuck or going too long. Watchdog is ending the run!
CPU time: 17003.9 seconds. Greater than 4X preferred time: 3600 seconds
**********************************************************************
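
The arithmetic behind that watchdog message can be checked in a few lines. This is a sketch of the rule as the message states it (CPU time greater than 4X the preferred time), not the actual watchdog code; the function names are hypothetical and the factor of 4 is taken from the message text.

```python
def watchdog_limit_seconds(preferred_seconds, factor=4):
    """CPU-time ceiling implied by the '4X preferred time' message."""
    return factor * preferred_seconds


def should_end_run(cpu_seconds, preferred_seconds, factor=4):
    """True when the CPU time exceeds the ceiling."""
    return cpu_seconds > watchdog_limit_seconds(preferred_seconds, factor)


# Values from the message above: 4 * 3600 = 14400 seconds.
print(watchdog_limit_seconds(3600))
print(should_end_run(17003.9, 3600))
```

By the same rule, the Ralph 5.90 task reported earlier in the thread (50 hours against a 4-hour preference) was far past its 16-hour ceiling, which fits the reports that the check did not take effect for the stuck Linux tasks.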

©2024 University of Washington
http://www.bakerlab.org