Message boards : RALPH@home bug list : Bug reports for 5.90/5.91
Author | Message |
---|---|
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Update to 5.90! Keep us posted on weirdness. |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
I am currently running 3 work units of the "1px8" variety. They are behaving very weirdly. I checked my computer and saw that only one WU was progressing in time and percentage done (a CPDN WU) and no other counters were moving. I have 4 cores and Boinc Manager said that my 3 other cores were all running at High Priority on my 3 Ralph work units. The CPU Time, Percent Done and Time Left counters were not moving. In fact the CPU Time was at 0.00%, as was Percent Done. Suspending and restarting made no difference. I stopped Boinc and restarted Boinc Manager, and the RALPH WU's now showed that over an hour had been completed and that they were between 18% and 30% done. But the counters still are not moving; only when you stop and restart the manager do you see what progress has been made. The WU's are still running at the moment: 715097 715153 715237. I checked my system (Linux) and found all 4 cores are running, so the WU's have not hung and are actually processing; it is just that I can't see what is happening. |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi Conan: Thanks, this is interesting and unexpected! I forgot to mention in the home page that we also updated the version of the BOINC api that is used to compile Rosetta@home, in anticipation of a new release of the BOINC client software (6.0). We'll look into whether this causes a problem with the reporting of % complete, etc. [Quote from Conan: I am currently running 3 work units of the "1px8" variety.] |
Thomas Leibold Send message Joined: 25 Feb 07 Posts: 27 Credit: 77,464 RAC: 0 |
Rosetta Thread I have problems with 5.90 in both Ralph and Rosetta (details posted in the Rosetta thread above). Besides the task display, it also appears that the Ralph task is running indefinitely (with 4 hour runtime configuration the workunit is already running over 24 hours). |
BigMike Send message Joined: 23 Feb 06 Posts: 63 Credit: 58,730 RAC: 0 |
This isn't really a bug, but more of an item that Rosetta@Home users will probably go nuts about. I suspect it has to do with the new BOINC API. Recent WU's have an interesting line that says ERRORS: CANCELLED. Any idea what it means? Might be useful to put it in the FAQs if it's always going to say that. ==Mike Don't believe everything you think. |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
[Quote from Rhiju: Hi Conan: Thanks, this is interesting and unexpected! I forgot to mention in the home page that we also updated the version of the BOINC api that is used to compile Rosetta@home, in anticipation of a new release of the BOINC client software (6.0). We'll look into whether this causes a problem with the reporting of % complete, etc.] Thanks Rhiju. My next 4 WU's of the "s099" type are also doing the same thing: running at "High Priority" and showing nothing in Boinc Manager as to how they are progressing. This is happening on a different computer. |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
There appears to be a thread on the Linux CPU time issue in the BOINC Projects list: (user ID required) [boinc_projects] CPU Time not updated under linux |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
It would appear that no matter what time you have in your preferences it will not be adhered to. I had 4 Ralph WU's running when I went to bed (no idea how long they had been running as nothing was showing), all at high priority. They were still running this morning so I stopped BM and restarted; this made the WU's immediately report and give their stats. They had run for 19 to 20 hours and probably would have kept running till they were stopped. They produced 11 to 13 decoys. 718960 719023 719030 719087 |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
The Linux executable should behave better now. Thanks to Conan and others for the "front-lines" warning from ralph! Let us know how it's working. [Quote from Conan: It would appear that no matter what time you have in your preferences it will not be adhered to.] |
Rhiju Volunteer moderator Project developer Project scientist Send message Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
I didn't see the message in, e.g., the stderr -- where is it showing up? [Quote from BigMike: This isn't really a bug, but more of an item that Rosetta@Home users will probably go nuts about. I suspect it has to do with the new BOINC API.] |
Thomas Leibold Send message Joined: 25 Feb 07 Posts: 27 Credit: 77,464 RAC: 0 |
Rosetta Thread Update: the Ralph 5.90 task has now been running for 48.5 hours (over 2 days). Is there anything that will stop those 5.90 tasks from running indefinitely? What happens when the task/workunit reaches the Report Deadline (other than no longer getting credit for it)? Will that stop those 5.90 workunits or will they still continue on? |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
Rosetta Thread Hello Thomas, The ones that I had only stopped when I stopped Boinc Manager and then restarted it. The Work Units then stopped and uploaded their results, also giving time taken and other stats. None of the ones I had errored out when I did this. Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time; I have yet to verify this rule or prove it works. |
Thomas Leibold Send message Joined: 25 Feb 07 Posts: 27 Credit: 77,464 RAC: 0 |
Hi Conan, it appears that restarting Boinc is indeed the only way to dislodge one of those stuck 5.90 workunits. Left unattended, those 5.90 workunits will continue indefinitely! Unfortunately restarting Boinc is not a complete fix either: as soon as Boinc starts up again it will process the next 5.90 workunit that it still has in its queue.
Out of around 250 stuck 5.90 workunits, fewer than 10 have made it back to the Rosetta servers on their own (some with 299 and more decoys completed). I don't know why those finished when most others don't, but most of those that did finish failed in the Rosetta validator (presumably the reported cpu time, such as 0.0004 seconds, was too little to be credible?). On the other hand, two of them did get a large amount of credit for the high number of completed decoys. The two 5.90 workunits on my home server (one Ralph, one Rosetta) both passed the validator and did get huge amounts of credit when I restarted Boinc on that machine. More importantly, the amount of cpu time reported back to Rosetta was correct for those two workunits! At least part of the Boinc/Rosetta system must be aware of the correct amount of time spent on the workunit and record it properly, so that this amount of cpu time spent is available on restart.
As for the rule that would stop a WU at 6 times its preference time: I wish that was true, but I have proof to the contrary. That Ralph 5.90 task was running 50 hours, which is a lot more than 6 times the preference time of 4 hours. I believe the only options available to clean up this mess for Linux users are:
Option A:
1.) Where possible, use Boinc Manager to abort all 5.90 workunits that are in "Ready to Start" state. This is necessary so that they will not become stuck too!
2.) Stop and restart the Boinc client. This should immediately send home all those 5.90 workunits that have been in progress and that continued to run beyond the allocated preference time. Those workunits should validate fine and will contribute to the Rosetta Science (the time was not wasted).
3.) Use Boinc Manager to verify that all remaining 5.90 tasks in the system are now in the "Ready to Report" state.
Note: if you have a large Boinc cache and/or run multiple Boinc projects, there is a possibility that a currently running 5.90 workunit has not yet exceeded the preferred runtime. In this case you will need to restart Boinc again when that workunit has exceeded its preferred run time.
Option B:
If you are unable to use the Boinc Manager gui (perhaps on headless servers in remote locations) you need to reset the Rosetta project. Within the BOINC directory run "./boinc_cmd --project https://boinc.bakerlab.org/rosetta reset".
Note: this will remove all workunits and clients for the Rosetta project. This means all completed but not yet reported work and all work in progress for this project will be lost! The boinc client will perform a fresh load of 5.91 workunits along with the 5.91 client as a result of the reset operation. |
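Option B above lends itself to scripting on headless hosts. The following is a minimal sketch, assuming the `boinc_cmd` binary name and the `--project <URL> reset` form exactly as quoted in the post; the helper names and the dry-run default are hypothetical and not part of BOINC.

```python
import subprocess
from typing import List

# Project URL as quoted in the post above.
ROSETTA_URL = "https://boinc.bakerlab.org/rosetta"


def build_reset_command(project_url: str, boinc_cmd: str = "./boinc_cmd") -> List[str]:
    """Build the argv for the project reset quoted in the post (hypothetical helper)."""
    return [boinc_cmd, "--project", project_url, "reset"]


def reset_project(project_url: str, boinc_dir: str = ".", dry_run: bool = True) -> List[str]:
    """Print the reset command and run it only when dry_run is False.

    Warning: per the post, a reset discards all unreported and in-progress work.
    """
    cmd = build_reset_command(project_url)
    print("Would run:" if dry_run else "Running:", " ".join(cmd))
    if not dry_run:
        # Run from the BOINC directory, as the post instructs.
        subprocess.run(cmd, cwd=boinc_dir, check=True)
    return cmd
```

Calling `reset_project(ROSETTA_URL)` only prints the command. Passing `dry_run=False` would actually run it, so the dry run is the default given how much work a reset throws away.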
Hans Sveen Send message Joined: 17 Feb 06 Posts: 11 Credit: 386,241 RAC: 51 |
Hello and Merry Christmas! When I tried Show graphics during a run, I saw something "funny": it showed only the native folding window. The others were empty! Another peculiar behavior is that the accepted energy line is linear; I have never seen that before! It seems to be running well behaved so far, so this is just for your information. Screenshot Thank you! Hans Sveen Oslo, Norway |
ramostol Send message Joined: 29 Mar 07 Posts: 24 Credit: 31,121 RAC: 0 |
[Quote from Conan: Another post I read in Rosetta said that there is a rule that will stop the Rosetta WU once it has run for 6 times its preference time, I have yet to verify this rule or prove it works.] Actually 4 times its preference time. Observed and proven. |
Thomas Leibold Send message Joined: 25 Feb 07 Posts: 27 Credit: 77,464 RAC: 0 |
Correct, but probably not in the sense you meant it: it has been proven that the 5.90 workunits (with the defective Linux client) will NOT terminate by themselves, regardless of how much they exceed the preferred run time! While cleaning up the 5.90 mess I found that this is apparently not entirely new either. On one server I found a stuck 5.81 workunit that had not only exceeded its preferred run time over ONE HUNDRED TIMES, but also exceeded the workunit deadline by a MONTH! It would not surprise me if the failure to terminate those kinds of rogue workunits is in some way related to the longstanding issues with the Ralph/Rosetta watchdog on Linux (e.g. timing dependencies during watchdog-initiated client shutdown, and corruption of the memory allocation chain by freeing the same block of memory twice, resulting in subsequent segmentation faults which sometimes crash the watchdog so that it fails to terminate the client, IIRC). |
Dr Who Fan Send message Joined: 2 Sep 06 Posts: 76 Credit: 107,857 RAC: 0 |
When this one crashed it caused BOINC to stop working and also triggered the Windows runtime debugger/crash reporter. Name: 1zpy__BOINC_TWIST_RINGS_TWIST_ANGLE_SYMM_FOLD_AND_DOCK_RELAX-1zpy_-native__2756_503_1 It had a CPU time of 10077.23 seconds. Here is part of the dump info from BOINC: <core_client_version>6.1.0</core_client_version> <![CDATA[ <message> - exit code -1073741819 (0xc0000005) </message> <stderr_txt> # cpu_run_time_pref: 7200 # random seed: 1564821 Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00BA900B read attempt to address 0x12443000 Engaging BOINC Windows Runtime Debugger... ******************** *** Dump of the Process Statistics: *** - I/O Operations Counters - Read: 5007, Write: 0, Other 10312 - I/O Transfers Counters - Read: 0, Write: 186417, Other 0 - Paged Pool Usage - QuotaPagedPoolUsage: 28848, QuotaPeakPagedPoolUsage: 35360 QuotaNonPagedPoolUsage: 4328, QuotaPeakNonPagedPoolUsage: 4680 - Virtual Memory Usage - VirtualSize: 351391744, PeakVirtualSize: 356331520 - Pagefile Usage - PagefileUsage: 256843776, PeakPagefileUsage: 304713728 - Working Set Size - WorkingSetSize: 42737664, PeakWorkingSetSize: 195563520, PageFaultCount: 944412 *** Dump of thread ID 2892 (state: Waiting): *** - Information - Status: Wait Reason: UserRequest, , Unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x00C08C14 read attempt to address 0x80000000 Engaging BOINC Windows Runtime Debugger... 
</stderr_txt> ]]> Here is the dump info from the Windows Crash Report: <?xml version="1.0" encoding="UTF-16"?> <DATABASE> <EXE NAME="rosetta_beta_5.90_windows_intelx86.exe" FILTER="GRABMI_FILTER_PRIVACY"> <MATCHING_FILE NAME="rosetta_5.69_windows_intelx86.exe" SIZE="2570240" CHECKSUM="0x57279008" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="08/20/2007 21:12:18" UPTO_LINK_DATE="08/20/2007 21:12:18" /> <MATCHING_FILE NAME="rosetta_beta_5.90_windows_intelx86.exe" SIZE="2615808" CHECKSUM="0x3C8A6BC" MODULE_TYPE="WIN32" PE_CHECKSUM="0x0" LINKER_VERSION="0x0" LINK_DATE="12/20/2007 00:59:48" UPTO_LINK_DATE="12/20/2007 00:59:48" /> </EXE> <EXE NAME="kernel32.dll" FILTER="GRABMI_FILTER_THISFILEONLY"> <MATCHING_FILE NAME="kernel32.dll" SIZE="984576" CHECKSUM="0xF0B331F6" BIN_FILE_VERSION="5.1.2600.3119" BIN_PRODUCT_VERSION="5.1.2600.3119" PRODUCT_VERSION="5.1.2600.3119" FILE_DESCRIPTION="Windows NT BASE API Client DLL" COMPANY_NAME="Microsoft Corporation" PRODUCT_NAME="Microsoft® Windows® Operating System" FILE_VERSION="5.1.2600.3119 (xpsp_sp2_gdr.070416-1301)" ORIGINAL_FILENAME="kernel32" INTERNAL_NAME="kernel32" LEGAL_COPYRIGHT="© Microsoft Corporation. All rights reserved." VERFILEDATEHI="0x0" VERFILEDATELO="0x0" VERFILEOS="0x40004" VERFILETYPE="0x2" MODULE_TYPE="WIN32" PE_CHECKSUM="0xF9293" LINKER_VERSION="0x50001" UPTO_BIN_FILE_VERSION="5.1.2600.3119" UPTO_BIN_PRODUCT_VERSION="5.1.2600.3119" LINK_DATE="04/16/2007 15:52:53" UPTO_LINK_DATE="04/16/2007 15:52:53" VER_LANGUAGE="English (United States) [0x409]" /> </EXE> </DATABASE> |
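The dump above shows the same failure twice, once as the signed exit code -1073741819 and once as 0xc0000005. A minimal sketch of how the two forms relate, assuming the exit code is a 32-bit value printed as a signed integer; the function names and the one-entry lookup table are hypothetical, and only the code seen in this report is listed.

```python
def exit_code_to_hex(exit_code: int) -> str:
    """Reinterpret a signed 32-bit exit code as an unsigned hex string."""
    return f"0x{exit_code & 0xFFFFFFFF:08x}"


# Only the code that appears in the report above; not a complete table.
KNOWN_CODES = {"0xc0000005": "Access Violation"}


def describe_exit_code(exit_code: int) -> str:
    """Return the hex form plus a label when the code is in KNOWN_CODES."""
    hex_code = exit_code_to_hex(exit_code)
    return f"{hex_code} ({KNOWN_CODES.get(hex_code, 'unknown')})"


print(describe_exit_code(-1073741819))
```

Masking with `0xFFFFFFFF` is what turns the negative number back into the familiar hex status, which is usually the easier form to search for.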
BigMike Send message Joined: 23 Feb 06 Posts: 63 Credit: 58,730 RAC: 0 |
I don't know if this is a bug or not: <core_client_version>5.10.30</core_client_version> <![CDATA[ <stderr_txt> # cpu_run_time_pref: 3600 # random seed: 1559807 No heartbeat from core client for 31 sec - exiting # cpu_run_time_pref: 3600 # random seed: 1559807 WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... # cpu_run_time_pref: 3600 # random seed: 1559807 WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... # cpu_run_time_pref: 3600 # random seed: 1559807 WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... WARNING! Not sure non-ideal rotamers are compatible with symmetry yet... ====================================================== DONE :: 1 starting structures 4759.66 cpu seconds This process generated 1 decoys from 1 attempts ====================================================== BOINC :: Watchdog shutting down... BOINC :: BOINC support services shutting down... ==Mike Don't believe everything you think. |
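A stderr log like the one above is easier to judge once the key numbers are pulled out. The following is a rough sketch, assuming the line formats seen in this post; the function name and the sample text (an abbreviated copy of the log) are hypothetical, and treating each `cpu_run_time_pref` header as one start of the task is an assumption rather than documented behavior.

```python
import re
from typing import Dict

# Abbreviated copy of the log above, used only as sample input.
SAMPLE = """# cpu_run_time_pref: 3600
# random seed: 1559807
No heartbeat from core client for 31 sec - exiting
# cpu_run_time_pref: 3600
# random seed: 1559807
WARNING! Not sure non-ideal rotamers are compatible with symmetry yet...
# cpu_run_time_pref: 3600
# random seed: 1559807
# cpu_run_time_pref: 3600
# random seed: 1559807
DONE :: 1 starting structures 4759.66 cpu seconds
This process generated 1 decoys from 1 attempts
"""


def summarize_stderr(text: str) -> Dict[str, object]:
    """Extract run-time preference, CPU time and decoy counts from stderr text."""
    prefs = re.findall(r"cpu_run_time_pref:\s*(\d+)", text)
    cpu = re.search(r"(\d+(?:\.\d+)?)\s+cpu seconds", text)
    decoys = re.search(r"generated\s+(\d+)\s+decoys from\s+(\d+)\s+attempts", text)
    return {
        # Assumption: one header block is printed per (re)start of the task.
        "starts": len(prefs),
        "cpu_run_time_pref": int(prefs[-1]) if prefs else None,
        "cpu_seconds": float(cpu.group(1)) if cpu else None,
        "decoys": int(decoys.group(1)) if decoys else None,
        "attempts": int(decoys.group(2)) if decoys else None,
        "heartbeat_lost": "No heartbeat from core client" in text,
    }


print(summarize_stderr(SAMPLE))
```

The patterns use `\s` rather than line anchors, so they match the single-line form in the post as well as a normal multi-line log.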
Evan Send message Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
I think that there were problems with this one. Looking at the graphics I did notice that it was stuck/calculating for a long period around the 86% mark. Z094__BOINC_SYMM_FOLD_AND_DOCK_RELAX_ONLY-Z094_-lowres_dock_-dock_3609__2788_1_0 W ********************************************************************** Rosetta score is stuck or going too long. Watchdog is ending the run! CPU time: 17003.9 seconds. Greater than 4X preferred time: 3600 seconds ********************************************************************** |
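The watchdog message above states its cutoff as 4 times the preferred run time. A minimal sketch of that comparison, assuming a plain "CPU time greater than factor times preference" test; the function name is hypothetical and the real watchdog logic is not shown anywhere in this thread.

```python
def exceeds_watchdog_limit(cpu_seconds: float, preferred_seconds: float,
                           factor: float = 4.0) -> bool:
    """Return True when CPU time is past factor x the preferred run time."""
    return cpu_seconds > factor * preferred_seconds


# Evan's task: 17003.9 s against a 3600 s preference (limit 14400 s).
print(exceeds_watchdog_limit(17003.9, 3600))

# Thomas's stuck Ralph 5.90 task: 50 hours against a 4 hour preference.
print(exceeds_watchdog_limit(50 * 3600, 4 * 3600))
```

Both cases are over the limit by this test, which fits the reports: the watchdog ended Evan's run, while the stuck 5.90 task on Linux kept running even though it was well past the same threshold.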
©2024 University of Washington
http://www.bakerlab.org