Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread
Author | Message |
---|---|
Joined: 13 Jan 09 Posts: 100 Credit: 331,865 RAC: 0 |
Both copies of this test_cc 1.50 workunit failed quickly: https://ralph.bakerlab.org/workunit.php?wuid=1105459 |
Evan Joined: 23 Dec 07 Posts: 75 Credit: 69,584 RAC: 0 |
The mammoths have not been having a happy time. I have had a run of 23 failures (1st time) including: unhandled Exception Detected... - Unhandled Exception Record - Reason: Access Violation (0xc0000005) at address 0x0081E942 read attempt to address 0x00000000 and ERROR: unknown model name: 1B0NA_10 ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170 called boinc_finish |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
That was the step shown in the graphic (to the right of the model number). I've just never seen such a large step number. |
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
One more with v1.50: test_cc_1_8_nocst4_hb_t367__IGNORE_THE_REST_1UFBA_5_6830_3 <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> Watchdog active. # cpu_run_time_pref: 14400 ERROR: target_strands.size() ERROR:: Exit from: src/protocols/abinitio/TemplateJumpSetup.cc line: 94 called boinc_finish And one with v.1.51: test_cc2_1_8_mammoth_mix_cen_cst_hb_t311__IGNORE_THE_REST_1PERL_6_6852_1 <core_client_version>6.2.18</core_client_version> <![CDATA[ <message> process exited with code 1 (0x1, -255) </message> <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-22 11:56:36:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Initializing options.... ok Initializing random generators... ok Initialization complete. Watchdog active. ERROR: unknown model name: 1B0NA_10 ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170 called boinc_finish Snags |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I tried loading up the docking task I got. It is 1.51. Displayed graphic just as the task was starting. Waited, waited... finally realized it was using more CPU than the thread working on the protein! Double-checked Ralph settings for % of CPU for the graphic, set to default which is 10%. Here's my screenshot showing the graphic monopolizing one core, while the two running tasks are competing for the other. Net result, nothing shown in the graphic after several minutes, and graphic thread consuming much more than 10% of CPU. [edit] I gave up on it, captured the screen, uploaded the screenshot, reported it here... then when I opened the graphic a second time it was better behaved. Not overusing CPU... but it was essentially unusable. Go to resize or rotate the images and it wouldn't respond for about 30 seconds each time. |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
[quote]I tried loading up the docking task I got. It is 1.51. Displayed graphic just as the task was starting. Waited, waited... finally realized it was using more CPU than the thread working on the protein! Double-checked Ralph settings for % of CPU for the graphic, set to default which is 10%. Here's my screenshot showing the graphic monopolizing one core, while the two running tasks are competing for the other. Net result, nothing shown in the graphic after several minutes, and graphic thread consuming much more than 10% of CPU.[/quote] Feet1st, this is awesome - debugging on an unprecedented level :) Nice to get an idea of what all this looks like from your point of view. These docking tasks are new and not mine - lemme track down the person submitting these and make sure the graphics app can deal with it. Mike |
Joined: 16 Feb 06 Posts: 16 Credit: 39,518 RAC: 0 |
application version 1.51 https://ralph.bakerlab.org/result.php?resultid=1258676 <core_client_version>6.4.5</core_client_version> <![CDATA[ <message> Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> BOINC:: Initializing ... ok. [2009- 1-22 19: 2:51:] :: BOINC :: boinc_init() BOINC:: Setting up shared resources ... ok. BOINC:: Setting up semaphores ... ok. BOINC:: Updating status ... ok. BOINC:: Registering timer callback... ok. BOINC:: Worker initialized successfully. Initializing options.... ok Initializing random generators... ok Initialization complete. Watchdog active. ERROR: unknown model name: 1DK8A_1 ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170 called boinc_finish </stderr_txt> ]]> |
Joined: 14 Jan 09 Posts: 62 Credit: 33,293 RAC: 0 |
[quote]Otherwise this release has added debug information to let me figure out where all this stuff is failing. Believe me guys, we're now in the land where I cannot reproduce these errors here whatsoever. Not on the Linux boxes, Mac boxes or Windows boxes we have. Nowhere. Why these remaining segfaults occur is a total mystery to me, so please bear with me. This is going to be incredibly difficult to track down.[/quote] Questions I ask myself: Are they happening in the same module? Are they happening because of the same type of activity? What is common to all the events? Could it be something external? I know BOINC is supposed to be keeping things isolated and that application A is not supposed to affect application B ... but I have seen enough odd things that I am not convinced that this is completely so ... As to the tasks that fail ... you say they don't fail for you ... Why not grab the set and issue them to us with high replication counts ... then if they fail for everyone, that tells us one thing ... if they only fail for some and not others ... that tells us something else ... To my mind, the cases that fail are the ones that you should be saving and using as your issue tasks for each round of testing. Not to keep making new tasks in all cases, but to use those tasks that have proven their ability to cause a problem. Just some things to consider ... oh, and you are out of work again ... |
HA-SOFT, s.r.o. Joined: 19 Jan 09 Posts: 6 Credit: 19,644 RAC: 0 |
DEP - execution protection on new Intel and AMD cores. How can I turn graphics off? |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
v1.51 ERROR: unknown model name: 2FRHA_10 ERROR:: Exit from: d:boinc_buildminirosetta_windowsminisrcprotocols/abinitio/PairingStatistics.hh line: 170 called boinc_finish on task test_cc2_1_8_mammoth_mix_cen_cst_hb_t327__IGNORE_THE_REST_2F2EA_7_6860_1_1 |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
I've seen a number of other reports of the screen saver/graphic just displaying as a black window. And I had always assumed people just weren't waiting long enough for the display to refresh. It does often take a long time, and I always assumed this was due to the allowed % of CPU time for the graphic. But, on the other hand, this wasn't an issue until about 2 or 3 months ago. I meant to point out that the screenshot shows the graphic thread has used 4:02 of CPU time, but the corresponding Ralph thread is shown with only 2:11 of CPU received so far. So the % CPU shown in the screenshot is only the last interval, but you can see from the totals that the number is roughly what it's been during the entire 4 minutes the task has been running. I was wondering if perhaps the graphic could always display at least its grid lines and text immediately, and perhaps a "protein information being retrieved... please wait" message in the frames. That way at least it would never just be "blank". |
I _ quit Joined: 13 Jan 09 Posts: 44 Credit: 88,562 RAC: 0 |
Cool idea...I would love to see that as well. |
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
test_cc_1_8_nocst4_hb_t327__IGNORE_THE_REST_2FSWA_6_6888_1 and test_cc_1_8_nocst4_hb_t327__IGNORE_THE_REST_2F2EA_10_6888_1_1 both ended with: ERROR: unknown model name: 2FRHA_10 ERROR:: Exit from: src/protocols/abinitio/PairingStatistics.hh line: 170 This time they ran a few minutes instead of a few seconds, claimed to be done instead of declaring a compute error, and received a validate error instead of a client error. Snags |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
We already submit identical jobs to many, many machines - and they only fail *sometimes*. Even when the same random number seed is used.
Again... we're already doing that. But the errors are sporadic in nature, so submitting the same task might not throw an error "this time". I have noticed, though, that some of the problems seem to be highly machine-specific. Like one user will always produce this one kind of error and nothing else. Weird. Something strange about their setup? No idea. |
mtyka Volunteer moderator Project developer Project scientist Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
Thank you all for reporting the problems! The problem divides itself into four parts: first cornering the error, then turning it into an error message, reproducing it here, and finally finding the root of the problem. At any stage there are three types of problem: a) Ones that fail with an error message like this:
These ones we've basically identified. Why they occur is merely a matter of tracking down the route. These ones I have already passed on to the relevant programmers, who are fixing them as we speak. b) Segfaults with traces: - Callstack - Segfault - bad. But at least I get a trace - I can thus look into the code and understand (or guess at) why the program could fail at that point. Then I can turn them into a) type errors.
These are random segfaults that occurred somewhere. I can do virtually nothing from here. If I rerun the *exact same command line* from here, all runs fine. So these ones I can only tentatively bracket with stderr statements. Right now I am mainly trying to turn b) and c) into a). Once an error is in the form of a), it's usually trivial to solve! Anyway... just in case you guys were interested in what I'm trying to do. Oh, I guess there are also errors that are *behavioral* errors. Those are the ones I cannot see from here but you guys have to tell me about. Stuff like the graphics or overrunning models or other strange behaviour. :-) |
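For anyone collecting their own failed results, the a)/b)/c) split described above could be sorted roughly from a task's stderr text. This is only a minimal sketch in Python, not anything the project uses: the marker strings are copied from the logs posted in this thread, while the function names `exit_location` and `classify_stderr` and the matching rules (for example, treating a crash report without the word "callstack" as type c) are assumptions made for illustration.

```python
import re

# Pattern for the "ERROR:: Exit from: <file> line: <n>" line seen in the
# stderr dumps posted in this thread (type a errors).
EXIT_RE = re.compile(r"ERROR:: Exit from: (?P<file>\S+) line: (?P<line>\d+)")

# Strings that suggest a crash. The first two come from the Windows report
# in this thread; "segfault" is an assumed extra marker.
CRASH_MARKERS = ("unhandled exception", "access violation", "segfault")


def exit_location(stderr_text):
    """Return (file, line) from an 'Exit from' message, or None if absent."""
    match = EXIT_RE.search(stderr_text)
    if match is None:
        return None
    return match.group("file"), int(match.group("line"))


def classify_stderr(stderr_text):
    """Sort a stderr dump into the a/b/c types described in the post.

    Assumption: a type b report contains the word 'callstack'; a crash
    report without it is treated as type c.
    """
    lowered = stderr_text.lower()
    if exit_location(stderr_text) is not None:
        return "a"  # clean error message with a source location
    if any(marker in lowered for marker in CRASH_MARKERS):
        return "b" if "callstack" in lowered else "c"
    return "unclassified"  # nothing recognizable in the output
```

Counting the resulting types per host over many failed results would be one way to check the impression that some machines always produce the same kind of error.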
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
Hey, Mike, thanks for all the info you've been giving us lately. It's interesting and certainly provides a bit of extra motivation to keep crunching and reporting what we see at our end. Snags |
franfranlatulipe94@hotmail.com Joined: 12 Dec 08 Posts: 1 Credit: 447 RAC: 0 |
Hi, I'm running RALPH on an up-to-date Ubuntu Jaunty. My BOINC version is 6.2.18. When I click on "show graphics" (or something like that, I'm French), I see a small window, like the XP screenshot in this thread, but all my memory (1.5 Gio) and all my swap (2.0 Gio) fill up… If I don't do it (click on "show graphics"), there isn't any visible problem. Rosetta mini 1.52 http://francois.linkmauve.fr/boinc.png The screenshot shows when the window was killed. If there are questions, ask me… |
Snagletooth Joined: 4 May 07 Posts: 67 Credit: 134,427 RAC: 0 |
Hi, I'm seeing the same or very similar behavior on the same type of workunit: lr6_D_score12_rlbn_1gu3_IGNORE_THE_REST_NATIVE_6909_1, running Leopard on an Intel Mac. No CPU use but tons of memory, and I don't even get the box, so there is nothing to show in a screen capture. Activity Manager says the process has hung, and I killed it without trouble from there. Snags |
tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Mtyka, can you disable graphics for some WUs and see if the error rate goes down? Might help turning c)'s into b)'s. |
Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
FYI, in French "Gio" (gibioctet) is "GiB" (gibibytes), roughly what most people here would call GB |
©2023 University of Washington
http://www.bakerlab.org