Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here
Author | Message |
---|---|
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
On another machine I have an "hblr" 4.99 stuck at one percent, model 1, step 0, stage "initializing". I'd restarted BOINC after getting a fatal error message and seeing this WU at 59 hours and change. It is work unit HBLR_1.0_1di2_377_4. It's still chugging along, but from the model and step numbers I think it's going in circles. CPU time 104:11:17, 2.795% done, 108:17:52 remaining; the graph shows 17 red dots. Stage: full atom relax, Model 1, step 31473. It was at model 1, step 31878 many hours ago. This is not being switched, paused, or removed from memory. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
On another machine I have an "hblr" 4.99 stuck at one percent, model 1, step 0, stage "initializing". I'd restarted BOINC after getting a fatal error message and seeing this WU at 59 hours and change. It is work unit HBLR_1.0_1di2_377_4. OK, I just watched it switch back to Model 1, Step 130. Now it's 2.8155% done and there are NO red dots, but plenty of teal ones, and they're in a completely different pattern from what it just was. CPU time 104:30:12. It had gotten up to model 1, step 31535 (that was the last I saw, and only a few minutes had passed, so it could have gone well beyond that). In the time it took to type this, the steps jumped up to 28000ish. It only stayed in ab initio for maybe a minute or two, and is now in full atom relax, and my red dots are back. I think the scale prevents me from seeing them all yet. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Is switching from ab initio to full atom relax, back to ab initio, then to full atom relax again, and on and on within the same model, normal? |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
On another machine I have an "hblr" 4.99 stuck at one percent, model 1, step 0, stage "initializing". I'd restarted BOINC after getting a fatal error message and seeing this WU at 59 hours and change. It is work unit HBLR_1.0_1di2_377_4. OK, still running: 22 red dots, model 1, step 32190, 114:42:18 CPU time, 3.306% done, 117:51:40 remaining. |
Carlos_Pfitzner Joined: 16 Feb 06 Posts: 182 Credit: 22,792 RAC: 0 |
mmciastro, the current version is Rosetta_beta 5.00. Why are you reporting bugs from old versions here? If you want to finish your old-version jobs, you are free to do so; however, because version 4.99 had obvious bugs, I believe this will only increase your credit loss. Better to abort all of them and help us test the current version, which is available for Windows, Mac and Linux. But do what you think is right -:( PS: If possible, do not post more bugs from *obsolete* versions. This only serves to confuse the developers, who are led to believe that the 1% bug is still there, and thus delays them in making 5.00 the production version of Rosetta. Thanks |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Carlos, have they said they can't use older results to help debug future versions? Have they said "always delete all results after a new version comes out"? I haven't seen that. Hence, I'm reporting it and waiting for further instructions from Mod9/developers. I'd hate to dump it if it can be useful. Maybe they've already found the problem. If so, someone should say something. This is an alpha project. Boinc Alpha wants reports from previous versions. They still have the 4.99 threads listed for use, which says they still want reports or haven't "closed" them yet. Either way, I want someone to tell me if this is useful (see my first and succeeding posts). I will continue to post this until someone says otherwise. I question whether my posts qualify for this thread, but it started as a 1% bug. Mod9 can feel free to move or delete it. All I need is some guidance from management as to how I can best help them. tony (Formerly mmciastro. Name and avatar changed for a change.) The New Online Helpsytem help is just a call away. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Older versions can be helpful. See the following WU (this is very typical as I scan my results page). Result 79398 (host 566): sent 7 Apr 2006 22:24:47 UTC, reported 8 Apr 2006 8:51:57 UTC; server state Over, outcome Client error, client state Computing; CPU time 212.18 s, claimed credit 0.40, granted credit ---. Result 81388 (host 1531): sent 8 Apr 2006 14:53:27 UTC, reported 15 Apr 2006 18:53:05 UTC; server state Over, outcome Client error, client state Computing; CPU time 130.30 s, claimed credit 0.27, granted credit ---. Result 88134 (host 2175): sent 15 Apr 2006 18:53:29 UTC, reported 18 Apr 2006 11:42:31 UTC; server state Over, outcome Success, client state Done; CPU time 15,423.92 s, claimed credit 24.23, granted credit 24.23. Notice how the first two users had "Client error / Computing" (these were "unhandled exceptions"), and yet I completed it successfully? This tells them that the WU itself can be finished and isn't bad in all cases; it's just that some conditional difference exists between the first two users and myself. The question that can help debug becomes "what's different between the first two users and the third?" If I had aborted it, they wouldn't have this info to work with. |
Mike Gelvin Joined: 17 Feb 06 Posts: 50 Credit: 55,397 RAC: 0 |
Carlos, have they said they can't use older results to help debug future versions? Have they said "always delete all results after a new version comes out"? I haven't seen that. Hence, I'm reporting it and waiting for further instructions from Mod9/developers. I'd hate to dump it if it can be useful. Maybe they've already found the problem. If so, someone should say something. This is an alpha project. Boinc Alpha wants reports from previous versions. They still have the 4.99 threads listed for use, which says they still want reports or haven't "closed" them yet. Either way, I want someone to tell me if this is useful (see my first and succeeding posts). I will continue to post this until someone says otherwise. Tony, I agree with you. You never know what the next version is really testing; it might not have anything to do with the 1% bug, and hence your observations and questions are very valid. I suspect they (the devs) need all the help they can get. Mike |
Divide Overflow Joined: 15 Feb 06 Posts: 12 Credit: 128,027 RAC: 0 |
Tony, I agree with you as well. The directive from the devs is *not* to abort work units unless specifically asked to. https://ralph.bakerlab.org/forum_thread.php?id=18 If it's not giving you any problems and is progressing properly, let it crunch! If you have a question or problem about what is currently being crunched, post and ask about it. The devs are smart enough not to get confused about fixed vs. ongoing issues. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Tony, I agree with you as well. The directive from the devs is *not* to abort work units unless specifically asked to. https://ralph.bakerlab.org/forum_thread.php?id=18 David, that's part of the question I need answered. It's a Timex, in that it keeps crunching: graphics work well, all the bits move, % done advances, CPU time advances, and even the "estimate to completion" moves, but it keeps getting higher. This could be due to the way Win98 counts time. I see it run "ab initio", then switch to "full atom relax", then it loops back to "ab initio" and starts all over again, all the while staying on "model 1". Is this how others see it working? I was thinking it did "ab initio", then "full atom relax", and then switched to the next model, but I'm not sure which way is "normal". tony |
tralala Joined: 12 Apr 06 Posts: 52 Credit: 15,257 RAC: 0 |
Tony, I agree with you as well. The directive from the devs is *not* to abort work units unless specifically asked to. https://ralph.bakerlab.org/forum_thread.php?id=18 @mmciastro I think you've made your point; now you should abort the WU. You identified the error loop correctly and that will help the devs, so there is no need to observe that loop for another 100 hours. And yes, version 5.00 is already out, so abort all the 4.99 WUs, since it is now the 5.00 results that matter. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
Thank you. This old machine just keeps chugging and I have no problem letting it continue as long as it may be useful. I don't care if it wants to run 2000 hours. I'm getting interested in whether it'll finish at 100% (that's 100 days from now at this rate, LOL). There is no debug software on that old machine (it's stuffed under an end table in the corner, in an ultra microATX frame, with no mouse, no keyboard and no monitor; it's only viewable via RealVNC). It is hooked to a UPS, but my fear is that memory leaks will be what causes this to stop crunching, and not some other error. It's now at 3.5633% done, 122:54:06, stage Full atom relax, Model 1, Step 34155, 124:39:17 remaining. Oh yeah, there are 24 red dots now (whatever the red dots are). My Ralph prefs: Resource share (if you participate in multiple BOINC projects, this is the proportion of your resources used by RALPH@home): 10; Percentage of CPU time used for graphics: not selected; Number of frames per second for graphics: not selected; Target CPU run time: 4 hours; Should RALPH@home send you email newsletters? yes; Should RALPH@home show your computers on its web site? yes; Default computer location: home. My general prefs, processor usage: Do work while computer is running on batteries? (matters only for portable computers) yes; Do work while computer is in use? yes; Do work only between the hours of: (no restriction); Leave applications in memory while preempted? (suspended applications will consume swap space if 'yes') yes; Switch between applications every (recommended: 60 minutes): 180 minutes; On multiprocessors, use at most: 1 processors. Disk and memory usage: Use no more than 400 GB disk space; Leave at least .25 GB disk space free (values smaller than 0.001 are ignored); Use no more than 85% of total disk space; Write to disk at most every 600 seconds; Use no more than 100% of total virtual memory. Network usage: Connect to network about every (determines size of work cache; maximum 10 days): 3 days; Confirm before connecting to Internet? (matters only if you have a modem, ISDN or VPN connection) no; Disconnect when done? (matters only if you have a modem, ISDN or VPN connection) no; Maximum download rate: 200 KB/s; Maximum upload rate: 200 KB/s; Use network only between the hours of (enforced by versions 4.46 and greater): (no restriction); Skip image file verification? (Check this ONLY if your Internet provider modifies image files; UMTS does this, for example. Skipping verification reduces the security of BOINC.) no |
Divide Overflow Joined: 15 Feb 06 Posts: 12 Credit: 128,027 RAC: 0 |
I had a v4.99 FACONTACTS_NOFILTERS WU that was behaving in a similar manner: not stuck, but computing constantly on the first model with incredibly slow increases in completion %. After much debate, I decided something was wrong and finally aborted it after it had run for over 33 hours and reached only 8% done. It was resent with the v5.00 app to another host and finished successfully in a normal length of time. https://ralph.bakerlab.org/result.php?resultid=86791 Since I was running this on a WinXP machine, I think this problem is specific to the application and not your operating system. |
Astro Joined: 16 Feb 06 Posts: 141 Credit: 32,977 RAC: 0 |
|
Rhiju Volunteer moderator Project developer Project scientist Joined: 14 Feb 06 Posts: 161 Credit: 3,725 RAC: 0 |
Hi guys, we just posted the new ralph app 5.01, and are going to try to break it! I wanted to clarify one point. We *don't* yet have a fix for truly hanging jobs. We do have a rough fix for jobs that are constantly getting interrupted (say when BOINC switches to another project) and restarted without leaving Rosetta@home in memory. If that happens more than 5 times, we have Rosetta exit gracefully! But the more general problem -- if the client doesn't do anything for 10 minutes (or 100 hours as reported below!) -- isn't fixed. YET. Working on it. Please keep posting! Thanks, I aborted it. WU in question |
Rollo Joined: 13 Apr 06 Posts: 4 Credit: 610 RAC: 0 |
This WU got stuck for more than 5 minutes without any movement in the graphics, then crashed before I could abort it. Version 5.00 <core_client_version>5.4.4</core_client_version> <message>Incorrect function. (0x1) - exit code 1 (0x1) </message> <stderr_txt> # random seed: 3887865 # cpu_run_time_pref: 3600 ERROR:: Exit at: .tether.cc line:411 |
Steven Purvis Joined: 1 Mar 06 Posts: 1 Credit: 8,880 RAC: 0 |
I got two which seemed to keep going at 1%-ish (for ralph v4.99): Result 84721 and Result 84734. I allowed Result 84734 to run for 19 hours, whereas I only allowed Result 84721 to run for a couple of hours. The 19-hour result seemed to have terminated itself, as it never seemed to get any more points on the graphic. I have Windows XP and BOINC CC v5.2.7. I have just downloaded a new set of workunits with ralph beta 5.00. |
Dotsch Joined: 4 Mar 06 Posts: 12 Credit: 13,725 RAC: 0 |
https://ralph.bakerlab.org/result.php?resultid=97260 https://ralph.bakerlab.org/workunit.php?wuid=86093 https://ralph.bakerlab.org/show_host_detail.php?hostid=2323 |
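
Rhiju's post above describes a rough fix in which a job that is interrupted and restarted more than 5 times exits gracefully. The following is only an illustrative sketch of that idea, not the actual Rosetta code: the names `RestartGuard`, `record_restart` and `run_with_guard` are hypothetical, and a real client would also have to persist the counter across restarts (for example in a checkpoint file), which is assumed here rather than shown.

```python
from dataclasses import dataclass

# Threshold taken from the post: give up after "more than 5" restarts.
MAX_RESTARTS = 5


@dataclass
class RestartGuard:
    """Hypothetical counter of restarts for a single work unit."""
    max_restarts: int = MAX_RESTARTS
    restarts: int = 0

    def record_restart(self) -> bool:
        """Register one restart; return True while the job may keep running."""
        self.restarts += 1
        return self.restarts <= self.max_restarts


def run_with_guard(restart_events: int, guard: RestartGuard) -> str:
    """Replay a number of restart events and report what the job would do."""
    for _ in range(restart_events):
        if not guard.record_restart():
            # Limit exceeded: stop instead of restarting from scratch again.
            return "exit_gracefully"
    return "keep_running"


if __name__ == "__main__":
    print(run_with_guard(5, RestartGuard()))
    print(run_with_guard(6, RestartGuard()))
```

Note that a counter like this only covers the interrupted-and-restarted case. As the same post says, a job that simply stops making progress without restarting would need a separate check.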
©2024 University of Washington
http://www.bakerlab.org