Message boards : RALPH@home bug list : minirosetta v1.47 bug thread
Author | Message |
---|---|
mtyka Volunteer moderator Project developer Project scientist Send message Joined: 19 Mar 08 Posts: 79 Credit: 0 RAC: 0 |
This was a quick follow-up update to fix an error that snuck into the previous update. See this thread for details: https://ralph.bakerlab.org/forum_thread.php?id=425 This version should no longer produce this error: ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108 called boinc_finish # cpu_run_time_pref: 14400 |
Reeltime Send message Joined: 1 Nov 08 Posts: 1 Credit: 6,349 RAC: 0 |
Not sure if this counts as a bug or not, but my runtime is set to 1 hr and most tasks take just over that, about 65-70 mins. The 1.47 tasks are taking considerably longer; the current one is at 1 hr 33. They run normally up to about 78-80%, then slow down dramatically, then finish somewhere around 90-91%. I don't know if this is worth mentioning or not, so I thought I would :-) Host: 16239 If there is anything I need to check file-wise, let me know; I'm still fairly new to alpha testing. Quick edit: I mentioned this because it is unusual for this project. |
ramostol Send message Joined: 29 Mar 07 Posts: 24 Credit: 31,121 RAC: 0 |
This start is none too good I'm afraid. All cc2_1_8_mammoth-tasks are crashing after about 1 minute of computing. An example: cc2_1_8_mammoth_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_7_6585_1_0 <message> process exited with code 193 (0xc1, -63) </message> <stderr_txt> minirosetta_1.47_i686-apple-darwin(90916,0xa0538fa0) malloc: *** error for object 0x1747d40: Non-aligned pointer being freed (2) *** set a breakpoint in malloc_error_break to debug SIGBUS: bus error |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
This WU and this one that I have also finished seem to take an unusual amount of time. Both of these ones took over 13 hours for just 1 Decoy. My preferences are set to 6 hours. As it took this time to complete a single decoy that is the reason for the long running time. No wonder they are called Mammoth work units. Both completed ok (credit very low for the effort put in, but that is normal for both Ralph and Rosetta). |
Phil Send message Joined: 28 Jan 07 Posts: 5 Credit: 1,206 RAC: 0 |
The Graphics in this one show the following: Interesting, I got a bunch of mammoths now for the same machine but running XP rather than Linux and the display is correct. |
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
Have now had This Task run for 58,307.80 seconds or over 16 hours with the generation of just the 1 decoy. They are getting longer. |
AdeB Send message Joined: 22 Dec 07 Posts: 61 Credit: 161,367 RAC: 0 |
Another long task: over 10 hours for 1 decoy. What surprises me is that BOINC never switched to another project during those 10 hours. There was work for other projects, and [Switch between applications every] is set to 120 minutes. It looks like this task 'hijacked' my PC until it was finished. Should it behave like this? I also saw the strange values for Total Credit and RAC that Phil is describing, also on a Linux PC. AdeB |
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
"It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?" Sometimes it can seem that way. Ralph has short (3 day) deadlines, and so can easily find itself running "at high priority" on the BOINC list. The other way this can happen is that BOINC tries to switch projects at checkpoints to preserve all the work possible (even for those not keeping tasks in memory), and some of these long running models do not take checkpoints. So BOINC was sitting there thinking the task was just 10 min. from being done, and seeing no checkpoint to cut in on, it just kept running it. Another way this can happen is if you rack up debt to Ralph when no work is available. BOINC knows it "owes" time to Ralph and so keeps running it. |
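A toy model of the checkpoint behaviour described above, assuming a scheduler that only switches away from a task once its time slice has elapsed and the task has written a checkpoint during that slice. The function name, its parameters, and the rule itself are simplifications made up for illustration; this is not BOINC's actual scheduling code.

```python
def should_switch(minutes_in_slice: float, switch_interval: float,
                  minutes_since_checkpoint: float) -> bool:
    """Return True if this toy scheduler would preempt the running task.

    Hypothetical rule: the task is preempted only when the slice is used
    up AND a checkpoint was written during the current slice, so that no
    work is lost by switching.
    """
    slice_used_up = minutes_in_slice >= switch_interval
    checkpointed_in_slice = minutes_since_checkpoint <= minutes_in_slice
    return slice_used_up and checkpointed_in_slice

# A task that checkpointed 5 minutes ago can be switched out once the
# 120 minute interval has passed.
print(should_switch(130, 120, 5))             # True
# A task that has not checkpointed in 10 hours keeps running.
print(should_switch(600, 120, float("inf")))  # False
```

Under this simplified rule, a model that never checkpoints is never switched out, which would match what AdeB observed.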
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system. Why is the credit system that Rosetta changed to, and that Ralph was also changed to 6 months to a year ago, now reverting back to benchmarks? Because of this I am no longer getting due value for the time I spend crunching a work unit. I have seen other systems here on Ralph with huge benchmarks compared to mine getting well over a hundred credits for 3 hours of work (114 was one example I saw, for 13,400 seconds of work), when I do 6 or more hours of work and don't get anywhere near as much as they do (from 55 to 90 for 4 to 7 hours). As a result, a number of users don't understand what I am complaining about when I say credit is low at Ralph and Rosetta (for me 10 to 12 cr/h at the moment, down from 14 to 15 a few months ago, which is still low compared to Seti and others), as they are getting up to 30 cr/h. Can this be looked at, please? If I do a 16 hour WU (like the current ones) I get 204 credits, while others do a 3 hour WU and get 114; I don't see the fairness in that. My computers and results are easy to access and open to view. |
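As a quick check of the figures in the post above, the credit rates can be worked out directly. This is only an arithmetic sketch; the function name `credits_per_hour` is made up for illustration, and the inputs are the numbers quoted in the post.

```python
def credits_per_hour(credits: float, cpu_seconds: float) -> float:
    """Return credit earned per hour of CPU time."""
    return credits / (cpu_seconds / 3600.0)

# Figures quoted in the post: 114 credits for 13,400 seconds of work,
# and 204 credits for a 16 hour work unit.
fast_host = credits_per_hour(114, 13_400)
slow_host = credits_per_hour(204, 16 * 3600)

print(f"{fast_host:.1f} cr/h")  # 30.6 cr/h
print(f"{slow_host:.2f} cr/h")  # 12.75 cr/h
```

These rates are in line with the "up to 30 cr/h" and "10 to 12 cr/h" figures mentioned in the post.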
feet1st Send message Joined: 7 Mar 06 Posts: 313 Credit: 116,623 RAC: 0 |
My t328 mammoth is still on model 1 step 931,000 after nearly 17hrs ... and, of course, it's time to reboot to install MS fixes! ...wish me luck! [update] Interesting... it restarted on model 1, step 0 (yes, I waited for it to initialize and start incrementing steps) but with 2hr15min of CPU time on it. So, it's like it did take a checkpoint... only it didn't. Should be an interesting output file! |
Path7 Send message Joined: 11 Feb 08 Posts: 56 Credit: 4,974 RAC: 0 |
Hello all, The next WU ran for about 4 hours, when I had to reboot my PC due to an IE7-update. cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_1_6636_1_0 The WU restarted from 0:00 hours runtime and finished within 4656 seconds (1.29 hours), and generated 1 decoy; valid. Also nice within my runtime preference of 7200 seconds. Why did this WU run for more than 4 hours at its first run? Have a nice day, Path7. |
Stephen Send message Joined: 17 Dec 08 Posts: 3 Credit: 6,566 RAC: 0 |
I'm getting some odd behavior: (1) the CPU timer is sometimes getting reset; (2) I suspended all work units, then unsuspended them, and they all completed immediately. |
Stephen Send message Joined: 17 Dec 08 Posts: 3 Credit: 6,566 RAC: 0 |
To elaborate on the problem: a WU will get to around 85% complete, then progress stays the same and the time to completion stays around 10 minutes. If I suspend all tasks and then resume them, the "stuck" WUs complete. |
zombie67 [MM] Send message Joined: 8 Aug 06 Posts: 75 Credit: 2,396,363 RAC: 6,299 |
"I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system." How so? Your machines claim based on benchmarks. If your benchmarks are not tampered with, then you are getting exactly what you are due. You can't just look at run time. Some machines are faster than others. So a fast machine running 4 hours will have done more work than a slower machine running 4 hours. So the faster machine should be awarded more credits, even though the crunch time is equal. Reno, NV Team: SETI.USA |
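The point about faster machines can be illustrated with a toy model in which the claimed credit scales with both CPU time and the host's benchmark score. This is a sketch, not BOINC's actual formula: the function name and the `credits_per_benchmark_hour` constant are assumptions chosen only to make the comparison readable.

```python
def claimed_credit(cpu_hours: float, benchmark_score: float,
                   credits_per_benchmark_hour: float = 0.01) -> float:
    """Toy model: the claim grows with CPU time and with the host benchmark.

    The scaling constant is a made-up value, not a BOINC parameter.
    """
    return cpu_hours * benchmark_score * credits_per_benchmark_hour

# Two hypothetical hosts crunching for the same 4 hours.
slow = claimed_credit(cpu_hours=4, benchmark_score=1000)
fast = claimed_credit(cpu_hours=4, benchmark_score=2500)
print(slow, fast)
```

In this model the host with 2.5 times the benchmark score claims 2.5 times the credit for the same run time, which is the behaviour the post argues is fair.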
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
"I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system." What I am referring to is not the fact that I am being granted a benchmark score (and no, they are not tampered with, as you can tell by the low figures on my computers). It is the fact that the crediting system on Ralph and Rosetta was no longer based on the Boinc Benchmark value, and therefore I should not be getting the same as claimed. The crediting system is supposed to be based on the number of decoys generated, as well as when the result is returned and the length of processing: the first result returned in a batch gets what it claims, then each one after that gets some form of averaging to arrive at the final amount. At the moment it would appear that all results are getting what they claim, which is not how the Rosetta/Ralph fixed type crediting system was meant to work, unless of course I am somehow returning all my work before anyone else in my batch. This I don't believe, due to my 6 hour run time preference. |
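The scheme described above can be sketched roughly as follows, assuming (as the post does) that the first result returned in a batch is granted its own claim and each later result is granted the average claimed credit per decoy of the earlier results. The function name and data layout are hypothetical, and the project's real server code may well differ in the details.

```python
from typing import List, Tuple

def grant_credits(results: List[Tuple[float, int]]) -> List[float]:
    """Grant credit to results in the order they are returned.

    Each result is (claimed_credit, decoys). The first result is granted
    its own claim; each later result is granted the average claimed
    credit per decoy of all earlier results, times its own decoy count.
    """
    granted = []
    total_claimed = 0.0
    total_decoys = 0
    for claimed, decoys in results:
        if total_decoys == 0:
            # Nothing to average against yet: grant the claim as-is.
            granted.append(claimed)
        else:
            per_decoy = total_claimed / total_decoys
            granted.append(per_decoy * decoys)
        total_claimed += claimed
        total_decoys += decoys
    return granted

# Three made-up results returned in this order: (claimed credit, decoys).
print(grant_credits([(100.0, 2), (30.0, 1), (200.0, 2)]))
```

With a scheme like this, granted credit would equal claimed credit only for the first result in a batch, which is why seeing the two match on every result suggests a different method is in use.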
Conan Send message Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
"to elaborate on the problem:" With these current 'mammoth' work units I too have noticed that they get to a point with around 10 minutes to go and sit there for quite some time. The work units appear to be compiling all data generated before then finishing the task. I have had them run for over 16 hours for just the 1 decoy, and they have finished ok with a valid result. |
zombie67 [MM] Send message Joined: 8 Aug 06 Posts: 75 Credit: 2,396,363 RAC: 6,299 |
Yes, I understand that the credit system changed back to pure benchmark. I noticed that too. But the unique method that used to be used here (and still used on Rosetta) is also benchmark based. It just averages with all the previous claims for that particular test. So in theory, as long as we don't mess with the benchmarks, the awarded credits should be about the same either way. Edit: I'm guessing the method changed back to the default when the server upgrade happened. Reno, NV Team: SETI.USA |
Klimax Send message Joined: 7 Nov 07 Posts: 9 Credit: 11,583 RAC: 0 |
Hello, I have had failures of three lr6_score12_... WUs: https://ralph.bakerlab.org/result.php?resultid=1241954 https://ralph.bakerlab.org/result.php?resultid=1241953 https://ralph.bakerlab.org/result.php?resultid=1241939 Apparently some sort of crash (maybe a bug?). |
Klimax Send message Joined: 7 Nov 07 Posts: 9 Credit: 11,583 RAC: 0 |
Hello, another three (all crashing in the same function): https://ralph.bakerlab.org/result.php?resultid=1241948 https://ralph.bakerlab.org/result.php?resultid=1241947 https://ralph.bakerlab.org/result.php?resultid=1241936 |
©2024 University of Washington
http://www.bakerlab.org