minirosetta v1.47 bug thread

Message boards : RALPH@home bug list : minirosetta v1.47 bug thread

mtyka
Volunteer moderator
Project developer
Project scientist

Joined: 19 Mar 08
Posts: 79
Credit: 0
RAC: 0
Message 4410 - Posted: 13 Dec 2008, 18:42:14 UTC

This was a quick follow-up update to fix an error that snuck into the last update. See this thread for details:
https://ralph.bakerlab.org/forum_thread.php?id=425

This should no longer produce this error:

ERROR: not able to build valid fold-tree in JumpingFoldConstraints::setup_foldtree
ERROR:: Exit from: src/protocols/abinitio/LoopJumpFoldCst.cc line: 108
called boinc_finish
# cpu_run_time_pref: 14400
ID: 4410

Reeltime
Joined: 1 Nov 08
Posts: 1
Credit: 6,349
RAC: 0
Message 4413 - Posted: 14 Dec 2008, 15:16:31 UTC
Last modified: 14 Dec 2008, 15:17:37 UTC

Not sure if this counts as a bug or not, but my runtime is set to 1 hr and most tasks take just over that, about 65-70 mins.

The 1.47 tasks are taking considerably longer. The current one is at 1 hr 33 min.

They run normally up to about 78-80%, then slow down dramatically, then finish somewhere around 90-91%.

Don't know if this is worth mentioning or not, so I thought I would :-)

Host: 16239

If there is anything I need to check file-wise, let me know. I'm still fairly new to alpha testing.

Quick edit: I mentioned this because it is unusual for this project.
ID: 4413

ramostol
Joined: 29 Mar 07
Posts: 24
Credit: 31,121
RAC: 0
Message 4419 - Posted: 16 Dec 2008, 10:26:15 UTC

This start is none too good I'm afraid.

All cc2_1_8_mammoth-tasks are crashing after about 1 minute of computing.

An example:

cc2_1_8_mammoth_fa_cst_hb_t369__IGNORE_THE_REST_1S3QA_7_6585_1_0

<message>
process exited with code 193 (0xc1, -63)
</message>
<stderr_txt>
minirosetta_1.47_i686-apple-darwin(90916,0xa0538fa0) malloc: *** error for object 0x1747d40: Non-aligned pointer being freed (2)
*** set a breakpoint in malloc_error_break to debug
SIGBUS: bus error
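
For anyone decoding the status line in the log above: "193 (0xc1, -63)" is one 8-bit value printed three ways (unsigned decimal, hex, and signed two's complement). The small sketch below reproduces that conversion. The function name is made up for illustration. As far as I know, 193 is also the value BOINC uses for "application exited due to a signal", which would fit the SIGBUS line, but treat that mapping as an assumption rather than something this log confirms.

```python
def exit_code_views(code: int) -> tuple[int, str, int]:
    """Return (unsigned, hex, signed 8-bit) views of a process exit code."""
    unsigned = code & 0xFF                                     # keep only the low byte
    signed = unsigned - 256 if unsigned >= 128 else unsigned   # two's complement view
    return unsigned, hex(unsigned), signed


# The status reported for the crashing mammoth task.
print(exit_code_views(193))
```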
ID: 4419

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4420 - Posted: 16 Dec 2008, 11:39:20 UTC
Last modified: 16 Dec 2008, 11:41:21 UTC

This WU and this one that I have also finished seem to take an unusual amount of time.

Both of these ones took over 13 hours for just 1 Decoy.

My preferences are set to 6 hours.

The long running time comes from a single decoy taking this long to complete.

No wonder they are called Mammoth work units.

Both completed ok (credit very low for the effort put in, but that is normal for both Ralph and Rosetta).
ID: 4420

Phil
Joined: 28 Jan 07
Posts: 5
Credit: 1,206
RAC: 0
Message 4421 - Posted: 16 Dec 2008, 17:55:15 UTC

The Graphics in this one show the following:

Total Credit: -5.6988E-05
RAC 5.3133E-315
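
One thing about these numbers can be checked without access to the machine. The RAC value 5.3133E-315 is far below the smallest normal double, so it is a subnormal whose bit pattern only uses the low 32 bits. Values like that seldom come out of real arithmetic and are often what you see when memory holding something else gets displayed as a double. That is a general observation, not a diagnosis of this bug. A minimal sketch of the check (the helper name is hypothetical):

```python
import struct
import sys


def double_bits(value: float) -> int:
    """Return the raw 64-bit pattern of an IEEE 754 double as an integer."""
    return struct.unpack("<Q", struct.pack("<d", value))[0]


rac = 5.3133e-315                  # RAC value shown in the graphics window
bits = double_bits(rac)

is_subnormal = 0.0 < rac < sys.float_info.min   # below the smallest normal double
upper_word_empty = (bits >> 32) == 0            # only the low 32 bits are set

print(f"bits = 0x{bits:016x}")
print(f"subnormal: {is_subnormal}, upper 32 bits empty: {upper_word_empty}")
```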
ID: 4421

Phil
Joined: 28 Jan 07
Posts: 5
Credit: 1,206
RAC: 0
Message 4422 - Posted: 16 Dec 2008, 22:26:32 UTC - in response to Message 4421.  
Last modified: 16 Dec 2008, 23:01:29 UTC

The Graphics in this one show the following:

Total Credit: -5.6988E-05
RAC 5.3133E-315



Interesting, I got a bunch of mammoths now for the same machine but running XP rather than Linux and the display is correct.
ID: 4422

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4423 - Posted: 17 Dec 2008, 9:52:18 UTC - in response to Message 4420.  

Have now had This Task run for 58,307.80 seconds or over 16 hours with the generation of just the 1 decoy.

They are getting longer.
ID: 4423

AdeB
Joined: 22 Dec 07
Posts: 61
Credit: 161,367
RAC: 0
Message 4424 - Posted: 17 Dec 2008, 19:04:41 UTC

Another long task - over 10 hours for 1 decoy

What surprises me is that BOINC never switched to another project during those 10 hours. There was work for other projects and [Switch between applications every] is set to 120 minutes. It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?

I also saw the strange values for Total Credit and RAC that Phil describes, also on a Linux PC.

AdeB
ID: 4424

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4425 - Posted: 17 Dec 2008, 20:30:47 UTC - in response to Message 4424.  
Last modified: 17 Dec 2008, 20:35:28 UTC

It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?


Sometimes it can seem that way. Ralph has short (3 day) deadlines, and so can easily find itself running "at high priority" on the BOINC list.

The other way this can happen is that BOINC tries to switch projects at checkpoints to preserve all the work possible (even for those not keeping tasks in memory). And some of these long running models do not take checkpoints. So BOINC was sitting there thinking it was just 10 min. from being done, and seeing no checkpoint to cut in on, so it just kept running it.

Another way this can happen is if you rack up debt to Ralph when no work is available. BOINC knows it "owes" time to Ralph and so keeps running it.
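
The three reasons above can be put together as a toy decision rule. This is only a sketch of the behaviour described in this post, not the BOINC client's actual scheduler. All names (`RunningTask`, `keeps_cpu`, the fields) are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RunningTask:
    high_priority: bool            # e.g. a short (3 day) deadline is at risk
    checkpointed_this_slice: bool  # app wrote a checkpoint since it last got the CPU
    project_debt_secs: float       # > 0 means the client "owes" this project CPU time


def keeps_cpu(task: RunningTask, switch_interval_elapsed: bool) -> bool:
    """Toy rule: True if the task stays on the CPU instead of being swapped out."""
    if not switch_interval_elapsed:
        return True    # "switch between applications" interval not reached yet
    if task.high_priority:
        return True    # deadline pressure wins over the switch interval
    if not task.checkpointed_this_slice:
        return True    # wait for a checkpoint so no work is thrown away
    if task.project_debt_secs > 0:
        return True    # the project is still owed time
    return False


# A long model that never checkpoints holds the CPU even after 120 minutes.
mammoth = RunningTask(high_priority=False, checkpointed_this_slice=False,
                      project_debt_secs=0.0)
print(keeps_cpu(mammoth, switch_interval_elapsed=True))
```

Under this rule, a model that runs for 10 hours without a checkpoint is never preempted, which matches what AdeB saw.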
ID: 4425

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4426 - Posted: 18 Dec 2008, 1:18:58 UTC

I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

Why is the credit system that Rosetta changed to, and that Ralph also changed to 6 months to a year ago, now reverting back to Benchmark?

Based on this I am no longer getting due value for the time I spend crunching a work unit.

I have seen other systems here on Ralph with huge benchmarks compared to mine getting well over a hundred credits for 3 hours of work (one example I saw was 114 credits for 13,400 seconds), while I do 6 or more hours of work and get nowhere near as much (from 55 to 90 credits for 4 to 7 hours).

Because of this, a number of users don't understand what I am complaining about when I say credit is low at Ralph and Rosetta. For me it is 10 to 12 cr/h at the moment, down from 14 to 15 a few months ago (which is still low compared to Seti and others), while they are getting up to 30 cr/h.

Can this be looked at please ??

If I do a 16 hour WU (like the current ones) I get 204 credits, while others do a 3 hour WU and get 114. I don't see the fairness in that.
My computers and results are easy to access and open to view.
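
To put the two examples from this post on the same scale, here is the credits-per-hour arithmetic. Only the figures quoted above are used (204 credits for 16 hours, 114 credits for 13,400 seconds). The function name is just for illustration.

```python
def credits_per_hour(credits: float, cpu_seconds: float) -> float:
    """Convert granted credit and CPU time into a credits-per-hour rate."""
    return credits / (cpu_seconds / 3600.0)


long_task = credits_per_hour(204, 16 * 3600)   # 16 hour WU, 204 credits
short_task = credits_per_hour(114, 13_400)     # 13,400 second WU, 114 credits

print(f"{long_task:.2f} cr/h vs {short_task:.2f} cr/h")
```

That works out to roughly 12.75 cr/h against about 30.6 cr/h, which is the gap being complained about.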
ID: 4426

feet1st
Joined: 7 Mar 06
Posts: 313
Credit: 116,623
RAC: 0
Message 4427 - Posted: 18 Dec 2008, 2:03:05 UTC
Last modified: 18 Dec 2008, 2:13:00 UTC

My t328 mammoth is still on model 1 step 931,000 after nearly 17hrs ... and, of course, it's time to reboot to install MS fixes! ...wish me luck!

[update]
Interesting... it restarted on model 1, step 0 (yes, I waited for it to initialize and start incrementing steps) but with 2hr15min of CPU time on it. So, it's like it did take a checkpoint... only it didn't. Should be an interesting output file!
ID: 4427

Path7
Joined: 11 Feb 08
Posts: 56
Credit: 4,974
RAC: 0
Message 4428 - Posted: 18 Dec 2008, 19:19:28 UTC

Hello all,

The following WU ran for about 4 hours before I had to reboot my PC due to an IE7 update:
cc2_1_8_mammoth_mix_cen_cst_hb_t342__IGNORE_THE_REST_2G0QA_1_6636_1_0
The WU restarted from 0:00 hours runtime, finished within 4656 seconds (1.29 hours),
and generated 1 valid decoy. That is also nicely within my runtime preference of 7200 seconds.

Why did this WU run for more than 4 hours at its first run?

Have a nice day,
Path7.

ID: 4428

Stephen
Joined: 17 Dec 08
Posts: 3
Credit: 6,566
RAC: 0
Message 4430 - Posted: 19 Dec 2008, 0:31:16 UTC

I'm getting some odd behavior:

* The CPU timer sometimes gets reset.
* I suspended all work units, then unsuspended them, and they all completed immediately.

ID: 4430

Stephen
Joined: 17 Dec 08
Posts: 3
Credit: 6,566
RAC: 0
Message 4431 - Posted: 19 Dec 2008, 1:48:18 UTC - in response to Message 4430.  

To elaborate on the problem:

A WU will get to around 85% complete and then the progress stays the same, while the time to completion stays at around 10 minutes. If I suspend all tasks and then resume, the "stuck" WUs complete.
ID: 4431

zombie67 [MM]
Joined: 8 Aug 06
Posts: 75
Credit: 2,396,363
RAC: 6,299
Message 4432 - Posted: 19 Dec 2008, 4:04:32 UTC - in response to Message 4426.  

I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

Why is the credit system that Rosetta changed to, and that Ralph also changed to 6 months to a year ago, now reverting back to Benchmark?

Based on this I am no longer getting due value for the time I spend crunching a work unit.


How so? Your machines claim based on benchmarks. If your benchmarks are not tampered with, then you are getting exactly what you are due. You can't just look at run time. Some machines are faster than others. So a fast machine running 4 hours will have done more work than a slower machine running 4 hours. So the faster machine should be awarded more credits, even though the crunch time is equal.
Reno, NV
Team: SETI.USA
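
As a rough illustration of the point about faster machines: a benchmark-based claim scales CPU time by the host's benchmark scores. The sketch below assumes the classic cobblestone definition (100 credits per day of CPU time on a reference host scoring 1,000 on both Whetstone and Dhrystone). Both that scale and the benchmark figures for the two hosts are assumptions for the example and are not taken from this thread.

```python
SECONDS_PER_DAY = 86_400
CREDITS_PER_DAY_ON_REFERENCE_HOST = 100.0   # assumed cobblestone scale
REFERENCE_BENCHMARK = 1_000.0               # assumed reference: 1000 MFLOPS / 1000 MIPS


def claimed_credit(cpu_seconds: float, whetstone_mflops: float,
                   dhrystone_mips: float) -> float:
    """Benchmark-based claim: CPU time weighted by the average benchmark score."""
    average_benchmark = (whetstone_mflops + dhrystone_mips) / 2.0
    days = cpu_seconds / SECONDS_PER_DAY
    return days * CREDITS_PER_DAY_ON_REFERENCE_HOST * average_benchmark / REFERENCE_BENCHMARK


# Two made-up hosts, both crunching for 4 hours.
fast_host = claimed_credit(4 * 3600, whetstone_mflops=2_000, dhrystone_mips=4_000)
slow_host = claimed_credit(4 * 3600, whetstone_mflops=1_000, dhrystone_mips=2_000)

print(f"fast host claims {fast_host:.1f}, slow host claims {slow_host:.1f}")
```

With equal run time, the host with double the benchmark scores claims double the credit.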
ID: 4432

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4434 - Posted: 19 Dec 2008, 13:36:48 UTC - in response to Message 4432.  

I have been doing both Ralph and Rosetta for quite some time now (was even number 1 in Ralph at one time), and I have noticed on Ralph over the last number of batch jobs that the Granted Credit equals the Claimed Credit and seems based on the Boinc Benchmark system.

Why is the credit system that Rosetta changed to, and that Ralph also changed to 6 months to a year ago, now reverting back to Benchmark?

Based on this I am no longer getting due value for the time I spend crunching a work unit.


How so? Your machines claim based on benchmarks. If your benchmarks are not tampered with, then you are getting exactly what you are due. You can't just look at run time. Some machines are faster than others. So a fast machine running 4 hours will have done more work than a slower machine running 4 hours. So the faster machine should be awarded more credits, even though the crunch time is equal.


What I am referring to is not the fact that I am being granted a benchmark score (and no, my benchmarks are not tampered with, as you can tell by the low figures on my computers). It is the fact that the crediting system on Ralph and Rosetta was no longer based on the BOINC benchmark value, and therefore I should not be getting the same as claimed.

The crediting system is supposed to be based on the number of decoys generated, as well as when the result is returned and the length of processing: the first result returned in a batch gets what it claims, then each one after that gets some form of averaging to arrive at the final amount.

At the moment it would appear that all results are getting what they claim, which is not how the Rosetta/Ralph fixed-type crediting system was meant to work,
unless of course I am somehow returning all my work before anyone else in my batch, which I don't believe given my 6 hour run time preference.
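
A minimal sketch of the averaging scheme as described above, assuming that "some form of averaging" means a running average of claimed credit per decoy across the batch. The exact server-side formula is not given in this thread, so the class and its details are hypothetical.

```python
class BatchCredit:
    """Toy model: grant credit per decoy from the batch's running average claim."""

    def __init__(self) -> None:
        self.total_claimed = 0.0
        self.total_decoys = 0

    def grant(self, claimed: float, decoys: int) -> float:
        """Record one returned result (decoys > 0) and return its granted credit."""
        self.total_claimed += claimed
        self.total_decoys += decoys
        average_per_decoy = self.total_claimed / self.total_decoys
        return average_per_decoy * decoys


batch = BatchCredit()
first = batch.grant(claimed=20.0, decoys=2)    # first result back: gets its own claim
second = batch.grant(claimed=60.0, decoys=2)   # later result: pulled toward the average
print(first, second)
```

Under this model only the first result in a batch is granted exactly what it claims. If every result is granted its claim, as reported here, then the averaging step is not being applied.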
ID: 4434

Conan
Joined: 16 Feb 06
Posts: 364
Credit: 1,368,421
RAC: 0
Message 4435 - Posted: 19 Dec 2008, 13:41:15 UTC - in response to Message 4431.  

To elaborate on the problem:

A WU will get to around 85% complete and then the progress stays the same, while the time to completion stays at around 10 minutes. If I suspend all tasks and then resume, the "stuck" WUs complete.


With these current 'mammoth' work units I too have noticed that they get to a point with around 10 minutes to go and sit there for quite some time.

The work units appear to compile all the data generated before finishing the task.
I have had them run for over 16 hours for just 1 decoy and still finish OK with a valid result.
ID: 4435

zombie67 [MM]
Joined: 8 Aug 06
Posts: 75
Credit: 2,396,363
RAC: 6,299
Message 4436 - Posted: 19 Dec 2008, 15:59:31 UTC - in response to Message 4434.  
Last modified: 19 Dec 2008, 16:02:07 UTC

Yes, I understand that the credit system changed back to pure benchmark. I noticed that too. But the unique method that used to be used here (and is still used on Rosetta) is also benchmark based. It just averages each claim with all the previous claims for that particular test. So in theory, as long as we don't mess with the benchmarks, the awarded credits should be about the same either way.

Edit: I'm guessing the method changed back to the default when the server upgrade happened.
Reno, NV
Team: SETI.USA
ID: 4436

Klimax
Joined: 7 Nov 07
Posts: 9
Credit: 11,583
RAC: 0
Message 4440 - Posted: 27 Dec 2008, 6:13:54 UTC

Hello,
Three lr6_score12_... WUs have failed for me:
https://ralph.bakerlab.org/result.php?resultid=1241954
https://ralph.bakerlab.org/result.php?resultid=1241953
https://ralph.bakerlab.org/result.php?resultid=1241939

apparently some sort of crash
(maybe bug?)
ID: 4440

Klimax
Joined: 7 Nov 07
Posts: 9
Credit: 11,583
RAC: 0
Message 4441 - Posted: 27 Dec 2008, 12:18:19 UTC - in response to Message 4440.  

Hello,
Three lr6_score12_... WUs have failed for me:
https://ralph.bakerlab.org/result.php?resultid=1241954
https://ralph.bakerlab.org/result.php?resultid=1241953
https://ralph.bakerlab.org/result.php?resultid=1241939

apparently some sort of crash
(maybe bug?)


Another three (all crashing in the same function):

https://ralph.bakerlab.org/result.php?resultid=1241948
https://ralph.bakerlab.org/result.php?resultid=1241947
https://ralph.bakerlab.org/result.php?resultid=1241936
ID: 4441



©2024 University of Washington
http://www.bakerlab.org