Posts by feet1st

41) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4476)
Posted 22 Jan 2009 by Profile feet1st
Post:
Greg, this is one of the BOINC debug messages. You get them by setting up a cc_config.xml file. In this case you need "checkpoint_debug" set to 1. You also need at least the first three set to 1.
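
For anyone who wants to try it, here is a minimal sketch that generates such a file. The checkpoint_debug flag is the one named above. Reading "the first three" as BOINC's default log flags (task, file_xfer, sched_ops) is an assumption, so check the client configuration documentation for your BOINC version. The helper name build_cc_config is made up for illustration.

```python
import xml.etree.ElementTree as ET

# Assumption: "the first three" means BOINC's default log flags.
ASSUMED_DEFAULT_FLAGS = ["task", "file_xfer", "sched_ops"]


def build_cc_config(extra_flags):
    """Return cc_config.xml text with each listed log flag set to 1."""
    flags = ASSUMED_DEFAULT_FLAGS + list(extra_flags)
    lines = ["<cc_config>", "  <log_flags>"]
    lines += [f"    <{name}>1</{name}>" for name in flags]
    lines += ["  </log_flags>", "</cc_config>"]
    return "\n".join(lines) + "\n"


config_text = build_cc_config(["checkpoint_debug"])

# Parse the result once to confirm it is well-formed XML.
ET.fromstring(config_text)

print(config_text)
```

Save the output as cc_config.xml in the BOINC data directory, then restart the client or have it re-read its config file.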

Otherwise, the checkpoints are pretty transparent. But, in my case, my BOINC data directory is over on my second drive and it cannot spin down when idle now, because it is never idle. Normally, the drive is set to spin down when not in use, and then every 15-30 minutes BOINC kicks in and wants to write something and it spins up to do so, then goes back to sleep. The timer that makes the drive sleep is longer than the time between checkpoints with this new app.

It really should be honoring the BOINC setting for "write at most". I'm not clear why, but many projects do not honor that setting. They take checkpoints as they are able, regardless. ...which was fine, until they started checkpointing every 2 minutes :)
42) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4474)
Posted 21 Jan 2009 by Profile feet1st
Post:
More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My hard drive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace.


1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result


Hmm ok, i'll look into this.


Ya, this afternoon I've been running two tasks for about 4.25hrs and I've got 450 "checkpointed" messages in my messages tab. Sometimes showing two checkpoints on the same task within the same second.

Task names:
test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0
test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0
(both are running v1.50)
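
To put those numbers in perspective, here is a rough back-of-the-envelope calculation. It assumes both tasks ran for the whole 4.25 hours and that the 450 messages are split roughly evenly between them. The 1800 second figure is the "write at most" setting quoted above.

```python
hours_observed = 4.25         # approximate, from the post
tasks_running = 2
checkpoint_messages = 450
write_at_most_seconds = 1800  # the "write to disk at most every" preference

# Total task runtime covered by the observation window.
task_seconds = hours_observed * 3600 * tasks_running

# Average spacing between checkpoints on one task.
seconds_per_checkpoint = task_seconds / checkpoint_messages

# Checkpoints one would see if the preference were honored.
expected_at_preference = task_seconds / write_at_most_seconds

print(f"observed: one checkpoint every {seconds_per_checkpoint:.0f} s per task")
print(f"expected at 1800 s: about {expected_at_preference:.0f} checkpoints, not {checkpoint_messages}")
```

So roughly one checkpoint per task every 68 seconds, against about 17 in total if the 1800 second setting were honored.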
43) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4470)
Posted 21 Jan 2009 by Profile feet1st
Post:
so i take it this test is coming to an end?


Don't rush it! There were still 1000 failures and 3000 successes. I'm not sure what time the last of the DB packaging problems were cleared through. But, you've barely given any time for a 24hr runtime to complete.

Greg. Typically, they release a few tasks, if no clear problems like the DB packaging, and successes start coming in, then they release a few thousand tasks over the course of a day or so. And then, when they've made the final adjustments, explained most of the reported errors and confirmed the results, then they do a final push of 10,000+ tasks (again over the course of a day or two) to really seek out those rare and intermittent problems. THEN send it over to Rosetta.
44) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4460)
Posted 20 Jan 2009 by Profile feet1st
Post:
More checkpointing is great! But... this is a bit extreme. My write to disk at MOST every... setting is at 1800 seconds. My hard drive will never be able to spin down and go into power saver mode all night long if the checkpoints continue at this pace.


1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:32:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:33:14 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:33:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:02 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:34:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:42 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:35:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:35:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:36:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:36:17 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:36:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:36:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:37:24 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:37:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:38:04 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:38:21 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:38:43 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:39:02 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:39:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:39:44 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
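
If you want to measure the spacing rather than eyeball it, here is a small sketch that computes the gap between checkpoints per task. It assumes the messages are pasted in exactly the format shown above (timestamp|project|message), and the function name checkpoint_gaps is made up for illustration. The sample is the first six lines of the log above.

```python
from collections import defaultdict
from datetime import datetime

SAMPLE_LOG = """\
1/20/2009 3:31:58 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:32:28 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:32:39 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:33:14 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
1/20/2009 3:33:22 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t332__IGNORE_THE_REST_1X7OA_6_6823_1_0 checkpointed
1/20/2009 3:34:02 PM|ralph@home|[checkpoint_debug] result test_cc_1_8_nocst4_hb_t342__IGNORE_THE_REST_2G0QA_13_6824_1_0 checkpointed
"""


def checkpoint_gaps(log_text):
    """Map each task name to the seconds between its successive checkpoints."""
    times = defaultdict(list)
    for line in log_text.splitlines():
        parts = line.split("|", 2)
        if len(parts) != 3:
            continue  # not a timestamp|project|message line
        words = parts[2].split()
        # Expect: [checkpoint_debug] result <task name> checkpointed
        if len(words) != 4 or words[0] != "[checkpoint_debug]" or words[3] != "checkpointed":
            continue  # also skips truncated lines
        stamp = datetime.strptime(parts[0], "%m/%d/%Y %I:%M:%S %p")
        times[words[2]].append(stamp)

    gaps = {}
    for task, stamps in times.items():
        stamps.sort()
        gaps[task] = [int((b - a).total_seconds()) for a, b in zip(stamps, stamps[1:])]
    return gaps


gaps = checkpoint_gaps(SAMPLE_LOG)
for task, seconds in gaps.items():
    print(task, seconds)
```

On this sample each task checkpoints every 41 to 48 seconds, nowhere near the 1800 second preference.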
45) Message boards : RALPH@home bug list : minirosetta v1.47 bug thread (Message 4427)
Posted 18 Dec 2008 by Profile feet1st
Post:
My t328 mammoth is still on model 1 step 931,000 after nearly 17hrs ... and, of course, it's time to reboot to install MS fixes! ...wish me luck!

[update]
Interesting... it restarted on model 1, step 0 (yes, I waited for it to initialize and start incrementing steps) but with 2hr15min of CPU time on it. So, it's like it did take a checkpoint... only it didn't. Should be an interesting output file!
46) Message boards : RALPH@home bug list : minirosetta v1.47 bug thread (Message 4425)
Posted 17 Dec 2008 by Profile feet1st
Post:
It looks like this task 'hijacked' my PC until it was finished. Should it behave like this?


Sometimes it can seem that way. Ralph has short (3 day) deadlines, and so can easily find itself running "at high priority" on the BOINC list.

The other way this can happen is that BOINC tries to switch projects at checkpoints to preserve all the work possible (even for those not keeping tasks in memory). And some of these long running models do not take checkpoints. So BOINC was sitting there thinking it was just 10 min. from being done, and seeing no checkpoint to cut in on, so it just kept running it.

Another way this can happen is if you rack up debt to Ralph when no work is available. BOINC knows it "owes" time to Ralph and so keeps running it.
47) Message boards : RALPH@home bug list : minirosetta v1.46 bug thread (Message 4418)
Posted 15 Dec 2008 by Profile feet1st
Post:
I generally don't understand why credit is granted on a model basis

*DO* be sure to discuss the "credit wars" with DK! He'll show you the scars.

The sooner i can convince myself that things are working fine i'll move stuff over to BOINC.

...and this explains why so many bugs make it through to Rosetta! It is EASY to convince yourself your work is perfect. Let's raise the bar a little higher. I set my Ralph runtime preference to 24hrs. SOMEONE needs to test the maximum. And if my task fails, it gets dropped in to an average and by the time my result is returned for analysis which would show that all 24hr runtimes are failing... the work is already out causing problems on Rosetta.

SO we'd have to relaunch *exactly* the same task with the same random seed. Not impossible, but quite a bit of bookkeeping.


No, actually, you'd build a new task, which runs (and runs differently) only one specific model (or a list of specific models). So, it wouldn't use the seed.

We could try and send the longer WUs to faster computers and the shorter ones to slower PCs, that way balancing the load a little bit. I mean the most major problems i think arise when an older machine gets a particularly difficult job.


This would also require changes to the scheduler. And/or definition of subprojects.
48) Message boards : RALPH@home bug list : minirosetta v1.46 bug thread (Message 4415)
Posted 15 Dec 2008 by Profile feet1st
Post:
...as I said, well acquainted with the problem.

The entire credit system is model-centric. We've always maintained that some models are longer than others, but over time it would average out. ...but that was before anyone really had specific data on the degree of variability in per model runtimes. If you could produce a chart (or set of charts for a batch) that would show us more specifically what we're talking about, that would be great.

Many volunteers have very low tolerance for being "ripped off", and not being granted "fair" credit for their work (ask DK about the "credit wars").

some ideas:

Grant more credit for long-running models
I would hope you obfuscated the model runtimes enough in the task results that it would be very difficult to locate and falsify them. Assuming that is the case, it should be feasible to prorate some additional credit for long models, and perhaps a negative adjustment for short models.

Revise the expected per-model runtime guidelines
How can people have any confidence that some of the odd behaviors are "normal", if their observations are outside of the project's own published definition of what normal is. Why not publish details in the active WU log? "...this batch includes the following 20 proteins, with the following observed min/max/avg time seen on Ralph"

Increase the minimum runtime preference allowed
...as already under discussion on Rosetta boards. The longer the runtime, the less likely the user observes any significant variation or "slow down". One of the main arguments against long runtimes seems to be the project wants to get results back ASAP. ...but how often do you actually look at the data within 24hrs of its arrival? So, it COULD have run for 24hrs longer and still been reported in by the time you are actually using the data.

Support trickle-up messages
Why not report each and every model completed in trickle-up messages? That way you get instant results, and reduced scheduler load, and I get minimum download bandwidth because I run the same task for several days if I like.

Support partial completions
Each time a checkpoint is completed, check how we're doing on runtime. If the next checkpoint will exceed the runtime preference, report the task back with its last checkpoint. And devise a process for redistributing that to another host for completion. In other words, move all end of model checking to each checkpoint.

Revise scheduler to be RT aware
You have a pretty good idea (from Ralph results if nothing else) how long models will take for a given protein within a batch. If model runtime exceeds runtime preference for a host, then find another task to assign to it which has shorter models.

Become aggressive cutting off long searches
Rather than throwing away data when you cut off a search, why not build new tasks based on that knowledge? Have them report higher FLOPS estimates, and send out a new task explicitly to pursue this model in far greater detail than you do at present. So the preliminary task will cut earlier than it does at present, and the rework task will cut later.

Support opt-in to specific subprojects
Several people have suggested you grant some control to users as to what they want to crunch. WCG does this. You could make long models, or model rework tasks, be a separate subproject choice. You could make the default for new subprojects be that only users that "opt-in to new" will crunch them unless they explicitly opt in to the specific new subproject. Then over time, only once the error rate reaches a defined low threshold, does the default expand to anyone that has not explicitly opted-out.
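
To make the "Support partial completions" idea concrete, here is a minimal sketch of the check that would run at each checkpoint. Everything in it is hypothetical: the function name, the parameters, and the choice to predict the next checkpoint from the length of the previous gap are all assumptions for illustration, not how minirosetta actually works.

```python
def should_report_early(cpu_seconds_so_far, last_checkpoint_gap_seconds,
                        runtime_preference_seconds):
    """Return True if the next checkpoint is predicted to land past the
    user's runtime preference, so the task should be reported back now
    with its last checkpoint.

    Assumption: the next gap will be about as long as the previous one.
    """
    predicted_next_checkpoint = cpu_seconds_so_far + last_checkpoint_gap_seconds
    return predicted_next_checkpoint > runtime_preference_seconds


HOUR = 3600
preference = 24 * HOUR  # a 24hr runtime preference

# 23.5 hours in, with checkpoints about an hour apart: report back now.
print(should_report_early(23.5 * HOUR, 1 * HOUR, preference))

# 10 hours in: plenty of room, keep crunching.
print(should_report_early(10 * HOUR, 1 * HOUR, preference))
```

The same test could replace the end-of-model runtime check, which is the point of moving that checking to each checkpoint.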

Need more ideas, or clarification, say the word.
49) Message boards : RALPH@home bug list : minirosetta v1.46 bug thread (Message 4411)
Posted 13 Dec 2008 by Profile feet1st
Post:
I'm well familiar with the runtime per task vs per model, and defaults and expectations... in my case, I don't really have what one would call a slower computer, I have plenty of memory, and I have 24hr runtime preference so I am expecting 24hrs total... but 4hrs per model exceeds the guidelines described here.

So, either the guideline should change, or the tasks should change.

The task now has 11.5 hours in and is on model 5. So it got 3 more models done in just over 7 hours. In other words, that first model took twice as long as the others. This will cause significant variation in credit, depending on which model you were assigned.

Has the reporting of runtime on a per model basis now been completed, so the tasks are reporting back that level of detail?
50) Message boards : RALPH@home bug list : minirosetta v1.46 bug thread (Message 4404)
Posted 13 Dec 2008 by Profile feet1st
Post:
It completed model 1 after 4 hours.
51) Message boards : RALPH@home bug list : minirosetta v1.46 bug thread (Message 4403)
Posted 13 Dec 2008 by Profile feet1st
Post:
Wow! I'm only 3 hours in to a 24hr preference and this one has already generated 14.8 million page faults! It is called:
cc2_1_8_mammoth_cst_hb_t315__IGNORE_THE_REST_1YIXA_3_6349_1_0

There also appear to still be scaling problems on the accepted energy... and I'm still on model 1, step 361,000, so it looks like long running models.

Peak memory usage shown is 305MB.

Response time on Rosetta message boards is about 2 minutes right now.
52) Message boards : RALPH@home bug list : minirosetta v1.43 bug thread (Message 4381)
Posted 3 Dec 2008 by Profile feet1st
Post:
Really? Aehm - I'd say 177MB of memory is actually fairly *low*. Over 250 MB i consider a high memory job. I thought the minimum requirement for R@H WUs is 250 MB? David?


256MB is the minimum for the SYSTEM! Not the max for a task. Leave some room for an operating system and a browser window or two there.

177MB is above average, which tends to be closer to 120MB. So, my point is that we're sitting here observing the ~60MB increase, and not knowing if you are already aware of it, or if the tasks have been properly set up to only run on machines with more than the minimum requirement. ...I guess if I had a machine with the minimum memory and saw it, then I would know it needs to be pointed out. But, as it stands, I have no way to tell.
53) Message boards : RALPH@home bug list : minirosetta v1.43 bug thread (Message 4374)
Posted 3 Dec 2008 by Profile feet1st
Post:
P.S. I'm the first to admit that if the graphic is the ONLY problem I can report, then things are looking great!

I'm on WinXP. I tried suspending/resuming tasks and projects, they seem to stop using CPU on command. As they should.
54) Message boards : RALPH@home bug list : minirosetta v1.43 bug thread (Message 4373)
Posted 3 Dec 2008 by Profile feet1st
Post:
Really liking the stats on the homepage!! (down to about a 9% failure rate)

I'm 22hrs in to a 24hr run on this guy:
fast_ramp_0.01_rep_16_rlb_1o4w_IGNORE_THE_REST_DECOY_5787_1_0

First off... wow! 64 models so far, and this is a large protein!

I brought up the graphic... all I see graphed is the RMSD on the right. No energy, and no cross hair of the two. Also... haven't seen any of the red dots of any of the prior 64 models, which seems unlikely to be correct.

She's running with 177MB of memory. I presume it's already classified as a high memory task... but I really wish you could find room in those short little task names to provide some indication of a task's minimum memory. I mean I've got 1GB for an HT processor, can't say I ever have memory problems... and I can end up crunching most anything you put out... but it would be nice if we could see in the WU name that this one only goes to machines with 512MB or whatever, so we know NOT to report "high memory problems" on the task.
55) Message boards : RALPH@home bug list : Ralph v1.42 (Message 4367)
Posted 27 Nov 2008 by Profile feet1st
Post:
These details are what will be needed to release on Rosetta, right? So, if we're testing here, let's be testing the descriptions as well, and if they don't make sense to lay-people, or whatever, we can work through those kinks as well.

Those descriptions are great, but perhaps the summarized version that will appear in the Rosetta news box would be good too: "Brought memory usage down for tasks that were previously using more than normal, reduced per-model runtimes for most of the previously long-running models resulting in more consistent runtimes, additional refinements to the modeling logic...more details here"

What about the problem where suspended tasks keep running? I've not seen it occur here, but is that because it has been addressed? Or, luck of the task draw?
56) Message boards : RALPH@home bug list : Ralph v1.42 (Message 4359)
Posted 25 Nov 2008 by Profile feet1st
Post:
What bugs/issues do you feel you have resolved?

Are there steps we can take to test any of the frequently encountered problems seen on Rosetta?
57) Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.38 (Message 4295)
Posted 24 Oct 2008 by Profile feet1st
Post:
FYI, v1.39 is still showing the issue where BOINC Manager is unable to suspend the task to run other projects. And so I'm still seeing my HT processor running 3 tasks at the same time on Win XP Pro.

I was able to suspend the task, it continued to run. I then resumed it, and of course it continued to run. But then when I suspended it again, it did stop normally.
58) Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.38 (Message 4293)
Posted 24 Oct 2008 by Profile feet1st
Post:
Nearly 38 hrs on a 24hr preferred runtime. Still on model one.

hombench_olange_foldcst_single_alignment_t374_
foldcst_single_alignment_t374__IGNORE_THE_REST_2FCKA_3_5191_5_0
Workunit 1006002
59) Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.38 (Message 4287)
Posted 22 Oct 2008 by Profile feet1st
Post:
Chu! Long time no "see"! Welcome back.

What's with the little red ball?

Make this thread sticky. Maybe unsticky some others.
60) Message boards : RALPH@home bug list : Bug Reports for Minirosetta v1.36 (Message 4269)
Posted 13 Oct 2008 by Profile feet1st
Post:
This one ran 42hrs... still on model 1. I ended and restarted until the Watchdog turned it in.

hombench_mtyka_looprelax_ccd_moves_looprelax_ccd_moves_t328__IGNORE_THE_REST_2CG4A_13_5095_1_0





©2024 University of Washington
http://www.bakerlab.org