Posts by Aegis Maelstrom

1) Message boards : RALPH@home bug list : Minirosetta 1.83-1.85 (Message 4910)
Posted 5 Aug 2009 by Aegis Maelstrom
Post:
Not exactly an error but these WUs took a pretty long time, like nearly 6 hrs each for just one decoy on a Pentium M.

We may call them long-running WUs. I have recently seen such things in Rosetta as well - I guess it's a matter of the new science.

1562765
1562881
1563010
1563093
2) Message boards : RALPH@home bug list : minirosetta 1.90 (Message 4909)
Posted 5 Aug 2009 by Aegis Maelstrom
Post:
Yet another


ERROR: Unable to open weights. Neither ./dslf_weights.wts nor dslf_weights.wts nor minirosetta_database\scoring/weights/dslf_weights.wts exist
ERROR:: Exit from: ..\..\src\core\scoring\ScoreFunctionFactory.cc line: 177
BOINC:: Error reading and gzipping output datafile: default.out
called boinc_finish


It hit me and other crunchers of these WUs.

Task 1
Task 2
Task 3

Others are OK.
3) Message boards : Number crunching : Can't report work, Server Error : Can't attach shared memory (Message 4908)
Posted 5 Aug 2009 by Aegis Maelstrom
Post:
The servers keep going down...

My deadline has passed and I still can't report the job.
4) Message boards : RALPH@home bug list : minirosetta 1.58 (Message 4657)
Posted 2 Feb 2009 by Aegis Maelstrom
Post:
@Feel1st: Hi there, I hadn't noticed it before, but I have the same problem on my widescreen, Win XP SP2, task lr6_E_score12.

It looks as if the top of the Ralph graphics is covered by the title bar (usually blue) of the viewer application.

The same model when displayed as a screensaver (without the title bar) looks O.K.
5) Message boards : RALPH@home bug list : minirosetta v1.55 bug thread (Message 4637)
Posted 1 Feb 2009 by Aegis Maelstrom
Post:
- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005124B3 write attempt to address 0x3FF00000


Yea! Two of us with the same address!

Maybe it is not random after all .... :)


The same here - lr6_D_score12_rlbn_1bm8_IGNORE_THE_REST_NATIVE_7059_5_1.

- Unhandled Exception Record -
Reason: Access Violation (0xc0000005) at address 0x005124B3 write attempt to address 0x3FF00000


Manuel Lupotto above had the same.
Both of us had MiniRosetta ver. 1.56.
6) Message boards : RALPH@home bug list : minirosetta v1.55 bug thread (Message 4623)
Posted 31 Jan 2009 by Aegis Maelstrom
Post:
Version 1.55, Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0.

Repeated "exited with zero status but no 'finished' file" problem.
BOINC logs:

2009-01-31 00:48:10|ralph@home|Starting _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0
2009-01-31 00:48:22|ralph@home|Starting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 01:04:19|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 01:04:19|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 01:05:12|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 01:14:32|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 01:14:32|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 01:14:49|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 01:31:09|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 01:31:09|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 01:32:25|ralph@home|Sending scheduler request: To fetch work. Requesting 1173 seconds of work, reporting 0 completed tasks
(...)
2009-01-31 01:48:42|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 01:48:42|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 01:49:46|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 02:06:39|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 02:06:40|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 02:07:45|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 02:15:11|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 02:15:11|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 02:15:28|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
2009-01-31 02:31:56|ralph@home|Task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 exited with zero status but no 'finished' file
2009-01-31 02:31:56|ralph@home|If this happens repeatedly you may need to reset the project.
2009-01-31 02:31:56|ralph@home|Temporarily failed upload of _CAPRI17_T39_2_.sjf_br_one_docking.protocol__7228_256_0_0: connect() failed
2009-01-31 02:33:18|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
(...)
.

As you can see, quite a waste of computing time. Before that, another CAPRI17 task finished without any visible problems.
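The repetition in the log above can be counted mechanically. Here is a minimal sketch in Python, assuming the lines keep the `timestamp|project|message` layout shown in the log; the function names `parse_line` and `count_failed_exits` are hypothetical helpers and not part of BOINC:

```python
from datetime import datetime

# Message suffix taken verbatim from the BOINC log above.
MARKER = "exited with zero status but no 'finished' file"

def parse_line(line):
    """Split 'timestamp|project|message' into a tuple, or return None."""
    parts = line.strip().split("|", 2)
    if len(parts) != 3:
        return None  # e.g. the "(...)" placeholders
    try:
        stamp = datetime.strptime(parts[0], "%Y-%m-%d %H:%M:%S")
    except ValueError:
        return None
    return stamp, parts[1], parts[2]

def count_failed_exits(lines):
    """Count, per task name, the exits that left no 'finished' file."""
    counts = {}
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        message = parsed[2]
        if message.startswith("Task ") and message.endswith(MARKER):
            task = message[len("Task "):-len(MARKER)].strip()
            counts[task] = counts.get(task, 0) + 1
    return counts

# A short excerpt of the log above, used as sample input.
TASK = "_CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0"
sample = [
    "2009-01-31 01:04:19|ralph@home|Task " + TASK + " " + MARKER,
    "2009-01-31 01:04:19|ralph@home|If this happens repeatedly you may need to reset the project.",
    "2009-01-31 01:05:12|ralph@home|Restarting task " + TASK + " using minirosetta version 155",
    "(...)",
    "2009-01-31 01:14:32|ralph@home|Task " + TASK + " " + MARKER,
]

failed_exits = count_failed_exits(sample)
print(failed_exits)
```

Feeding the whole message log to `count_failed_exits` instead of the excerpt would show how often each task was restarted this way.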

Finally the scheduler closed a time window for RALPH and started another project.

By then the task had run for over 24 minutes and was over 17% complete (although this last number is not really meaningful). Interestingly, boinccmd.exe --get_results reported a current CPU time of 1455 sec, the same as the final CPU time, but a checkpoint CPU time of only 1252 sec.
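The gap between those two figures gives a rough idea of how much work is at risk on each restart. This is a small sketch of the arithmetic only, assuming that a restarted task resumes from the checkpoint CPU time; the function name is hypothetical:

```python
def uncheckpointed_seconds(current_cpu_time, checkpoint_cpu_time):
    """CPU seconds spent since the last checkpoint (never negative)."""
    return max(0, current_cpu_time - checkpoint_cpu_time)

# Figures reported by boinccmd.exe in this post.
current_cpu = 1455
checkpoint_cpu = 1252

lost = uncheckpointed_seconds(current_cpu, checkpoint_cpu)
share = lost / current_cpu

print(f"{lost} sec not covered by a checkpoint ({share:.1%} of the CPU time)")
```

Under that assumption, every restart would repeat the seconds that were not yet checkpointed.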

In the morning I turned RALPH back on just to see what would happen:

2009-01-31 11:34:25|ralph@home|Restarting task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 using minirosetta version 155
2009-01-31 11:34:39|ralph@home|Computation for task _CAPRI17_T39_1_.sjf_br_both_docking.protocol__7228_269_0 finished
.

It finished as a so-called success, but with 2 decoys, low credit and a "Too many restarts with no progress. Keep application in memory while preempted." notice.

I hope it helps.

Best Regards and have a nice weekend!
a.m.
7) Message boards : RALPH@home bug list : minirosetta v1.54 bug thread (Message 4560)
Posted 26 Jan 2009 by Aegis Maelstrom
Post:
Task testD_cc2_1_8_mammoth_mix_cen_cst_hb_t317__IGNORE_THE_REST_2G6ZA_8_7099_1_0.

It reports as a success, but I'm a bit suspicious.

I saw the graphics only once: after a couple of minutes the protein was barely semi-folded, it looked a bit "sketchy" (low detail) and each step took something like a second or two... Also, the percentage complete was, IMO, increasing more slowly than with previous tasks.

When I came back around the tenth minute, there were no graphics, only a blank screen. I tried several times afterwards, both waiting for the screensaver and using the "show graphics" button. Nothing, just a blank black screen.

I checked the memory consumption of the system in the Windows Task Manager; it was high, but still O.K. for MiniRosetta (something like 130 MB of RAM and 170+ of VM Size).

After that I killed the RAM-consuming Firefox and wanted to check the memory, the graphics app etc. - only to see that the WU had suddenly finished something like 2 minutes earlier.

Officially it reports a success and 1 finished decoy - but I am not sure it was really fully folded.
8) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4526)
Posted 24 Jan 2009 by Aegis Maelstrom
Post:
...and now for something completely different.

Mike, do you need more information about the actual behaviour of your prediction models? Just to fine-tune the energy function, folding procedure etc.?

One thing we all know is that some models of a particular WU take much more time than the others. It is clearly seen in "long running WUs", but it happens quite often in WUs with very short models as well. In these cases you need to actually watch the task or have good logs to see that, e.g., most of the models run in 5 minutes and one takes 15-20.

I have such a task right now: testC_cc_1_8_nocst4_hb_t288__IGNORE_THE_REST_2FNEA_2_6976_1_0.
For many minutes the "accepted" screen was showing a very simple chain split in two, a longer and a shorter part, and the shorter one was standing still, showing a fairly folded structure. It took over 400,000 steps to get something semi-folded and more complicated on the "accepted" screen and to go further.
This model (the 12th) took much more time, and as the forecast runtime of a model increased, it was the last decoy allowed by the scheduler.

All of this looked as if the procedure needed much more time to hit something acceptable to start with.

I'm sure you know about this issue, but I have seen such behaviour a couple of times before and just wanted to let you know it is quite common.

P.S. Besides that, everything - except checkpointing, see my posts above - is working like a charm.

P.P.S. Great work Feel1st!
9) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4520)
Posted 24 Jan 2009 by Aegis Maelstrom
Post:
Feel1st, others: could you consider running a similar test to the one I have described above?

Or maybe you would find some further/better tests? I guess there is no better way to see if the checkpointing is really working.

And I am awfully sorry for my subprime English ;) - it's 4:30 a.m. here and I am rushing to my bed. :)
10) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4517)
Posted 24 Jan 2009 by Aegis Maelstrom
Post:
Hi,

currently I'm running sr213_t077_1_NMR_NESG_SAVE_ALL_OUT_6972_24_0.

- Do the graphics behave properly again ?

Yes, it is working - however there was a funny thing.
This WU has a native protein and RMSD - but at one point the RMSD part of the graphics (showing left/right how good the RMSD is) was not working. It was just blank.
The low energy part was O.K.

After turning the graphics off and back on one or two minutes later, everything was just fine.

Regarding checkpointing - according to boinccmd.exe it does make checkpoints every 100+ - 200+ seconds. It is quite close to my preferences (120 sec).

EDIT: I ran an experiment to test the checkpointing: I turned off the client and afterwards turned it back on.

The task resumed at just over 26 minutes (and according to boinccmd the checkpoint was at 1591 seconds, so that fits). However, the WU started from Model 1, step 0, and moved forward from there.
On the other hand, the WU quickly went into SmoothFragmentMover_GunnCost and it didn't look that bad, so I am puzzled: I am not sure whether it really started from the beginning, or whether the checkpointing worked but the step information is misleading.
I think it must be checked at some later step, when the protein is really pretty and the difference is easy to tell - but now I must go to sleep. :D


EDIT 2: I have repeated my experiment and I am pretty sure the checkpointing is not working correctly.

I've stopped the WU after 46 minutes (2524 seconds) when it was quite beautiful (minus three hundred something energy etc.) and restarted.

Yet again it started from step 0, and then it began to behave strangely. There was no really folded protein, only one straight chain - but the program was clearly trying to use methods appropriate to the last stage of the process (namely MoverBase+Minimization). In a minute or so the chain was bent at something like two points and it got "fractalized" by the last step of the procedure (when you have all these small strings moving to get the lowest possible energy).

If you want, I can provide you with a snapshot (printscreen of the graphics).

Everything ended after circa 200 seconds (for a couple of seconds there was "stage: unknown") and got reported.

Enjoy the read.

Interesting thing: it reduced the granted credit; claimed was 7.06, granted: 4.68.

Well, one must suffer for the science. =) And now let me get some sleep...


P.S. The only thing that interfered with the experiment was that the second time I had to halt a regular Rosetta task (I think I had halted Rosetta, but this time the manager didn't remember it after BOINC restarted, huh). I don't know if it could have interfered - if it did, it would mean that some used variables are not being reset.

If it were not for the experiment, I am pretty sure the task would have run all right.
11) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4515)
Posted 24 Jan 2009 by Aegis Maelstrom
Post:
O.K., I got the new 1.53 version - as soon as 1.52 finishes we will see how it behaves. :)

EDIT: So far O.K., and the graphics are working...
12) Message boards : RALPH@home bug list : minirosetta v1.48-1.51 bug thread (Message 4510)
Posted 24 Jan 2009 by Aegis Maelstrom
Post:
Hi Mike, Hi Fellows!

I'm just running Mini 1.52, this task.

Actually, I only found out that the RALPH WU had started because of an alert: BOINC wanted to turn the screensaver on and I got a Windows message about a runtime error and the sudden termination of the 1.40 graphics viewer. The process was killed by the system.

So far, the WU seems to be continuing; I'm at 3.84% (0:11:35) and running.





