minirosetta beta 3.50-3.52 apps

Message boards : RALPH@home bug list : minirosetta beta 3.50-3.52 apps

Mad_Max
Joined: 15 Nov 12
Posts: 12
Credit: 386,146
RAC: 47
Message 5754 - Posted: 20 Jul 2014, 11:48:44 UTC

WUs with names
Tc794_hybrid...
Tc804_symm_hybrid...

have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs are granted only 20 Cr and show "InternalDecoyCount: 0 (GZ)" in the logs.
AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.
ID: 5754
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5755 - Posted: 20 Jul 2014, 11:59:41 UTC

As I said, continuous restarts:
20/07/2014 12:50:54 | ralph@home | Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file

I reset the project, with no result.
ID: 5755
Conan
Joined: 16 Feb 06
Posts: 362
Credit: 1,368,421
RAC: 179
Message 5756 - Posted: 20 Jul 2014, 13:39:24 UTC - in response to Message 5754.  

WUs with names
Tc794_hybrid...
Tc804_symm_hybrid...

have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs are granted only 20 Cr and show "InternalDecoyCount: 0 (GZ)" in the logs.
AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.


Each work unit will run until at least 1 Decoy has been run.
If it takes longer than your set preferences then it keeps going till at least 1 Decoy finishes.
Setting to 2 hours will work most of the time.
The 5-6 hour run times happen to suit my 6 hour preference setting, but even then I have had some go over 7 hours.
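
The rule described above can be sketched roughly as follows. This is only an illustration, not the project's actual scheduling code: it assumes that a new decoy is started whenever the target time has not yet been reached and that a running decoy is never cut short. The function name and the per-decoy durations are hypothetical.

```python
def run_work_unit(decoy_durations_s, target_s):
    """Simulate one work unit under the assumed rule.

    decoy_durations_s: hypothetical CPU seconds each successive decoy needs.
    target_s: the user's target run time preference in seconds.
    Returns (decoys_completed, cpu_seconds_used).
    """
    elapsed = 0
    completed = 0
    for duration in decoy_durations_s:
        # Stop only once at least one decoy is done AND the target is reached.
        if completed >= 1 and elapsed >= target_s:
            break
        elapsed += duration
        completed += 1
    return completed, elapsed


# A single slow decoy overruns a 2-hour target by a wide margin.
print(run_work_unit([20000], 7200))
# Short decoys stop soon after the target is passed.
print(run_work_unit([3000, 3000, 3000, 3000], 7200))
```

Under these assumptions, a 2-hour preference cannot shorten a task whose very first decoy takes 5-6 hours, which matches the run times reported in this thread.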

I have not tested the checkpointing, so I can't say whether the work units restart from zero as you are seeing.


Conan





ID: 5756
sgaboinc
Joined: 8 Jul 14
Posts: 16
Credit: 2,855
RAC: 0
Message 5757 - Posted: 20 Jul 2014, 15:24:06 UTC
Last modified: 20 Jul 2014, 15:48:19 UTC

checkpoint failure
TC804_symm_hybrid, Rosetta mini 3.52 - linux x64 - boinc client 7.0.36
7 concurrent (identical) tasks started and ran for an hour, reaching 25% completion
then i suspended the project and shut down boinc-client (note: done via boinc-gui)

after restarting the project, only 1 of the 7 tasks resumed from 25% completion; the other 6 restarted from 0, in effect losing 6 x 1 hour of work.

i aborted 3 tasks and resumed on half load with 4 tasks

checkpoint preference is set to 60 secs. not sure where the root cause is (boinc-client, the rosetta app, or the parameters used to run the app)

e.g. if no structures have been produced yet and the session is interrupted, does that effectively mean the task/job starts from zero all over again when it is restarted?
hmm, perhaps something to be considered and improved

i'd think rosetta needs to save its state even if no structures have been generated, esp. for such large/complex jobs where there may be no structures at all (i.e. the run did not find a root/solution/model)

the other thing is that if a job hits a 'dead end' (runs for hours without finding solutions, perhaps stuck in endless chaotic loops), there would be no structures, so would no credits be awarded/claimed?

i'd think participants need some influence on the max run time per task: some participants would not be too happy to crunch jobs that run for say 5-6 hours without finding a solution and hence earn no credits. if that is not possible, participants may simply need to abort long jobs that go beyond the 'normal' duration (say, compared to the average of all other jobs)

another way, i'd guess, is to award credits (or honour claimed credits) for 'no solution' (no models) runs where no solution is found within a 'reasonable' timeframe. i guess the default max run time is 6 hours, so the app developers should consider terminating such cases with credits.

however, many participants with a fairly recent, 'fast' cpu may not want to continue a run that has found no solution after 3-4 hours. hence participants need a 'computing preference' to state that their preferred max run time is say 4 hours.
ID: 5757
sgaboinc
Joined: 8 Jul 14
Posts: 16
Credit: 2,855
RAC: 0
Message 5758 - Posted: 20 Jul 2014, 16:42:52 UTC
Last modified: 20 Jul 2014, 17:08:29 UTC

remaining 4 Tc804_symm_hybrid work units completed successfully no errors
http://ralph.bakerlab.org/result.php?resultid=3014838, 4,763.77 cpu secs
http://ralph.bakerlab.org/result.php?resultid=3014871, 4,601.52 cpu secs
http://ralph.bakerlab.org/result.php?resultid=3014874, 4,476.31 cpu secs
http://ralph.bakerlab.org/result.php?resultid=3014875, 8,502.676 cpu secs
cpu time varies, it seems; after all, pseudorandom numbers are involved in the simulations/solution search

*caveat*
the jobs that clocked ~4k cpu secs could be the ones that showed 0% in boinc-client / gui after the shutdown. that may point to a bug (not sure whether in boinc-client or rosetta) in updating the state xml statistics files: boinc-client/gui showed 0% after the restart, yet rosetta probably did save its state, which would explain why 3 jobs appeared to finish in half the time.

if so, the cpu secs statistics are incorrect. those 3 jobs actually ran for about 8k cpu secs; the roughly 4000 cpu secs each had accumulated before the suspend / boinc-client shutdown were 'lost', which looks like a missing statistics update. if rosetta had not checkpointed, the reported figure would have been 8k cpu secs and the actual total 12k cpu secs for those jobs

i guess i'll upgrade my boinc-client to see if that resolves the issue

---
note that this has a major impact on credits claimed / granted, as the 'lost' cpu secs would suggest that the job can be done in 1/2 the cpu secs (i.e. half of the actual 8k), which is *incorrect*
ID: 5758
TPCBF
Joined: 20 Jun 11
Posts: 29
Credit: 24,677
RAC: 39
Message 5759 - Posted: 20 Jul 2014, 21:31:25 UTC - in response to Message 5758.  

Same here: the 4 WUs I picked up on the 17th just keep restarting from 0% over and over again, and each time, at least initially, they thrash the hard drive like crazy...

Is anyone from the project actually around to monitor any responses? Or is Mr. Baker & Cie only available when there's a chance to bask in the limelight?

Ralf
ID: 5759
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5760 - Posted: 21 Jul 2014, 6:54:34 UTC - in response to Message 5759.  

Is anyone from the project actually around to monitor any responses? Or is Mr. Baker & Cie only available when there's a chance to bask in the limelight?


You're too mocking... but you have your reasons. There is a big "lack of communication" with the R@h team.
ID: 5760
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5761 - Posted: 21 Jul 2014, 20:47:19 UTC

Wow, 20 points for over 6h of run :-P
# cpu_run_time_pref: 7200
BOINC:: CPU time: 22084.9s, 14400s + 7200s[2014- 7-21 22:38:48:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22084.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish

</stderr_txt>
]]>

Validate state Valid
Claimed credit 78.0462089549509
Granted credit 20
ID: 5761
sgaboinc
Joined: 8 Jul 14
Posts: 16
Credit: 2,855
RAC: 0
Message 5762 - Posted: 22 Jul 2014, 1:33:10 UTC - in response to Message 5761.  
Last modified: 22 Jul 2014, 1:55:10 UTC

Wow, 20 points for over 6h of run :-P

Validate state Valid
Claimed credit 78.0462089549509
Granted credit 20


based on what i understand from the rosetta@home message boards, the granted credit tends to differ from the claimed credit (it is not necessarily lower) because averages are used. i've observed cases where granted credit > claimed credit

i.e. every participant's PC claims a certain number of credits (this reflects the actual cpu work done); if there is no fraud, the claimed credit is actually *accurate*. however, what's granted is the average

as i posted earlier in this thread, there appear to be bugs in the boinc client: in my case, if i suspend the jobs, shut down the client and restart it later, the statistics for the initial run can be lost. however, rosetta apparently did checkpoint successfully and resumed from the point where it was interrupted. hence, if a task is worth say 100 credits and the shutdown occurs at 99 credits' worth of cpu time, then when i restart a task affected by the boinc client bug, it would complete and claim 1 credit. that would *wrongly* imply that a 100-credit job can be done with 1 credit of effort (this is completely inaccurate)

rosetta should build into the formula a way to reject out-of-band claimed credits for jobs. this could be done by computing the standard deviation and rejecting claims that fall more than one or 1.96 standard deviations (1.96 corresponds to a 95% confidence interval, http://en.wikipedia.org/wiki/1.96) below the average when computing the granted credit. the remaining, higher claimed credits would then be averaged to reflect the true effort needed to complete the tasks
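
The filtering idea above could look roughly like this. It is only a sketch of the suggestion, not the server's actual code: the function name is hypothetical, and it assumes the claims for one batch are available as a plain list of numbers.

```python
from statistics import mean, pstdev


def robust_average_claim(claims, z=1.96):
    """Average claimed credit after dropping claims far below the mean.

    claims: claimed credits reported for tasks of one batch (hypothetical input).
    z: how many standard deviations below the mean a claim may fall.
    """
    if not claims:
        raise ValueError("need at least one claim")
    mu = mean(claims)
    sigma = pstdev(claims)
    if sigma == 0:
        # All claims are identical, so there is nothing to reject.
        return mu
    # Keep everything that is not an exceptionally low outlier.
    kept = [c for c in claims if c >= mu - z * sigma]
    return mean(kept)


claims = [78, 80, 82, 79, 81, 80, 77, 83, 80, 1]
print(mean(claims))                  # plain average, dragged down by the outlier
print(robust_average_claim(claims))  # average with the low outlier rejected
```

With this sample the lone 1-credit claim is dropped and the average of the rest is 80, instead of the plain average of 72.1. Note that with very few claims a single outlier inflates the standard deviation so much that it may not be rejected.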

hope the admins consider that and enhance the server code.
rosetta@home/ralph@home should not be 'stingy' with credits, as they represent *true work done*, and the project as a whole is competing with other boinc projects to show that it is a popular project that gets participants' attention (it is a very good form of free advertising for the project)
ID: 5762
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5763 - Posted: 22 Jul 2014, 8:52:51 UTC - in response to Message 5762.  

i've observed cases where granted credit > claimed credit

I know, I know the granted/claimed "problem". I hope you realize I don't participate for the credits :-)

hope admins consider that and enhance the server codes.

This request is repeated frequently on the Rosetta forum.
Other volunteers have suggested that it is a problem with the customization of the current server. But the admins have not said anything.
ID: 5763
sgaboinc
Joined: 8 Jul 14
Posts: 16
Credit: 2,855
RAC: 0
Message 5764 - Posted: 22 Jul 2014, 15:57:54 UTC - in response to Message 5763.  
Last modified: 22 Jul 2014, 16:19:14 UTC

i've observed cases where granted credit > claimed credit

I know, I know the granted/claimed "problem". I hope you realize I don't participate for the credits :-)

hope admins consider that and enhance the server codes.

This request is repeated frequently on the Rosetta forum.
Other volunteers have suggested that it is a problem with the customization of the current server. But the admins have not said anything.


strictly speaking, i'm speculating that one possible factor for the low credits is an *instance* of (most likely) faulty boinc-client s/w: as statistics are 'lost' on a shutdown/restart, it 'mis-reports' credits to the server. since the claimed credit is much lower after the restart, it lowers the average credit awarded for the task and for any later participants. while i normally ignore this (like you, credits aren't really my reason for crunching rosetta), it may make some participants unhappy about low granted credits, esp. those who pick up the same jobs afterwards. the solution of course is to fix my (one instance of a) faulty boinc client, but i'm just putting in my 2 cents on the 'collateral damage' that others may observe. my guess is that the issue could be partially alleviated on the server side if the server ignored exceptionally low claims when computing the granted credit.

i'm not sure whether there is a better way to award 'credits'; taking an average of the reported claimed credits is after all a good way to measure the work done, statistically averaged across different systems. it's just that this 'simple' points (credit) system is prone to being skewed by 'misbehaving' clients. i guess there really are no perfect solutions

i'll upgrade my client soon; hopefully that will 'fix' some of the statistical issues from my little leaf node
ID: 5764
Mad_Max
Joined: 15 Nov 12
Posts: 12
Credit: 386,146
RAC: 47
Message 5765 - Posted: 22 Jul 2014, 21:49:14 UTC
Last modified: 22 Jul 2014, 22:02:32 UTC

It is NOT "faulty boinc-client s/w" or "statistics 'lost' on a shutdown/restart".

It is faulty Rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all. I have already checked this: intermediate checkpoints in the latest WU batches are not working; it seems only full/finished models are saved to disk. So on each restart ALL the work done before the restart goes in the trash, and the work starts from scratch.
So the BOINC software is right to reset the statistics and credits to zero too, because 0 useful work done = 0 Cr.

Also, some WUs run so long (possibly the algorithm loops forever, or the model is just very difficult to calculate) that even after 5-7 hours of running (without interruptions or restarts) on a modern CPU they cannot finish the very first model (decoy).

In this situation the claimed credit (calculated by the BOINC client) will be normal. But the granted credit is actually 0, because R@H uses this formula for granted credit:
the average claimed credit per decoy (collected and calculated from previous users who reported WUs from the same batch) multiplied by the number of decoys reported in the particular task of a specific user.
So if the decoy count = 0, the granted credit = 0 too.

But later the programmers added an exception: if a user reports a task with decoy count = 0, the general formula (which gives 0 Cr) is not used; instead the WU is rewarded with a fixed 20 Cr as a sort of consolation/booby prize.
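
As a rough sketch of the formula as I described it above (this is my understanding from the forums, not the actual server code; the function and constant names are made up for illustration):

```python
CONSOLATION_CREDIT = 20.0  # fixed award for zero-decoy tasks, as seen in this thread


def granted_credit(avg_claim_per_decoy, decoys_reported):
    """Granted credit for one task under the rule described above.

    avg_claim_per_decoy: average claimed credit per decoy from earlier
        reports in the same batch (hypothetical input).
    decoys_reported: number of decoys this task reported.
    """
    if decoys_reported == 0:
        # The general formula would give 0 Cr, so a fixed amount is used.
        return CONSOLATION_CREDIT
    return avg_claim_per_decoy * decoys_reported


print(granted_credit(26.0, 3))  # normal case: 26.0 * 3
print(granted_credit(26.0, 0))  # zero decoys: fixed consolation credit
```

This is why a task can claim about 78 Cr for 6 hours of CPU time and still be granted exactly 20: the claim depends on CPU time, while the grant depends on the reported decoy count.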
ID: 5765
Mad_Max
Joined: 15 Nov 12
Posts: 12
Credit: 386,146
RAC: 47
Message 5766 - Posted: 22 Jul 2014, 21:59:23 UTC - in response to Message 5756.  

WUs with names
Tc794_hybrid...
Tc804_symm_hybrid...

have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs are granted only 20 Cr and show "InternalDecoyCount: 0 (GZ)" in the logs.
AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.


I have not tested the checkpointing, so I can't say whether the work units restart from zero as you are seeing.

Conan

To roughly check whether checkpoints work, you do not necessarily need to restart.
You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint".
If checkpoint saving is working normally, it shows the time (counted from the start of the task) of the last saved checkpoint. If checkpointing does not work, the line shows "-- --".
Or it shows a time a few hours behind the total CPU time: if the client managed to finish at least one model completely and recorded it on disk, that also counts as a checkpoint, and this part usually works normally.
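
The same check can be written down as a small rule of thumb. This is only a sketch: you read the two values off the "Properties" dialog yourself, and the function name and the 30-minute threshold are my own assumptions, not anything BOINC defines.

```python
def checkpoint_status(cpu_time_s, checkpoint_cpu_time_s, max_gap_s=1800):
    """Classify a running task from the two values in its Properties dialog.

    cpu_time_s: total CPU time of the task so far, in seconds.
    checkpoint_cpu_time_s: "CPU time at last checkpoint" in seconds,
        or None when the dialog shows "-- --".
    max_gap_s: hypothetical threshold for how far behind a checkpoint may lag.
    """
    if checkpoint_cpu_time_s is None or checkpoint_cpu_time_s <= 0:
        return "none"   # no checkpoint written at all
    if cpu_time_s - checkpoint_cpu_time_s > max_gap_s:
        return "stale"  # probably only finished models are being saved
    return "ok"         # intermediate checkpoints look healthy


print(checkpoint_status(4500, None))    # task with "-- --" after 1h15m
print(checkpoint_status(20000, 9000))   # last checkpoint hours behind
print(checkpoint_status(4500, 4440))    # checkpoint one minute behind
```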
ID: 5766
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5767 - Posted: 23 Jul 2014, 6:40:03 UTC - in response to Message 5765.  
Last modified: 23 Jul 2014, 6:41:11 UTC

It is faulty Rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all. I have already checked this: intermediate checkpoints in the latest WU batches are not working; it seems only full/finished models are saved to disk. So on each restart ALL the work done before the restart goes in the trash, and the work starts from scratch.


Yep, I know the decoy/checkpoint "situation".
My problem is that some (not a few) WUs restart during crunching, without any restart of the PC or BOINC Manager, and, I repeat, with this message:
Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file
ID: 5767
cmt.explorer
Joined: 24 Jul 14
Posts: 2
Credit: 95
RAC: 0
Message 5768 - Posted: 24 Jul 2014, 13:18:15 UTC

Greetings,

what is meant in the news post to this thread by "If you have an android arm device/phone that supports android-9, ..."? What should "android-9" be?

I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work. Additionally, I got the message "Rosetta Mini is not available for your type of computer."

Any ideas?

Thanks!
ID: 5768
TPCBF
Joined: 20 Jun 11
Posts: 29
Credit: 24,677
RAC: 39
Message 5769 - Posted: 25 Jul 2014, 2:41:26 UTC - in response to Message 5766.  

To roughly check whether checkpoints work, you do not necessarily need to restart.
You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint".
If checkpoint saving is working normally, it shows the time (counted from the start of the task) of the last saved checkpoint. If checkpointing does not work, the line shows "-- --".
Or it shows a time a few hours behind the total CPU time: if the client managed to finish at least one model completely and recorded it on disk, that also counts as a checkpoint, and this part usually works normally.

The problem with the current checkpoint setting in the WUs is that the recent batch of WUs seems to reset itself a lot, always starting from scratch instead of continuing from the last checkpoint, which is the whole purpose of checkpoints.
As it is currently, a lot of processing power gets wasted this way...

Ralf
ID: 5769
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5770 - Posted: 25 Jul 2014, 6:26:36 UTC - in response to Message 5768.  

What should "android-9" be?

I'm not sure, but I think it's the API level (ApiLevels).

I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work. Additionally, I got the message "Rosetta Mini is not available for your type of computer."
Any ideas?

The first version that runs on Android is 3.53. The current batch is for 3.52, so this message is expected.

For the admins: I know 3.53 is the first version, but can you optimize it, please? This version uses a lot of RAM and a lot of disk space, and restarts continuously...
ID: 5770
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5771 - Posted: 25 Jul 2014, 6:27:04 UTC - in response to Message 5769.  

As it is currently, a lot of processing power gets wasted this way...


+1
ID: 5771
sgaboinc
Joined: 8 Jul 14
Posts: 16
Credit: 2,855
RAC: 0
Message 5772 - Posted: 27 Jul 2014, 15:25:56 UTC - in response to Message 5765.  
Last modified: 27 Jul 2014, 16:11:12 UTC

It is NOT "faulty boinc-client s/w" or "statistics 'lost' on a shutdown/restart".

It is faulty Rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all. I have already checked this: intermediate checkpoints in the latest WU batches are not working; it seems only full/finished models are saved to disk. So on each restart ALL the work done before the restart goes in the trash, and the work starts from scratch.
So the BOINC software is right to reset the statistics and credits to zero too, because 0 useful work done = 0 Cr.



hi Max,
Thanks much for your post, I think i can confirm your observation:
There is no checkpoint !


all 6 concurrent ralph@home tasks had not checkpointed after running for more than an hour.
this is a screen print: the time of last checkpoint is --
and the elapsed time is some 1 hour 15 minutes

compare this to a concurrently running task from rosetta@home


the rosetta@home task is checkpointing well, as indicated by its time of last checkpoint

note that apparently the minirosetta 3.52, 2.53 (beta) binaries running on ralph@home and rosetta@home are the same
http://ralph.bakerlab.org/forum_thread.php?id=557&nowrap=true#5753

note that all these ralph@home and rosetta@home sessions are running concurrently in the same boinc-client (7.0.36)!


is there some error in the job run parameters, or does minirosetta need to be improved so that such complex jobs/tasks checkpoint?
ID: 5772
[VENETO] boboviz
Joined: 9 Apr 08
Posts: 579
Credit: 1,018,369
RAC: 661
Message 5773 - Posted: 28 Jul 2014, 18:29:23 UTC

Validate errors:
3052012
3052013

ID: 5773



©2018 University of Washington
http://www.bakerlab.org