minirosetta beta 3.50-3.52 apps

Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0	Message 5754 - Posted: 20 Jul 2014, 11:48:44 UTC
WUs with names Tc794_hybrid... and Tc804_symm_hybrid... have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs grant only 20 Cr and have "InternalDecoyCount: 0 (GZ)" in the logs. AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5755 - Posted: 20 Jul 2014, 11:59:41 UTC
As I said, continuous restarts:
20/07/2014 12:50:54 | ralph@home | Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file
I reset the project, without result.

Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0	Message 5756 - Posted: 20 Jul 2014, 13:39:24 UTC - in response to Message 5754.
"WUs with names Tc794_hybrid... and Tc804_symm_hybrid... have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours). Also, some of these WUs grant only 20 Cr and have 'InternalDecoyCount: 0 (GZ)' in the logs. AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU."
Each work unit will run until at least 1 decoy has been completed. If that takes longer than your preferred run time, it keeps going until at least 1 decoy finishes. Setting it to 2 hours will work most of the time.
The 5-6 hour run times happen to suit my 6 hour preference setting, but even then I have had some go over 7 hours.
I have not tested the checkpointing, so I can't say whether the work units restart from zero as you are seeing.
Conan
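
The run-time behaviour Conan describes can be sketched roughly as below. This is only an illustration of the rule "always finish at least one decoy, then stop once the target time has passed"; the function name `total_runtime` and the per-decoy durations are hypothetical, and the real minirosetta scheduling logic is not shown in this thread and may differ.

```python
def total_runtime(decoy_durations_s, target_s):
    """Return (elapsed_seconds, decoys_finished) for one task.

    Assumed rule (from the post, simplified): a new decoy is started
    only while the target time has not been reached, but the first
    decoy is always run to completion, however long it takes.
    """
    elapsed = 0.0
    finished = 0
    for duration in decoy_durations_s:
        # Stop only if at least one decoy is done AND the target is reached.
        if finished >= 1 and elapsed >= target_s:
            break
        elapsed += duration
        finished += 1
    return elapsed, finished


# One very slow decoy with a 2-hour (7200 s) target: the task overruns.
slow = total_runtime([20000], 7200)
# Several quick decoys: the task stops after the decoy that crosses the target.
quick = total_runtime([3000, 3000, 3000, 3000], 7200)
```

Under this assumption a 2-hour preference cannot shorten a task whose first decoy alone needs 5-6 hours, which matches the run times reported above.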

sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0	Message 5757 - Posted: 20 Jul 2014, 15:24:06 UTC Last modified: 20 Jul 2014, 15:48:19 UTC
Checkpoint fail: Tc804_symm_hybrid, Rosetta mini 3.52 - linux x64 - boinc client 7.0.36
7 concurrent (identical) tasks started and ran for an hour, reaching 25% completion. I suspended the project and shut down boinc-client (note: done via boinc-gui). After restarting the project, only 1 of the 7 tasks resumed from 25% completion; the other 6 tasks restarted from 0. In effect, 6 x 1 hour of work was lost. I aborted 3 tasks and resumed at half load with 4 tasks.
Checkpoint preferences are set to 60 secs.
I'm not sure where the root cause is (boinc-client, the rosetta app, or the parameters used to run the app). E.g. if there are no structures and the session is interrupted, does that effectively mean that a restarted task starts from zero all over again? Hmm, perhaps something to be considered and improved on.
I'd think rosetta needs to save the state even if no structures have been generated, especially for such large/complex jobs, where for that matter there may be no structures at all (i.e. the run did not find a root/solution/model).
The other thing: if some jobs hit a 'dead end' (run for hours without finding solutions, perhaps going into endless chaotic loops), there'd be no structures, and no credits would be awarded/claimed?
I'd think participants need to have influence on the max default run time per task, i.e. some participants would not be too happy to crunch jobs that run for, say, 5-6 hours without finding a solution and hence earn no credits. If this is not possible, then participants may simply need to abort long jobs that go beyond the 'normal' duration (say, compared to the average of all other jobs).
Another way, I'd guess, is to award credits / allow claimed credits for the 'no solution' (no models) runs where no solutions are found after a 'reasonable' timeframe.
I guess the max default run time is 6 hours, so the app developers should consider terminating with credits in such cases. However, many participants with a fairly recent, somewhat 'fast' cpu may not want to continue the run after 3-4 hours without a solution. Hence, participants need a 'computing preference' to state that their preferred max default run time is, say, 4 hours.

sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0	Message 5758 - Posted: 20 Jul 2014, 16:42:52 UTC Last modified: 20 Jul 2014, 17:08:29 UTC
The remaining 4 Tc804_symm_hybrid work units completed successfully with no errors:
https://ralph.bakerlab.org/result.php?resultid=3014838, 4,763.77 cpu secs
https://ralph.bakerlab.org/result.php?resultid=3014871, 4,601.52 cpu secs
https://ralph.bakerlab.org/result.php?resultid=3014874, 4,476.31 cpu secs
https://ralph.bakerlab.org/result.php?resultid=3014875, 8,502.676 cpu secs
CPU time varies, as it seems; after all, pseudorandom numbers are involved in the simulations / solution search.
Caveat: those that clocked 4k cpu secs could be the jobs that showed up as 0% in boinc-client / gui after the shutdown interruption. That may suggest some bug (not sure where: boinc-client or rosetta) in updating the state xml statistics files, i.e. boinc-client/gui restarted showing 0%, but rosetta probably did save the state, and hence 3 jobs apparently ended in half the time.
In other words, the cpu secs statistics would be incorrect: those 3 jobs actually ran for 8k cpu secs. There are 4000 cpu secs 'lost' for each of the 3 jobs before the suspend project / boinc-client shutdown interruption; this is more like a missing statistics update.
What could be postulated is that rosetta did checkpoint and resumed from the interruption, hence the 3 jobs apparently completed in 4k cpu secs because the first half of the cpu secs statistics was 'lost'. If rosetta had not checkpointed, what would have shown up would be 8k cpu secs, and the actual total would be 12k cpu secs for those jobs.
I guess I'll upgrade my boinc-client to see if that resolves the issue.
---
Note that this has a major impact on credits claimed / granted, as the 'lost' cpu secs would suggest that the job can be done in 1/2 the cpu secs (i.e. half of the actual 8k), which is incorrect.

TPCBF Joined: 20 Jun 11 Posts: 30 Credit: 27,776 RAC: 0	Message 5759 - Posted: 20 Jul 2014, 21:31:25 UTC - in response to Message 5758.
Same here, the 4 WUs I picked up on the 17th just keep restarting from 0% over and over again, and each time, at least during the initial phase, they are thrashing the hard drive like crazy...
Is anyone from the project actually around to monitor any responses? Or is Mr. Baker & Cie only available when there's a chance to bask in the limelight?
Ralf

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5760 - Posted: 21 Jul 2014, 6:54:34 UTC - in response to Message 5759.
"Is anyone from the project actually around to monitor any responses? Or is Mr. Baker & Cie only available when there's a chance to bask in the limelight?"
You're too mocking... but you have some reason to be. There is a big lack of communication with the R@H team.

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5761 - Posted: 21 Jul 2014, 20:47:19 UTC
Wow, 20 points for over 6h of run :-P
# cpu_run_time_pref: 7200
BOINC:: CPU time: 22084.9s, 14400s + 7200s[2014- 7-21 22:38:48:] :: BOINC
WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
----- 0 -----
Stream information inconsistent.
Writing W_0000001
======================================================
DONE :: 1 starting structures 22084.9 cpu seconds
This process generated 1 decoys from 1 attempts
======================================================
called boinc_finish
</stderr_txt> ]]>
Validate state: Valid
Claimed credit: 78.0462089549509
Granted credit: 20

sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0	Message 5762 - Posted: 22 Jul 2014, 1:33:10 UTC - in response to Message 5761. Last modified: 22 Jul 2014, 1:55:10 UTC
"Wow, 20 points for over 6h of run :-P Validate state: Valid, Claimed credit: 78.0462089549509, Granted credit: 20"
Based on what I understand from the rosetta@home message boards, the granted credit tends to differ from the claimed credit (it is not necessarily lower) apparently because averages are used. I've observed cases where granted credit > claimed credit.
I.e. every participant's PC claims a certain number of computed credits (this is the actual cpu work done); if there is no fraud, the claimed credit is actually accurate. What's granted, however, is apparently the average.
As I've posted earlier in this thread, there are bugs in the boinc client: in my case, if I suspend the jobs, shut down the client and restart it later, statistics for the initial run can be lost. However, rosetta apparently did checkpoint successfully and resumed from the point where it was restarted.
Hence, if the task is worth, say, 100 credits and the shutdown occurs at 99 credits' worth of cpu time, then when I restart a task affected by the boinc client bug, it would complete and claim 1 credit. That would wrongly imply that a 100-credit job can be done with 1 credit of effort (this is completely inaccurate).
Rosetta should build into the formula a way to reject out-of-band claimed credits for jobs. This can be done by taking standard deviations and rejecting claims that fall more than one or 1.96 (95% confidence interval, http://en.wikipedia.org/wiki/1.96) standard deviations below the statistical average when computing the granted credits. That should result in the higher claimed credits being averaged, reflecting the true effort needed to complete the tasks.
I hope the admins consider that and enhance the server code.
rosetta@home/ralph@home should not be 'stingy' with credits, as those represent true work done, and the project as a whole is competing with other boinc projects to show that it is a popular project that gets participants' attention (it is a very good form of free advertising for the project).
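
The outlier rejection proposed above can be sketched as follows. This is only an illustration of the suggestion, not the project's actual server code; the function name `robust_average`, the parameter `k` and the sample claims are made up for the example.

```python
from statistics import mean, pstdev


def robust_average(claims, k=1.96):
    """Average the claimed credits, ignoring unusually low claims.

    Claims more than k standard deviations below the plain mean are
    dropped before averaging (k=1.96 is the value suggested in the post).
    """
    if not claims:
        raise ValueError("need at least one claimed credit")
    if len(claims) < 2:
        return mean(claims)
    mu = mean(claims)
    sigma = pstdev(claims)  # population standard deviation of all claims
    cutoff = mu - k * sigma
    kept = [c for c in claims if c >= cutoff]
    return mean(kept)


# Five honest claims near 100 and one claim of 1 from a task whose
# CPU-time statistics were lost on restart (hypothetical numbers).
average = robust_average([100, 98, 102, 101, 99, 1])  # 100; the claim of 1 is dropped
```

One caveat with this kind of filter: with very few results per batch, a single low claim inflates the standard deviation so much that it may not fall below the cutoff, so the filter only helps once enough results have been reported.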

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5763 - Posted: 22 Jul 2014, 8:52:51 UTC - in response to Message 5762.
"I've observed cases where granted credit > claimed credit."
I know, I know the granted/claimed 'problem'. I hope you realize I don't participate for credits :-)
"I hope the admins consider that and enhance the server code."
This request is repeated frequently on the Rosetta forum. Other volunteers have suggested that it is a problem with the customization of the current server. But the admins have not said anything.

sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0	Message 5764 - Posted: 22 Jul 2014, 15:57:54 UTC - in response to Message 5763. Last modified: 22 Jul 2014, 16:19:14 UTC
"I know, I know the granted/claimed 'problem'. I hope you realize I don't participate for credits :-) [...] This request is repeated frequently on the Rosetta forum. Other volunteers have suggested that it is a problem with the customization of the current server. But the admins have not said anything."
Strictly speaking, I'm speculating that a possible factor in the low credits is an instance of (most likely) faulty boinc-client s/w. As statistics are 'lost' on a shutdown/restart, it 'mis-reports' credits to the server. As the claimed credits are much lower after the restart, this affects the average credits awarded to the task and to any later participants.
While I normally ignore them (like you, credits aren't really my purpose in crunching rosetta), it may make some participants unhappy about the low granted credits, especially those who pick up subsequent jobs of the same kind.
The solution of course is to fix my (an instance of a) faulty boinc client, but I'm just putting in my 2 cents of reasoning on the 'collateral damage' that others may observe.
My guess is that this issue may be partially alleviated on the server side if the server ignores exceptionally low credits when computing the granted credits.
I'm not too sure if there is a better way to award 'credits'; taking an average of the reported claimed credits is after all a good way to measure the work done, statistically averaged across different systems. It is just that this 'simple' points (credit) system is prone to being affected by 'mis-behaving' clients. I guess there really aren't perfect solutions.
I'll soon upgrade my client; hopefully that will 'fix' some of the statistical issues from my little leaf node.

Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0	Message 5765 - Posted: 22 Jul 2014, 21:49:14 UTC Last modified: 22 Jul 2014, 22:02:32 UTC
It is NOT "faulty boinc-client s/w" OR "statistics are 'lost' on a shutdown/restart".
It is faulty rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all (I have already checked this: intermediate checkpoints are not working in the latest WU batches; it seems only full/finished models are saved to disk).
So at each restart ALL the work done before the restart goes into the trash can, and the work starts from scratch after the restart. So the BOINC software is right to reset the statistics and credits to zero too, because 0 useful work done = 0 Cr.
Also, some WUs run so long (possibly the algorithm loops infinitely, or the model is just very difficult to calculate) that even after 5-7 hours of running (without interruptions / restarts) on a modern CPU they cannot finish the very first model (decoy).
In these situations the claimed credit (calculated by the BOINC client) will be normal. But the granted credit is actually 0, because R@H uses the following formula for granted credit: the average claimed credit per 1 decoy (collected and calculated from previous users who reported WUs from the same batch) multiplied by the number of decoys reported in the particular task of a specific user. So if the decoy count = 0, granted credit = 0 too.
But later the programmers added an exception: if a user reports a task with decoy count = 0, the general formula (which gives 0 Cr) is not used, and the WU is rewarded with a fixed 20 Cr as a sort of consolation/booby prize.
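
The credit rule described in this post can be written down as a small sketch. It follows the post's description only and is not taken from the actual R@H server code; the names `granted_credit` and `CONSOLATION_CREDIT`, the way the per-decoy average is formed (a mean of each earlier result's credit-per-decoy), and the sample numbers are assumptions.

```python
from statistics import mean

# Fixed award for zero-decoy results, as reported in the post.
CONSOLATION_CREDIT = 20.0


def granted_credit(prior_results, decoys_reported):
    """Granted credit for one result, per the rule described in the post.

    prior_results: list of (claimed_credit, decoy_count) pairs from
    earlier results of the same batch (hypothetical input format).
    """
    if decoys_reported == 0:
        # Exception: the general formula would give 0 Cr here.
        return CONSOLATION_CREDIT
    per_decoy = [claimed / count for claimed, count in prior_results if count > 0]
    if not per_decoy:
        # The post does not say what happens when there is no history yet.
        return None
    return mean(per_decoy) * decoys_reported


# Earlier results claimed 40 and 60 credits per decoy, so the average is 50.
three_decoys = granted_credit([(80, 2), (60, 1)], 3)  # 150.0
no_decoys = granted_credit([(80, 2), (60, 1)], 0)     # 20.0
```

This also shows why a 6-hour task that claims about 78 credits can end up with exactly 20: once the decoy count is 0, the claimed credit no longer enters the calculation.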

Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0	Message 5766 - Posted: 22 Jul 2014, 21:59:23 UTC - in response to Message 5756.
"I have not tested the checkpointing, so I can't say whether the work units restart from zero as you are seeing. Conan"
To roughly check whether checkpoints work, it is not necessary to restart. You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint".
If checkpoint saving is working normally, this line shows the time (counted from the start of the task) at which the last checkpoint was saved. If checkpointing does not work, the line shows "-- --". Or it shows a time a few hours behind the total CPU time: if the client managed to finish at least one model completely and recorded it on disk, that also counts as a checkpoint, and this part usually works normally.
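
The check described above boils down to comparing two numbers from the task properties. A minimal sketch, assuming the two CPU times have already been read off by hand: the function name `checkpoint_status`, the status strings and the 10-minute threshold `max_gap_s` are all hypothetical and only illustrate the reasoning.

```python
def checkpoint_status(current_cpu_s, checkpoint_cpu_s, max_gap_s=600.0):
    """Classify a running task from its CPU time and last-checkpoint CPU time.

    checkpoint_cpu_s is None (or 0) when the properties line shows "-- --".
    max_gap_s is an assumed tolerance, not a BOINC setting.
    """
    if checkpoint_cpu_s is None or checkpoint_cpu_s <= 0:
        return "no checkpoint yet"
    gap = current_cpu_s - checkpoint_cpu_s
    if gap > max_gap_s:
        # Probably only finished models are being saved, nothing in between.
        return "stale checkpoint"
    return "checkpointing normally"


# A task 75 minutes in that has never checkpointed.
status_a = checkpoint_status(4500, None)
# A task whose last checkpoint is one minute old.
status_b = checkpoint_status(4500, 4440)
# A task whose last checkpoint is hours behind its CPU time.
status_c = checkpoint_status(20000, 7000)
```

A "stale checkpoint" result corresponds to the last case in the post: a finished model was written to disk, but no intermediate checkpoints have been saved since.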

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5767 - Posted: 23 Jul 2014, 6:40:03 UTC - in response to Message 5765. Last modified: 23 Jul 2014, 6:41:11 UTC
"It is faulty rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all (I have already checked this: intermediate checkpoints are not working in the latest WU batches; it seems only full/finished models are saved to disk). So at each restart ALL the work done before the restart goes into the trash can, and the work starts from scratch after the restart."
Yep, I know the decoy/checkpoint 'situation'. My problem is that some (not a few) WUs restart during crunching, without a restart of the PC or BOINC Manager, and, I repeat, with this message:
Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file

cmt.explorer Joined: 24 Jul 14 Posts: 2 Credit: 95 RAC: 0	Message 5768 - Posted: 24 Jul 2014, 13:18:15 UTC
Greetings,
what is meant in the news post for this thread by "If you have an android arm device/phone that supports android-9, ..."? What is "android-9" supposed to be?
I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work. Additionally, I got the message "Rosetta Mini is not available for your type of computer."
Any ideas? Thanks!

TPCBF Joined: 20 Jun 11 Posts: 30 Credit: 27,776 RAC: 0	Message 5769 - Posted: 25 Jul 2014, 2:41:26 UTC - in response to Message 5766.
"To roughly check whether checkpoints work, it is not necessary to restart. You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint". [...]"
The problem with the current checkpoint setting in the WUs is that the recent batch of WUs seems to reset itself a lot, always starting from scratch instead of being able to continue from the last checkpoint. That's the purpose of checkpoints. As it is currently, a lot of processing power gets wasted this way...
Ralf

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5770 - Posted: 25 Jul 2014, 6:26:36 UTC - in response to Message 5768.
"What is "android-9" supposed to be?"
I'm not sure, but I think it's the API level (see ApiLevels).
"I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work. Additionally, I got the message "Rosetta Mini is not available for your type of computer." Any ideas?"
The first version running on Android is 3.53. The current batch is 3.52, so this message is normal.
For the admins: I know 3.53 is the first version, but can you optimize it, please? This version uses a lot of RAM and a lot of disk space, and it restarts continuously...

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5771 - Posted: 25 Jul 2014, 6:27:04 UTC - in response to Message 5769.
"As it is currently, a lot of processing power gets wasted this way..."
+1

sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0	Message 5772 - Posted: 27 Jul 2014, 15:25:56 UTC - in response to Message 5765. Last modified: 27 Jul 2014, 16:11:12 UTC
"It is NOT "faulty boinc-client s/w" OR "statistics are 'lost' on a shutdown/restart". It is faulty rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all [...] So the BOINC software is right to reset the statistics and credits to zero too, because 0 useful work done = 0 Cr."
Hi Max,
Thanks much for your post. I think I can confirm your observation: there is no checkpoint! All 6 concurrent ralph@home tasks did not checkpoint after running for more than an hour.
In the screen print, the time of last checkpoint is "--" and the elapsed time is some 1 hour 15 minutes. Compare this to a concurrently running task from rosetta@home: the rosetta@home task is checkpointing well, as indicated by its time of last checkpoint.
Note that apparently the minirosetta 3.52, 2.53 (beta) binaries running on ralph@home and rosetta@home are the same:
https://ralph.bakerlab.org/forum_thread.php?id=557&nowrap=true#5753
Note also that all these ralph@home and rosetta@home sessions are running concurrently in the same boinc-client (7.0.36)!
Is there some error in the job run parameters, or is it necessary to improve minirosetta to make such complex jobs/tasks checkpoint?

[VENETO] boboviz Joined: 9 Apr 08 Posts: 932 Credit: 1,892,541 RAC: 294	Message 5773 - Posted: 28 Jul 2014, 18:29:23 UTC
Validate errors: 3052012 3052013