Message boards : RALPH@home bug list : minirosetta beta 3.50-3.52 apps
Author | Message |
---|---|
Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0 |
WUs with names Tc794_hybrid... and Tc804_symm_hybrid... have problems with checkpointing (usually it does not work at all: progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours). Also, some of these WUs grant only 20 Cr and have "InternalDecoyCount: 0 (GZ)" in the logs. AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU. |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
As I said, continuous restarts: 20/07/2014 12:50:54 | ralph@home | Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file. I reset the project, without result. |
Conan Joined: 16 Feb 06 Posts: 364 Credit: 1,368,421 RAC: 0 |
"Wus with names" Each work unit will run until at least 1 decoy has been completed. If that takes longer than your run-time preference, it keeps going until at least 1 decoy finishes. Setting it to 2 hours will work most of the time. The 5-6 hour run times happen to suit my 6 hour preference setting, but even then I have had some go over 7 hours. I have not tested the checkpointing, so I can't say whether the work units restart from zero as you are seeing. Conan |
sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0 |
Checkpoint fail: Tc804_symm_hybrid, Rosetta mini 3.52, linux x64, boinc client 7.0.36. 7 concurrent (identical) tasks started and ran for an hour, reaching 25% completion. I suspended the project and shut down boinc-client (note: done via boinc-gui), then restarted the project. Out of the 7 tasks only 1 restarted from 25% completion; the other 6 restarted from 0, in effect losing 6 x 1 hours of work. I aborted 3 tasks and resumed at half load with 4 tasks. The checkpoint preference is set to 60 secs. Not sure where the root cause is (boinc-client, the rosetta app, or the parameters used to run the app). E.g. if there are no structures yet and the session is interrupted, does that effectively mean a restarted task starts from zero all over again? Hmm, perhaps something to be considered and improved on. I'd think rosetta needs to save its state even if no structures have been generated, especially for such large/complex jobs where, for that matter, there may be no structures at all (i.e. the run did not find a root/solution/model). The other thing: if some jobs hit a 'dead end' (run for hours without finding solutions, perhaps going into endless chaotic loops), there would be no structures, so no credits would be awarded/claimed? I'd think participants need to have influence on the max default run time per task, i.e. some participants would not be too happy to crunch jobs that run for, say, 5-6 hours, find no solution and hence earn no credits. If this is not possible, then participants may simply need to abort long jobs that go beyond the 'normal' duration (say, compared to the average of all other jobs). Another way, I'd guess, is to award credits / allow claimed credits for 'no solution' (no models) runs where no solutions are found after a 'reasonable' timeframe. I guess the max default run time is 6 hours; hence, the app developers should consider terminating with credits in such cases. However, for many participants with a fairly recent, reasonably fast cpu, after 3-4 hours without a solution the participant may not want to continue the run. Hence, participants need a 'computing preference' to state that their preferred max run time is, say, 4 hours. |
sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0 |
The remaining 4 Tc804_symm_hybrid work units completed successfully with no errors: https://ralph.bakerlab.org/result.php?resultid=3014838, 4,763.77 cpu secs; https://ralph.bakerlab.org/result.php?resultid=3014871, 4,601.52 cpu secs; https://ralph.bakerlab.org/result.php?resultid=3014874, 4,476.31 cpu secs; https://ralph.bakerlab.org/result.php?resultid=3014875, 8,502.676 cpu secs. CPU time varies, it seems; after all, pseudorandom numbers are involved in the simulations/solution search. *Caveat*: those that clocked 4k cpu secs could be the jobs that showed up as 0% in boinc-client / gui after the shutdown interruption. That may suggest some bug (not sure where: boinc-client or rosetta) in updating the state xml statistics files, i.e. boinc-client/gui restarted showing 0%, but rosetta probably did save its state, and hence 3 jobs apparently finished in half the time. I.e. the cpu secs statistic is incorrect; those 3 jobs actually ran for 8k cpu secs. 4,000 cpu secs were 'lost' for each of the 3 jobs before the suspend project / boinc-client shutdown interruption; this is more like a missing statistics update. What could be postulated is that rosetta did checkpoint and resumed from the interruption, hence the 3 jobs apparently completed in 4k cpu secs because the first half of the cpu secs statistics was 'lost'. I.e. if rosetta had not checkpointed, what would have shown up would be 8k cpu secs and the actual total would be 12k cpu secs for those jobs. I guess I'll upgrade my boinc-client to see if that resolves the issue. --- Note that this has a major impact on credits claimed / granted, as the 'lost' cpu secs would suggest that the job can be done in 1/2 the cpu secs (i.e. half of the 8k actual), which is *incorrect*. |
TPCBF Joined: 20 Jun 11 Posts: 30 Credit: 27,776 RAC: 0 |
Same here, the 4 WUs I picked up on the 17th just keep restarting from 0% over and over again, and each time, at least initially, they thrash the hard drive like crazy... Is anyone from the project actually around to monitor any responses? Or is Mr. Baker & Cie only available when there's a chance to bask in the limelight? Ralf |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
"Is anyone from the project actually around to monitor any responses. Or is Mr.Baker & Cie only available when there's a chance to bask in the limelight?" You're too mocking... but you have some reason. There is a big lack of communication with the R@h team. |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
Wow, 20 points for over 6h of run :-P # cpu_run_time_pref: 7200 |
sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0 |
"Wow, 20 points for over 6h of run :-P" Based on what I understand from the rosetta@home message boards, the granted credit tends to differ from the claimed credit (not necessarily lower), apparently because averages are used. I've observed cases where granted credit > claimed credit. I.e. every participant's PC claims a certain number of computed credits (this is the actual cpu work done); if there is no fraud, the claimed credit is actually *accurate*. However, what's granted is apparently the average. As I posted earlier in this thread, there are bugs in the boinc client: in my case, if I suspend the jobs, shut down the client and restart it later, the statistics for the initial run can be lost. However, rosetta apparently did checkpoint successfully and resumed from the point where it was restarted. Hence, if the task is worth say 100 credits and the shutdown occurs at 99 credits' worth of cpu time, then when I restart a task affected by the boinc client bug, it would complete and claim 1 credit. That would *wrongly* imply that a 100-credit job can be done with 1 credit of effort (this is completely inaccurate). Rosetta should build into the formula a way to reject out-of-band claimed credits for jobs. This can be done by taking the standard deviation and rejecting claims that fall more than one or 1.96 standard deviations (95% confidence interval, http://en.wikipedia.org/wiki/1.96) below the statistical average when computing the granted credits. That should result in the higher claimed credits being averaged to reflect the true effort needed to complete the tasks. Hope the admins consider that and enhance the server code. rosetta@home/ralph@home should not be 'stingy' with credits, as those are *true work done*, and the project as a whole is competing with other boinc projects to show that it is a popular project that's getting participants' attention (it is a very good form of free advertising for the project). |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
"i've observed cases where granted credit > claimed credit" I know, I know the granted/claimed "problem". I hope you realize I don't participate for credits :-) "hope admins consider that and enhance the server codes." This request is repeated frequently on the Rosetta forum. Other volunteers have suggested that it is a problem with the customization of the current server. But the admins have not said anything. |
sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0 |
"i've observed cases where granted credit > claimed credit" Strictly speaking, I'm speculating that a possible factor for the low credits is an *instance* of (most likely) faulty boinc-client s/w. As statistics are 'lost' on a shutdown/restart, it 'mis-reports' credits to the server. As the claimed credit is much lower after the restart, it affects the average credit that's awarded to the task and to any later participants. While I normally ignore this (as, like you, credits aren't really my reason to crunch rosetta), it may make some participants unhappy about the low granted credits, especially those who pick up subsequent jobs of the same kind. The solution of course is to fix my (an instance of) faulty boinc client, but I'm just putting in my 2 cents of reasoning on the 'collateral damage' that others may observe. My guess is that this issue may be partially alleviated on the server side if the server ignores exceptionally low claims when computing the granted credits. I'm not too sure if there is a better way to award 'credits'; taking an average of reported claimed credits is after all a good way to measure the work done, statistically averaged across different systems. It's just that this 'simple' points (credit) system is prone to being affected by 'misbehaving' clients. I guess there really aren't perfect solutions. I'll soon upgrade my client; hopefully that'll 'fix' some of the statistical issues from my little leaf node. |
Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0 |
It is NOT "faulty boinc-client s/w" OR "statistics is 'lost' on a shutdown/restart". It is faulty rosetta software (or a particular batch of WUs): it simply does not write checkpoints at all (I already checked this: intermediate checkpoints in the latest WU batches are not working; it seems only full/finished models are saved to disk). So at each restart ALL work done before the restart goes in the trash can, and work starts from scratch after the restart. So the BOINC software does the right thing when it resets statistics and credits to zero too, because 0 useful work done = 0 Cr. Also, some WUs run so long (possibly the algorithm looped infinitely, or the model is just very difficult to calculate) that even after 5-7 hours of running (without interruptions / restarts) on a modern CPU they cannot finish the very first model (decoy). In this situation the claimed credit (calculated by the BOINC client) will be normal. But the granted credit is actually 0, because R@H uses this formula for granted credit: the average claimed credit per decoy (collected and calculated from previous users who reported WUs from the same batch) multiplied by the number of decoys reported in the particular task of a specific user. So if decoy count = 0, granted credit = 0 too. But later the programmers added an exception: if a user reports a task with decoy count = 0, the general formula (which gives 0 Cr) is not used; instead the WU is rewarded with a fixed 20 Cr as some sort of consolation/booby prize. |
Mad_Max Joined: 15 Nov 12 Posts: 15 Credit: 404,700 RAC: 0 |
"Wus with names" To roughly check whether checkpoints work, it is not necessary to restart. You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint". If checkpoint saving is working normally, it shows the time (counted from the start of the task) of the last saved checkpoint. If checkpointing does not work, this line shows "-- --". Or it shows a time a few hours behind the total CPU time: if the client managed to finish at least one model completely and recorded it on disk, that also counts as a checkpoint, and this part usually works normally. |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
"It is faulty rosetta software (or particular WUs batch) - it simply not write checkpoints at all (i already check this - intermediate checkpoints in last Wus batches not working, seems only full/finished models saved to disk). So at each restart ALL work already done before restart went to trash can. And start work from scratch after restart." Yep, I know the decoy/checkpoint "situation". My problem is that some (not a few) WUs restart during crunching, without a restart of the PC/BOINC manager, and, I repeat, with this message: Task Tc804_symm_hybrid_20174_15261_0 exited with zero status but no 'finished' file |
cmt.explorer Joined: 24 Jul 14 Posts: 2 Credit: 95 RAC: 0 |
Greetings, what is meant in the news post for this thread by "If you have an android arm device/phone that supports android-9, ..."? What is "android-9" supposed to be? I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work; additionally I got the message "Rosetta Mini is not available for your type of computer". Any ideas? Thanks! |
TPCBF Joined: 20 Jun 11 Posts: 30 Credit: 27,776 RAC: 0 |
"To roughly check works of checkpoints not necessarily to restart." The problem with the current checkpoint setting in the WUs is that the recent batch of WUs seems to reset itself a lot, always starting from scratch instead of being able to continue from the last checkpoint. That's the purpose of checkpoints. As it is currently, a lot of processing power gets wasted this way... Ralf |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
"What should 'android-9' be?" I'm not sure, but I think it's the version of the API (ApiLevels). "I tried to start a workunit on Android 4.4.4 with the NativeBoinc client but it didn't work - additonally I got the message 'Rosetta Mini is not availiable for your type of computer.'" The first version running on android is 3.53. Right now there is a batch of 3.52, so this message is normal. For Admins: I know 3.53 is the first version, but can you optimize it, please? This version uses a lot of RAM, a lot of disk space, continuous restarts... |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
"As it is currently, a lot of processing power get's wasted this way..." +1 |
sgaboinc Joined: 8 Jul 14 Posts: 20 Credit: 4,159 RAC: 0 |
"It is NOT 'faulty boinc-client s/w' OR 'statistics is lost on a shutdown/restart'" Hi Max, thanks much for your post. I think I can confirm your observation: there is no checkpoint! All 6 concurrent ralph@home tasks did not checkpoint after running for more than an hour. This is a screen print: time of last checkpoint is -- and elapsed time is some 1 hour 15 minutes. Compare that to a concurrently running task from rosetta@home: the rosetta@home task is checkpointing well, as indicated by the time of last checkpoint. Note that apparently the minirosetta 3.52, 2.53 (beta) binaries running on ralph@home and rosetta@home are the same (see https://ralph.bakerlab.org/forum_thread.php?id=557&nowrap=true#5753 ). Note that all these ralph@home and rosetta@home sessions are running concurrently in the same boinc-client (7.0.36)! Is there some error in the job run parameters, or is it necessary to improve minirosetta to make such complex jobs/tasks checkpoint? |
[VENETO] boboviz Joined: 9 Apr 08 Posts: 904 Credit: 1,889,390 RAC: 0 |
|
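
Mad_Max's description of the granted-credit rule (average claimed credit per decoy from earlier results in the batch, times the decoys reported, with a flat 20 Cr when no decoy finished) can be sketched roughly as below. This is only an illustration of the rule as stated in the thread, not the project's server code; the function name `granted_credit` and its inputs are hypothetical.

```python
from statistics import mean

# Flat award the thread reports for tasks that finish zero decoys.
CONSOLATION_CREDIT = 20.0


def granted_credit(prior_claims_per_decoy, decoy_count):
    """Granted credit for one task, following the rule described in the thread.

    prior_claims_per_decoy: claimed credit per decoy from earlier results of
        the same batch (hypothetical input; the real data model is not shown).
    decoy_count: number of decoys this task reported.
    """
    if decoy_count == 0:
        # Exception described by Mad_Max: a fixed consolation award
        # instead of the 0 Cr the general formula would give.
        return CONSOLATION_CREDIT
    if not prior_claims_per_decoy:
        raise ValueError("no earlier results to average")
    return mean(prior_claims_per_decoy) * decoy_count
```

Under this sketch a task that runs 6 hours without finishing a decoy gets 20 Cr regardless of CPU time, which matches the "20 points for over 6h" result reported above.
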
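
sgaboinc's suggestion of ignoring claimed credits that fall far below the average could look something like the following. It is a minimal sketch under stated assumptions: the helper name `average_without_low_outliers` is made up for illustration, the population standard deviation is an arbitrary choice, and nothing here reflects how the actual server averages claims.

```python
from statistics import mean, pstdev


def average_without_low_outliers(claims, z=1.96):
    """Average the claims, ignoring any that fall more than z standard
    deviations below the mean of all claims.

    claims: claimed credits reported for comparable tasks (hypothetical input).
    z: cutoff in standard deviations; the thread suggests 1 or 1.96.
    """
    if not claims:
        raise ValueError("claims must not be empty")
    mu = mean(claims)
    sigma = pstdev(claims)  # assumption: population standard deviation
    cutoff = mu - z * sigma
    kept = [c for c in claims if c >= cutoff]
    return mean(kept)
```

For example, with nine claims near 100 and one claim of 1 from a client that lost its statistics, the low claim falls below the cutoff and the average of the rest is used. With very few claims a single outlier may not exceed the cutoff, so the filter would only help once enough results are in.
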
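
The manual check Mad_Max describes (compare "CPU time at last checkpoint" with the task's total CPU time) can be written down as a small classifier. This is a sketch only: the function name, the use of `None` for a dashed-out field, and the 600-second threshold are all assumptions chosen for illustration, and the values would have to be read off the task's properties by hand.

```python
def checkpoint_status(cpu_time_s, checkpoint_cpu_time_s, max_gap_s=600.0):
    """Classify a running task from two values in its properties dialog.

    cpu_time_s: total CPU time used so far, in seconds.
    checkpoint_cpu_time_s: "CPU time at last checkpoint" in seconds, or None
        when the client shows dashes instead of a time.
    max_gap_s: hypothetical threshold for calling a checkpoint stale; pick a
        value well above your checkpoint interval.
    """
    if checkpoint_cpu_time_s is None:
        # No checkpoint has been written since the task started.
        return "no checkpoint"
    if cpu_time_s - checkpoint_cpu_time_s > max_gap_s:
        # Likely only a finished model was saved, long ago.
        return "stale checkpoint"
    return "checkpointing"
```

A task at 1 hour 15 minutes with dashes in the checkpoint field, as in sgaboinc's report, would come out as "no checkpoint".
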
©2024 University of Washington
http://www.bakerlab.org