Posts by Mad_Max

1) Message boards : RALPH@home bug list : Rosetta 4.12+ (Message 6712)
Posted 8 Apr 2020 by Mad_Max
Post:
I aborted some of the older test batches. I'm not sure why your client is getting confused and running the wrong app. It should be running the 64bit version on your 64bit computer.

I was writing about getting the 64-bit "wrapper" app on 32-bit machines, including old ones running WinXP.
Of course all such WUs fail, since Win32 systems cannot execute any 64-bit apps. They produce the error "Application is not a valid Win32 app" right at start.

I don't have any problems on 64-bit Windows systems currently. The latest problem was download failures for small files, but it looks like that is resolved now, as I haven't seen such errors for about a week.

If older systems are no longer supported by the project, you should adjust the server scheduler accordingly, so that it does not send tasks to such machines and responds with an error/warning instead. Sending work to hosts doomed to a 100% error rate wastes internet bandwidth and adds excess server load.
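
A rough sketch of the kind of scheduler-side check meant here. This is not BOINC's actual scheduler code; the function name and the data layout are made up for illustration. Only the platform names (windows_intelx86, windows_x86_64) are real BOINC platform names.

```python
from typing import Optional, Tuple


def pick_app_version(host_platforms: list, app_versions: dict) -> Optional[Tuple[str, str]]:
    """Return (platform, version) for the first platform the host supports.

    host_platforms: platforms the host reports, in order of preference (hypothetical layout).
    app_versions:   platform name -> app version the project currently offers.
    Returns None when nothing matches, meaning "send no work, report a warning".
    """
    for platform in host_platforms:
        if platform in app_versions:
            return platform, app_versions[platform]
    return None


# Project that only ships a 64-bit Windows build (illustrative values).
offered = {"windows_x86_64": "4.12"}

# A 32-bit WinXP host gets nothing instead of an app it cannot execute.
print(pick_app_version(["windows_intelx86"], offered))

# A 64-bit host reports both platforms and gets the 64-bit build.
print(pick_app_version(["windows_x86_64", "windows_intelx86"], offered))
```

With a check like this the 32-bit host would get a "no work available for your platform" style reply instead of downloading files it can never run.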
2) Message boards : RALPH@home bug list : Rosetta 4.12+ (Message 6694)
Posted 6 Apr 2020 by Mad_Max
Post:
7.14.2 and 7.14.4 are the latest stable versions of the BOINC client.

But there are newer beta-test versions available:

https://boinc.berkeley.edu/download_all.php

https://boinc.berkeley.edu/dl/
3) Message boards : RALPH@home bug list : Rosetta 4.12+ (Message 6693)
Posted 6 Apr 2020 by Mad_Max
Post:
All R@H WUs fail on the 32-bit version of Windows, because R@H tries to start the 64-bit version (which is actually not 64-bit, but a 32-bit app in a 64-bit wrapper).
06/04/2020 10:44:10 | ralph@home | Finished download of minirosetta_database_357d5d93529_n_methyl.zip
06/04/2020 10:44:43 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:44 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:45 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:45 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:46 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:51 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:52 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:52 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:52 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:52 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:57 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:58 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:59 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:59 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
06/04/2020 10:44:59 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)

error code 193 = Application is not a valid Win32 app

If you want to drop Win32 support, then you should stop sending work to all such hosts and raise the minimum system requirements.
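
For reference, Windows error 193 is ERROR_BAD_EXE_FORMAT. To see which architecture an .exe was built for without running it, you can read the Machine field from its PE header. A minimal sketch (the helper names are my own, and fake_pe only builds a header stub for the demo, not a runnable program):

```python
import struct

# Machine values from the PE/COFF header: 0x014C = i386, 0x8664 = AMD64.
MACHINE_NAMES = {0x014C: "x86 (32-bit)", 0x8664: "x86-64 (64-bit)"}


def pe_machine(data: bytes) -> str:
    """Return the CPU architecture named in a PE image's COFF header."""
    if data[:2] != b"MZ":
        raise ValueError("not an MZ executable")
    # The 4-byte little-endian offset of the PE signature is stored at 0x3C.
    (pe_offset,) = struct.unpack_from("<I", data, 0x3C)
    if data[pe_offset:pe_offset + 4] != b"PE\0\0":
        raise ValueError("PE signature not found")
    # The Machine field is the first 2 bytes right after the signature.
    (machine,) = struct.unpack_from("<H", data, pe_offset + 4)
    return MACHINE_NAMES.get(machine, f"unknown (0x{machine:04x})")


def fake_pe(machine: int) -> bytes:
    """Build a header-only stub for demonstration (hypothetical helper)."""
    header = bytearray(0x40)
    header[:2] = b"MZ"
    struct.pack_into("<I", header, 0x3C, 0x40)  # PE signature follows the stub
    return bytes(header) + b"PE\0\0" + struct.pack("<H", machine)


print(pe_machine(fake_pe(0x8664)))
print(pe_machine(fake_pe(0x014C)))
```

On a real file you would pass the first few KB read with open(path, "rb").read(4096) instead of the stub.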
4) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6492)
Posted 17 Jan 2018 by Mad_Max
Post:
ALL of my WUs from the latest batch errored out.

Some very soon (just 1-2 min) after start, with errors like:
ERROR: Warning: can't open file t000_.fasta!
ERROR:: Exit from: ..\..\..\src\core\sequence\util.cc line: 148
BOINC:: Error reading and gzipping output datafile: default.out


Others near the end of computation, with different errors:
<message>
upload failure: <file_xfer_error>
  <file_name>40a6c_SOL_jumping_all_pairings_SAVE_ALL_OUT_20700_2_1_r1951005696_0</file_name>
  <error_code>-161 (not found)</error_code>
</file_xfer_error>
</message>
5) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6242)
Posted 22 Nov 2017 by Mad_Max
Post:
Same as before (http://ralph.bakerlab.org/forum_thread.php?id=586&nowrap=true#6224) - all WUs failed immediately after start.
Actually they cannot even start - BOINC fails to launch the R@H process.
.....................
22/11/2017 05:26:23 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:24 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:24 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:25 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:25 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:27 | ralph@home | Computation for task ab_12_01__vall_2011_2ci2I_vall_2011_9mers_3mers_20693_63_0 finished
22/11/2017 05:26:27 | ralph@home | Output file ab_12_01__vall_2011_2ci2I_vall_2011_9mers_3mers_20693_63_0_0 for task ab_12_01__vall_2011_2ci2I_vall_2011_9mers_3mers_20693_63_0 absent
22/11/2017 05:26:29 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:30 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:30 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:31 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:31 | ralph@home | [error] Process creation failed: (unknown error) - error code 193 (0xc1)
22/11/2017 05:26:34 | ralph@home | Computation for task cp11d2v1_cpp_11_D2_v1_SAVE_ALL_OUT_20695_1177_0 finished
22/11/2017 05:26:34 | ralph@home | Output file cp11d2v1_cpp_11_D2_v1_SAVE_ALL_OUT_20695_1177_0_0 for task cp11d2v1_cpp_11_D2_v1_SAVE_ALL_OUT_20695_1177_0 absent
..............
6) Message boards : RALPH@home bug list : Rosetta_beta 4.0+ (Message 6224)
Posted 3 Nov 2017 by Mad_Max
Post:
All of my rosetta_beta_4.05 WUs also failed immediately after start.
Actually they cannot even start - BOINC fails to launch the R@H process.

And if I try to launch rosetta_beta_4.05_windows_intelx86.exe from the project folder, I get an error: "rosetta_beta_4.05_windows_intelx86.exe is not a Win32 application".
7) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 6068)
Posted 19 Mar 2016 by Mad_Max
Post:
Same here. A LOT of random WUs crash on v3.72.
Different hosts, different CPUs (4/6/8 cores), different OSes (Win 7 x64 and WinXP x32) - all are getting a lot of failed WUs with "Unhandled Exception Detected..." in the logs.
8) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5993)
Posted 14 Jan 2016 by Mad_Max
Post:
P.S.
The problem of huge slowdowns at initialization on HDDs has long been known for R@H/RALPH. But it keeps getting worse over time, because the main minirosetta database grows with almost every new release:
- when I started crunching for R@H it was "only" ~1500 files and 100 MB
- 2 years ago it was ~2500 files and 150 MB
- now it is ~4000 files and ~350 MB for the latest minirosetta v3.70

So perhaps there are no new bugs in BOINC or R@H, and with the last update we have simply reached the limits of HDDs and started triggering some sort of BOINC timeout/watchdog.
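
To put rough numbers on that growth, here is a small back-of-the-envelope script. The figures are the approximate ones quoted above (sizes read as megabytes); the assumption that every WU unpacks its own full copy of the database at startup is mine for this estimate.

```python
# (label, number of files, size in MB) - approximate figures from this post.
SNAPSHOTS = [
    ("when I started", 1500, 100),
    ("2 years ago", 2500, 150),
    ("minirosetta v3.70", 4000, 350),
]


def avg_file_kb(files: int, size_mb: float) -> float:
    """Average file size in KB for a database snapshot."""
    return size_mb * 1024 / files


def startup_extraction(files: int, size_mb: float, parallel_wus: int):
    """Files and MB written if each WU unpacks its own copy of the database."""
    return files * parallel_wus, size_mb * parallel_wus


for label, files, size_mb in SNAPSHOTS:
    print(f"{label}: {files} files, {size_mb} MB, "
          f"~{avg_file_kb(files, size_mb):.0f} KB per file")

files, size_mb = startup_extraction(4000, 350, parallel_wus=3)
print(f"3 WUs starting together: {files} files, {size_mb} MB to write")
```

The point is the file count: thousands of small files per WU mean mostly seek-bound work on an HDD, which is why it hurts far more than the raw MB figure suggests.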
9) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5992)
Posted 14 Jan 2016 by Mad_Max
Post:
This is not about having a few Rosetta WUs running at the same time. Running 4 WUs, or even 6-8 WUs, in parallel on an HDD is OK if the PC has enough RAM, because at the running stage R@H uses the disk only a little and does not cause any problems.

At the loading stage (after a BOINC restart / PC reboot) the disk load from R@H/RALPH is much higher, but an HDD can still cope with at least 4 WUs starting up in parallel without problems (it takes a few minutes of high HDD load, but does not cause any errors).

It is about the initial start/initialization of several new WUs in parallel, because at this point the full rosetta database (which is now HUGE: ~4000 files, ~350 MB) plus the WU data is extracted from the project folder to the data\slots\x folder for each WU. This puts really high stress on HDDs and slows them down badly.

Usually BOINC initializes just one WU at a time, because it begins a new WU only after one of the previous ones has finished, and all WUs have different run times, so start times naturally drift apart. So no problems there either.
But under some conditions, like:
- a mere coincidence of initialization times of several WUs
- after a long project outage (BOINC is "hungry" for jobs from this project and tries to start as many WUs from it as it can immediately after the downloads finish)
- after a project reset by the user
- at the initial addition of the project to BOINC
multiple WUs try to initialize at the same time and the problem arises.

So it is not a frequent situation. And it is seen only on HDDs, while SSDs do not have any problems at all.
So a hard limit on the CPU cores allowed to run R@H would be like being beheaded instead of getting a shave :) Though of course it would solve the shaving problem too, lol
10) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5989)
Posted 14 Jan 2016 by Mad_Max
Post:
With the 3.70 release I got a strange bug - it looks like BOINC goes into an infinite loop while extracting new RALPH WUs - the disk works very hard non-stop (it is a classic HDD on this PC, not an SSD) but the WUs cannot load.
After ~15 min of non-stop disk activity I opened Process Explorer to see what was happening. I noticed this:
The boinc.exe process was constantly reading and writing something to/from disk.
The minirosetta_beta processes (there were 3 of them on a 4-core CPU; the 4th task was from WCG and worked fine) start, run for some time with low CPU utilization, and exit. Then they start again, work for some time (about 1 min), and exit. And so on.
BOINC Manager (GUI) was not responsive during this time - it was running, but not updating any status and not responding to any commands like pausing or aborting WUs (it looks like it lost its connection to boinc.exe, or boinc.exe was not responding).

So I killed all BOINC and rosetta processes via Process Explorer and restarted BOINC.
The same thing happened again - BOINC got stuck while trying to start 3 new RALPH WUs in parallel, stressing the HDD hard.
This time I tried another thing - instead of killing the minirosetta_beta processes, I suspended (paused at the OS level) 2 of the 3 processes. 1 kept running and after some time began to work normally: it utilized a full CPU core and stopped hammering the HDD, and BOINC Manager began to work normally too.
Later I resumed the 2nd minirosetta_beta - it started OK - and then the 3rd minirosetta_beta - all OK too.

I did a full restart - everything worked fine after the restart too. And I cannot reproduce this bug anymore.

It looks to me like the latest BOINC (I use 7.6.9 x86) or RALPH has some sort of timeout for loading (starting) new WUs, and it is set to a relatively low value (about 1 min). If several rosetta WUs try to start at the same time, they slow a classic HDD down so much (because each WU extracts a few thousand small files from the archive) that they run past this timeout and get restarted in a loop.
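
A toy model of this guess, just to show why suspending 2 of the 3 processes would break the loop. Everything here is hypothetical: the timeout itself is only my suspicion, the 40 s / 60 s numbers are invented, and the disk is assumed to share its throughput evenly between the WUs (real seek thrashing would be worse).

```python
def extraction_time(single_wu_seconds: float, parallel_wus: int) -> float:
    """Time for each WU to finish unpacking when the HDD is shared evenly."""
    return single_wu_seconds * parallel_wus


def stuck_in_restart_loop(single_wu_seconds: float,
                          parallel_wus: int,
                          timeout_seconds: float) -> bool:
    """True if no WU can finish unpacking before the assumed startup timeout."""
    return extraction_time(single_wu_seconds, parallel_wus) > timeout_seconds


# Invented numbers: one WU alone needs 40 s, assumed timeout is 60 s.
for n in (1, 2, 3):
    state = "restart loop" if stuck_in_restart_loop(40, n, 60) else "starts OK"
    print(f"{n} WU(s) unpacking at once: {extraction_time(40, n):.0f} s -> {state}")
```

In this model, once one WU gets the disk to itself it finishes under the limit, and the others can then be resumed one by one, which matches what I saw.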
11) Message boards : RALPH@home bug list : Rosetta mini beta and/or android 3.61-3.83 (Message 5988)
Posted 14 Jan 2016 by Mad_Max
Post:
It looks like the quota was set for 20 on our server and I just updated it to 50. I'm not sure where the 6 is coming from. Thanks for helping with this.


AFAIK the quota in BOINC became dynamic long ago: if a computer reports tasks with errors, its quota is cut. It goes back to the default (set in the server settings) once the computer begins reporting successful tasks.
So it can be anything from 1 per CPU core per day up to the default value. The 6 is probably 1 WU/day on a 6-core/thread CPU.
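
The arithmetic behind that guess, as a small sketch. This only illustrates the description above and is not the real BOINC server logic; in particular, treating the server default as a per-core value is my assumption.

```python
def host_daily_quota(per_core_quota: int, cores: int, default_per_core: int) -> int:
    """Daily task limit for a host: per-core quota (clamped to 1..default) times cores."""
    per_core = max(1, min(per_core_quota, default_per_core))
    return per_core * cores


# Host whose quota was cut to the minimum after errors, 6 cores/threads.
print(host_daily_quota(1, cores=6, default_per_core=50))

# Same host after it has recovered to the server default of 50.
print(host_daily_quota(50, cores=6, default_per_core=50))
```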
12) Message boards : RALPH@home bug list : minirosetta beta 3.50-3.52 apps (Message 5766)
Posted 22 Jul 2014 by Mad_Max
Post:
WUs with names
Tc794_hybrid...
Tc804_summ_hybrid...

have problems with checkpointing (usually it does not work at all - progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs are granted only 20 Cr and have "InternalDecoyCount: 0 (GZ)" in the logs.
AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.


I have not tested the check-pointing so I can't test if the work units restart from zero like you are seeing.

Conan

To roughly check whether checkpoints work, you don't necessarily have to restart.
You can click "Properties" on any currently executing task and check the line "CPU time at last checkpoint".
If checkpoint saving is working normally, it will show the time (counted from the start of the task) of the last saved checkpoint. If checkpoints do not work, there will be "-- --" on this line.
Or the time may lag a few hours behind the total CPU time: if the client managed to finish at least one model completely and record it to disk, that also counts as a checkpoint, and usually this part works normally.
13) Message boards : RALPH@home bug list : minirosetta beta 3.50-3.52 apps (Message 5765)
Posted 22 Jul 2014 by Mad_Max
Post:
It is NOT "faulty boinc-client s/w." OR " statistics is 'lost' on a shutdown/restart"

It is faulty rosetta software (or a particular batch of WUs) - it simply does not write checkpoints at all (I already checked this - intermediate checkpoints in the latest WU batches are not working; it seems only full/finished models are saved to disk). So at each restart ALL the work done before the restart goes in the trash, and work starts from scratch after the restart.
So the BOINC software does the right thing when it resets statistics and credits to zero too, because: 0 useful work done = 0 Cr

Also, some WUs run so long (possibly the algorithm is stuck in an infinite loop, or it is just a very difficult model to calculate) that even after 5-7 hours of running (without interruptions/restarts) on a modern CPU they cannot finish the very first model (decoy).

In these situations the claimed credit (calculated by the BOINC client) will be normal. But the granted credit is actually 0, because R@H uses this formula for granted credit:
the average claimed credit per decoy (collected and calculated from previous users who reported WUs from the same batch) multiplied by the number of decoys reported in the particular task of a specific user.
So if the decoy count = 0, the granted credit = 0 too.

But later the programmers added an exception: if a user reports a task with decoy count = 0, the general formula (which gives 0 Cr) is not used, and the WU is instead rewarded with a fixed 20 Cr as a sort of consolation/booby prize.
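
The rule as I understand it, written out as a sketch. This is my reading of the observed behaviour, not the project's actual server code; the function and constant names are made up.

```python
CONSOLATION_CREDIT = 20.0  # fixed reward for a task that reports 0 decoys


def granted_credit(avg_claimed_per_decoy: float, decoys_reported: int) -> float:
    """Granted credit = batch average claimed credit per decoy x decoys reported.

    A task with no decoys gets the fixed consolation credit instead of 0.
    """
    if decoys_reported == 0:
        return CONSOLATION_CREDIT
    return avg_claimed_per_decoy * decoys_reported


# Illustrative batch average of 15 Cr per decoy.
print(granted_credit(15.0, 4))  # task that finished 4 decoys
print(granted_credit(15.0, 0))  # task that ran for hours but finished none
```

Note that CPU time does not appear in the formula at all, which is why a 5-6 hour task with "InternalDecoyCount: 0" ends up with the same 20 Cr as one that failed after a few minutes.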
14) Message boards : RALPH@home bug list : minirosetta beta 3.50-3.52 apps (Message 5754)
Posted 20 Jul 2014 by Mad_Max
Post:
WUs with names
Tc794_hybrid...
Tc804_summ_hybrid...

have problems with checkpointing (usually it does not work at all - progress resets to 0% on restart). They also usually run much longer than the target time (my target time is set to 2 hours, but they usually run 5-6 hours).
Also, some of these WUs are granted only 20 Cr and have "InternalDecoyCount: 0 (GZ)" in the logs.
AFAIK this means that no useful work was done and 5-6 hours of CPU time were wasted on each such WU.
15) Message boards : RALPH@home bug list : Minirosetta beta 3.45 (Message 5629)
Posted 5 Feb 2013 by Mad_Max
Post:
Almost all WUs from the last 2 batches (...beta_subunit_hybridize_refine and ...beta_refine_with_xstal) fail on my computer.
And they use VERY large amounts of RAM: from 700 to 1600 MB per WU (thread).
And checkpoints do not work (progress falls back to 0.0% after a restart, even after 3-5 hours of running non-stop).
The few WUs which managed to complete "successfully" were valued at only 20 credits and have this warning in the logs:

WARNING! cannot get file size for default.out.gz: could not open file.
Output exists: default.out.gz Size: -1
InternalDecoyCount: 0 (GZ)
-----
0
-----
Stream information inconsistent.






©2024 University of Washington
http://www.bakerlab.org