Posts by Dimitris Hatzopoulos

1) Message boards : RALPH@home bug list : Bug reports for Ralph 5.16 (Message 1633)
Posted 15 May 2006 by Dimitris Hatzopoulos
Post:
So far, BOTH Ralph 5.16 WUs (MAPRELAX_TEST_hom015_1fna__516_1_1 and MAPRELAX_TEST_hom025_1fna__516_19_0) have errored out

with "Incorrect function (0x1) - exit code 1 (0x1)"

This is on a very stable PC, which has run Rosetta flawlessly for the past 4 months (apart from 2-3 WUs which were bad).
2) Message boards : RALPH@home bug list : Bug reports for Ralph 5.09 and 5.10 (Message 1500)
Posted 5 May 2006 by Dimitris Hatzopoulos
Post:
One bug I see with 5.09 is that the "text description area" contains multiple occurrences of the same text info, so e.g. right now, running in front of me, I see the same text repeated FOUR (4) times:

"Testing jumping/strand break protocol on BOINC with 1tul_ and 7 jumps
Sampling a comprehensive list of about 200,000 topologies and strand pairings"

Perhaps it's an oversight where it outputs the text once for every model (now processing model 4)?
3) Message boards : RALPH@home bug list : OLD- Bug reports for Windows Ver - 5.00 (and higher) (Message 1101)
Posted 12 Apr 2006 by Dimitris Hatzopoulos
Post:
"HBLR 1.0 1ogw 377 28 2" WU running 4.99 crashed here on WinXPproSP2 w/ 1GB RAM. This machine has NEVER had problems with Rosetta in the past. Leave in mem=YES.

All of the last 4 WUs it ran crashed:

<core_client_version>5.2.13</core_client_version>
<message> - exit code -1073741819 (0xc0000005)
</message>
<stderr_txt>
# random seed: 3891466
LoadLibraryA( symsrv.dll ): GetLastError = 126
LoadLibraryA( srcsrv.dll ): GetLastError = 126

The PC is still running BOINC 5.2.13

I could upgrade BOINC, but I thought it'd be useful to test RALPH on current BOINC version as well, because that's what 99% of the users are using.
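
The two numbers in the exit code line of the stderr above are the same value written two ways: -1073741819 is the signed 32-bit reading of 0xc0000005, the Windows access-violation status code. A minimal sketch of the conversion (the helper name is hypothetical, not part of BOINC or Rosetta):

```python
def to_unsigned32(code: int) -> int:
    """Reinterpret a signed 32-bit exit code as an unsigned 32-bit value."""
    return code & 0xFFFFFFFF


# Exit code as reported in the stderr excerpt above.
print(hex(to_unsigned32(-1073741819)))
```

Masking with 0xFFFFFFFF is enough here because BOINC reports the code as a 32-bit integer; positive codes pass through unchanged.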
4) Message boards : RALPH@home bug list : RALPH Version News! - Version 4.85 released! (Message 967)
Posted 24 Mar 2006 by Dimitris Hatzopoulos
Post:
I also wonder about this version discrepancy.

5) Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors (Message 825)
Posted 7 Mar 2006 by Dimitris Hatzopoulos
Post:
Carlos, on second thought, you and SpareCycles are probably correct about the 256M RAM not being the reason for SIGSEGV, but on the other hand, my version of RALPH for Linux seems to be statically linked:

$ ldd rosetta_beta_4.84_i686-pc-linux-gnu
not a dynamic executable
$ file rosetta_beta_4.84_i686-pc-linux-gnu
rosetta_beta_4.84_i686-pc-linux-gnu: ELF 32-bit LSB executable, Intel 80386, version 1 (SYSV), for GNU/Linux 2.2.5, statically linked, stripped
6) Message boards : RALPH@home bug list : Report - Previously Unclassified Work Unit Errors (Message 820)
Posted 6 Mar 2006 by Dimitris Hatzopoulos
Post:
Carlos, I think the most probable explanation for the SIGSEGV is that your Linux PC has only 256MB of RAM, whereas your WinXP PC has 512MB RAM.

Rosetta needs a lot of memory relative to other apps. The WinXP PC next to me has 2 Rosetta tasks: one with a 125MByte working set, the other consuming just 45MBytes. So, if your Linux PC got the former, it'd probably crash with SIGSEGV; if it got the latter, it'd probably run it fine.

With 256MB RAM on a PC, it's a coin toss. I hope that eventually the BOINC/R@h system will become "smarter" so it can send smaller proteins to PCs with less RAM.

Do a

# free

on your Linux machine before running boinc/rosetta and after and let us know.
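
If you'd rather script the before/after comparison than read `free` by eye, here is a rough sketch. It assumes /proc/meminfo contains lines shaped like "MemFree: 12345 kB" (other lines are skipped); the function names are made up for illustration:

```python
import re
from typing import Dict


def parse_meminfo(text: str) -> Dict[str, int]:
    """Return {field: kB} for lines shaped like 'MemFree:   12345 kB'."""
    fields: Dict[str, int] = {}
    for line in text.splitlines():
        match = re.match(r"^(\w+):\s+(\d+)\s*kB\s*$", line)
        if match:
            fields[match.group(1)] = int(match.group(2))
    return fields


def read_meminfo(path: str = "/proc/meminfo") -> Dict[str, int]:
    """Read and parse the kernel's memory summary (Linux only)."""
    with open(path) as handle:
        return parse_meminfo(handle.read())
```

Call read_meminfo() once before starting boinc/rosetta and once while a WU is running, then compare the MemFree and SwapFree values.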

I crunched 6 WUs using rosetta_beta_4.92 (windows) and have NO errors

However with rosetta_beta_4.84 (Linux) I have several WUs with errors

ALL with the same error -> SIGSEGV
http://ralph.bakerlab.org/result.php?resultid=12969
http://ralph.bakerlab.org/result.php?resultid=13093
http://ralph.bakerlab.org/result.php?resultid=13267
http://ralph.bakerlab.org/result.php?resultid=13987
http://ralph.bakerlab.org/result.php?resultid=14057
http://ralph.bakerlab.org/result.php?resultid=14534

7) Message boards : Current tests : Switching between projects with applications removed from memory (Message 701)
Posted 27 Feb 2006 by Dimitris Hatzopoulos
Post:
I sort of doubt this is the case. I know one of the wu's got up to more than 60% before it crashed.


Due to the way "new" Rosetta WUs work (variable # Models during a fixed time period e.g. 8hr), you might want to focus more on the Model / Step statistic, rather than % progress.

In that regard, the WU stderr output provided isn't very helpful for remote diagnostics. In my case, I got errors similar to yours (for R@h, not RALPH) on a machine which had multiple reboots over the previous 3 days, due to power problems.
8) Message boards : Feedback : Difference F@H and R@H??? (Message 698)
Posted 27 Feb 2006 by Dimitris Hatzopoulos
Post:
F@H uses a low-level but physically accurate method to simulate the folding of a protein. This requires a huge amount of CPU time, like 10,000 WUs for a fast-folding protein. Because it's physically accurate, it gives a good idea of how the protein actually goes about folding.

It is easy to get the sequence of a protein, but hard to find its structure. With the F@H method it is possible to deduce the structure from the sequence, but only for fast-folding proteins. For most proteins of interest, the F@H method is too slow.


We don't know how accurate the F@H method is. Has F@H taken part in any CASP? Not AFAIK. For other projects we have an idea about accuracy.

Based on the info I've managed to gather on F@H, I'd say the F@H project is computing "pathways", frame by frame like a Pixar movie "Toy Story", aiming to understand the folding process itself, misfolding diseases (Alzheimer, MadCow) and perhaps create a better model of the process. I'm not quite sure about its medical relevance.

For diffs between protein projects, have a look at the protein project diffs. And please let me know if there are any facts missing.
9) Message boards : Current tests : Switching between projects with applications removed from memory (Message 649)
Posted 25 Feb 2006 by Dimitris Hatzopoulos
Post:
Removing rosetta beta 4.87 work units from memory on one of my windows machines is definitely FAILING with end state client error. This machine is a DUAL PROCESSOR P3 750 w/ 512MB ram running on Windows Server 2003.

I have now switched my configuration to keep wu's in memory and performed an update. We'll see what happens.

Curiously, I have another wu running on a Fedora box that is showing some other bizarre behavior, but I'll start a new post for this one.


I think this is the case when a slower machine (P3/750) takes too long to complete the first model and it gets pre-empted and removed from RAM / VM before even the first checkpoint is reached.

In that case you need to keep the app in RAM while pre-empted and/or increase the time between app switches from the default 60min to a higher value, e.g. 4hr in your case.
10) Message boards : Current tests : WinXP 64bit/AMD64bit-Support? (Message 648)
Posted 25 Feb 2006 by Dimitris Hatzopoulos
Post:
Hi,

I'm not a member of the project, but I don't think there will be a 64-bit version of most of the scientific projects, because all the "real work" is FLOATING POINT operations.

A much better option would be an SSE-enabled version of Rosetta, which should boost the project's overall TeraFLOPS by 3.5x - 4x, as F@H experience over the last few years has shown.
11) Message boards : Current tests : Rosetta beta 4.84 wu seems to hang on Linux (Message 594)
Posted 25 Feb 2006 by Dimitris Hatzopoulos
Post:
Hi, I've had "hangs" as you described with R@H (not with RALPH so far, but I've only run very few RALPH WUs).

Let's try to find possible common things between our setups:

1/ BOINC version
2/ RAM
3/ Linux kernel
4/ R settings (leave in mem=Y/N)

In my case

BOINC 5.2.14 (optimized) by Crunch3r
256MB RAM (only)
Linux kernel 2.4.27
R set to remain in mem while pre-empted

I was suspecting a memory issue, because it was a commonality with CarlosP's setup, but in your case you have plenty of RAM (786MB), so perhaps we can rule it out.

I wonder if it could be some race condition when the BOINC app starts Rosetta... But I've run 8 different BOINC projects on that Linux box, and only Rosetta had this problem.
12) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 536)
Posted 23 Feb 2006 by Dimitris Hatzopoulos
Post:
River, thanks for the suggestion. So far I had just been grepping the stdout.txt file for "pct_co" (I think R has recently changed the location and now stores it as WU description, so I have to look inside slots/x/stdout.txt to find the "real" filename), e.g.

$ fgrep pct_comp ~boinc/BOINC/projects/ralph.bakerlab.org/BARCODE_30_4ubpA_215_6_0_0 | tail

BOINC :: [2006-02-23 18:49:47] :: num_decoys: 65 :: number_of_output: 71 :: pct_complete: 0.907516
BOINC :: [2006-02-23 18:54:16] :: num_decoys: 66 :: number_of_output: 71 :: pct_complete: 0.916776
BOINC :: [2006-02-23 19:02:34] :: num_decoys: 67 :: number_of_output: 71 :: pct_complete: 0.933897
BOINC :: [2006-02-23 19:09:26] :: num_decoys: 68 :: number_of_output: 71 :: pct_complete: 0.948104
BOINC :: [2006-02-23 19:12:11] :: num_decoys: 69 :: number_of_output: 72 :: pct_complete: 0.953801

Edit: prior example was for RALPH/R4.84, current R has stdout in /slots dir

$ fgrep pct_comp ~boinc/BOINC/slots/0/stdout.txt
BOINC :: [2006-02-23 19:48:12] :: mode: abinitio :: nstartnm: 1 :: number_of_output: 16 :: num_decoys: 0 :: pct_complete: 0.01
BOINC :: [2006-02-23 20:05:36] :: num_decoys: 1 :: number_of_output: 27 :: pct_complete: 0.0359882
BOINC :: [2006-02-23 20:30:11] :: num_decoys: 2 :: number_of_output: 22 :: pct_complete: 0.0870625
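
The same check can be scripted instead of eyeballing fgrep output. The sketch below parses progress lines in the format shown above and reports how long ago the last one was written; a long gap would suggest a stuck WU. The function names and any "stuck" threshold you pick are my own assumptions, not anything R or BOINC provides:

```python
import re
from datetime import datetime

# Matches the timestamp, num_decoys and pct_complete fields of a progress line.
LINE_RE = re.compile(
    r"\[(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2})\]"
    r".*?num_decoys: (\d+).*?pct_complete: ([0-9.]+)"
)


def parse_progress(lines):
    """Return a list of (timestamp, num_decoys, pct_complete) tuples."""
    records = []
    for line in lines:
        match = LINE_RE.search(line)
        if match:
            stamp = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S")
            records.append((stamp, int(match.group(2)), float(match.group(3))))
    return records


def minutes_since_last_update(records, now):
    """Minutes elapsed between the last progress line and 'now'."""
    return (now - records[-1][0]).total_seconds() / 60
```

Feed it the lines of slots/0/stdout.txt and pass datetime.now() as the second argument; if the result is far larger than the usual gap between models, the task is probably hung.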
13) Message boards : RALPH@home bug list : Report "stuck at 1%" bugs here (Message 515)
Posted 23 Feb 2006 by Dimitris Hatzopoulos
Post:
Stuck at 1%
http://ralph.bakerlab.org/result.php?resultid=5967

*Computer IDLE -> load average: 0.00, 0.00, 0.00

Restarted boinc


Carlos, how exactly do you determine "stuck at 1%" under your Linux host? Do you check stdout.txt by hand?

I've had several WUs "stuck" (ps shows "SN"=sleeping,nice for R task - I got it right this time :-)) under Linux. CPU is idle, as per your example and BOINC queue will freeze until I kill R task.

Common things between your setup and mine are BOINC v5.2.14 (optimized), Linux kernel 2.4.x (x=27 in my case) and just 256MB RAM.

In those cases, just kill the R process and let BOINC restart it. Every time but one, the WU completed fine (obviously with a different seed). Oddly, I've NEVER had R get stuck under WinXP.

A month or more ago, I followed dekim/baker instructions, to run R from cmd-line by hand with same WU/seed that bombed under BOINC, in R own dir, and it finished OK.

14) Message boards : RALPH@home bug list : A excerpt of man 1 ps (Linux) (Message 498)
Posted 22 Feb 2006 by Dimitris Hatzopoulos
Post:
Carlos, thanks for pointing it out, it is in fact S=sleeping, not stopped. I may have mixed it in some of my posts, by mistake.
15) Message boards : Feedback : Credit scores (Message 497)
Posted 22 Feb 2006 by Dimitris Hatzopoulos
Post:
Snake Doctor, I don't think anyone in this thread is talking about credits with regard to RALPH. I think everyone understands that RALPH is just a test project and any such stats are meaningless.

However, regardless of what people here may think about credits, the issue of credits is very important for many "crunchers". At R's forums, you can see how often people complain about not being awarded points for errored WUs. Look at e.g. the thread about fair points at the WCG forums. I read concerns about R's stats in many DC forums.

I wouldn't start a thread about credits or participate in such thread in R's forums, as if it became a hot topic of debate, it'd just waste project human resources. So, personally I feel we are discussing a valid subject, with the R project's best interest in mind.
16) Message boards : Feedback : Credit scores (Message 493)
Posted 22 Feb 2006 by Dimitris Hatzopoulos
Post:
Redundancy just for credits (i.e. not when there's a legitimate science reason) is a really wasteful choice, as it hurts overall TFLOPS throughput, cutting it to 1/2 or 1/5th or even less of the "raw" TFLOPS donated to the project.
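
To make the arithmetic concrete, here is a back-of-the-envelope sketch. It assumes every WU is sent to exactly `quorum` hosts and each copy costs the full CPU time; the 100 TFLOPS figure is purely illustrative, not a measurement from any project:

```python
def effective_tflops(raw_tflops: float, quorum: int) -> float:
    """Useful throughput when each WU is computed 'quorum' times."""
    if quorum < 1:
        raise ValueError("quorum must be at least 1")
    return raw_tflops / quorum


# Illustrative raw donation of 100 TFLOPS under different quorums.
for quorum in (1, 2, 5):
    print(quorum, effective_tflops(100.0, quorum))
```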

I can't imagine a real-world situation where one would take a similar WASTEFUL decision. Would an employer ever give the very same task to 5-6 different PAID employees? Not a chance (unless it's the public sector, trying to "create jobs").

Unless we're talking life-and-death (e.g. searching for a cure of a pandemic where every second counts) kind of results, using short deadlines (1 week) allows a project to re-issue any lost / errored WUs quickly enough.

BOINC projects should look at the credit model of Folding@Home, enjoying >200 TFLOPS sustained real processing power (not nominal speed), much higher than any DC project (SETI is very close in what is offered to it as "raw" TFLOPS).

I think many people would reconsider their CPU time donations to some projects, if they understood the amount of CPU time that they pay for, which is wasted.
17) Message boards : Feedback : Screensaver needs to be more fluid please (Message 446)
Posted 22 Feb 2006 by Dimitris Hatzopoulos
Post:
And btw, I just accidentally found out today (I never played with the graphics part of R) that by holding down the left-mouse-button and moving the mouse around you can rotate the 3D "native" protein and with the right-mouse-button held down and moving, you can zoom-in-out.

Nice!

Edit: Reading the other folks' comments more carefully, it seems everyone knew this feature all along <blush>.
18) Message boards : Feedback : Credit scores (Message 442)
Posted 22 Feb 2006 by Dimitris Hatzopoulos
Post:
Or Rosetta could go the CPDN route and do non-standard credits.

Since the new app shows how many models you do, a credit value could be assigned to each model.


I 100% agree. I'd want Rosetta to implement its own system. Btw, model times vary greatly between WU, sometimes my PC might do ~100 models in 8hrs and other times ~50 models (in 8hrs).

One BIG objection I have to the current BOINC stats is that the "standard" BOINC client benchmarks at about half the speed under Linux as the very same hardware does under WinXP (whereas science apps run at the same speed under Linux and WinXP), resulting in ridiculously low credits for Linux.

So, after a month, I switched to an "optimized" BOINC client for Linux, which still doesn't fix things for projects that give out the lower of the 2 credits demanded (Predictor). So I get better credits from R and lower than fair from every other project, so overall it should even out.

Also, personally I really DON'T WANT to see projects send the SAME WU to 5-7 different PCs, presumably "for redundancy of the science". E.g. both R@H and F@H are fine, in that they don't WASTE donor resources.

My CPU / bandwidth may be free to THEM, but it's NOT FREE to me. So I feel the projects have to respect donated resources and not waste or abuse them.

E.g. a few days ago I set "no-more-work" for WCG/HPF because, a month after I pointed it out to them, they still haven't fixed their BOINC side to compress files (e.g. it uploads results as a 2.5MB text file, easily compressed to 500KB via gzip and even more with bzip/7zip/etc). And because they run a quorum of 5-7 afaik.

In some cases, where I'd really like to contribute to science, I'm willing to be extra patient with e.g. bandwidth requirements of Rosetta (until the recent fix).

I also hope the Rosetta team is considering an SSE-enabled science app, esp. since they said they mostly use single-precision floats. SSE alone would boost TeraFLOPS by 4x (in reality 3.5x), so R would jump to 70 TFLOPS from current levels of 20-22TF.
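
The estimate is simple multiplication; a quick sketch, taking the 3.5x realistic speedup and the 20-22 TFLOPS range from the paragraph above as given (the function name is only for illustration):

```python
def projected_tflops(current_tflops: float, speedup: float) -> float:
    """Project throughput after applying a uniform speedup factor."""
    return current_tflops * speedup


# Low and high ends of the current range, with the realistic 3.5x factor.
for current in (20.0, 22.0):
    print(current, "->", projected_tflops(current, 3.5))
```

So 70 TFLOPS corresponds to the low end of the range, and the high end would land at 77.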

Obviously, as Angus said, the majority of CPU power only cares about credits anyway, so I know I'm the minority.
19) Message boards : Current tests : Switching between projects with applications removed from memory (Message 407)
Posted 21 Feb 2006 by Dimitris Hatzopoulos
Post:
Can we now test the newest "production" R@H (Win/v4.82 and Linux/v4.81) executables with "Leave preempted app in mem"=NO ?

Otherwise, we still can't test RALPH (for this particular bug) and still run Rosetta@Home on same PC, as suggested per RALPH FAQ

20) Message boards : RALPH@home bug list : i can't connect (Message 371)
Posted 20 Feb 2006 by Dimitris Hatzopoulos
Post:
What I do, is to look at RALPH's (and Rosetta's) homepage http://ralph.bakerlab.org/, at the statistics displayed at top-right. Look at the "Queued: x" line, right now x=0, so there are simply no WUs to crunch.

I hope people are setting their "connect every" to very low intervals, so they don't drain the queue (i.e. get 10-20 WUs at a time) when they connect and there's work. Because the idea is to test as many PC and settings configurations as possible.





©2024 University of Washington
http://www.bakerlab.org