MC on the IBM Cell..

nd · June 1st, 2006, 10:06 pm

I thought I'd share an interesting result. Just for grins I ported a really simple linear congruential RNG (when I say simple I mean simple) to the IBM Cell and ran a couple of tests. The results are (in my humble opinion) pretty interesting.

To generate 1e6 random numbers on a 64-bit Opteron takes .106481 secs. On the Cell it takes .00097 secs.

If you have any ideas/criticisms, etc., please let me know.

nd

ps.. code is below.

---mc cell---------
#include "../dma_sample.h"
#include <cbe_mfc.h>
#include <spu_mfcio.h>
#include <stdio.h>

/* Parenthesized so the macro expands safely; note that 2 << X equals 2^(X+1). */
#define lcg_mod(X) (2 << (X))

typedef vector unsigned int v_uint;
typedef vector float v_float;

/* Element-wise unsigned modulus (dividend % divisor) by shift-and-subtract. */
static __inline vector unsigned int mod_v(vector unsigned int dividend,
                                          vector unsigned int divisor)
{
    unsigned int done;
    vector unsigned int cnt, cnt_d;
    vector unsigned int delta, term, cmp;

    cnt_d = spu_cntlz(divisor);
    /* Zero out lanes whose divisor already exceeds the dividend. */
    divisor = spu_andc(divisor, spu_cmpgt(divisor, dividend));
    done = spu_extract(spu_gather(spu_cmpeq(divisor, 0)), 0);

    /* Loop until all four lanes are finished. */
    while (done != 0xF) {
        cnt = spu_cntlz(dividend);
        delta = spu_sub(cnt_d, cnt);
        term = spu_rl(divisor, (vector signed int) delta);
        cmp = spu_cmpgt(term, dividend);
        term = spu_rlmask(term, (vector signed int) cmp);
        dividend = spu_sub(dividend, term);
        divisor = spu_andc(divisor, spu_cmpgt(divisor, dividend));
        done = spu_extract(spu_gather(spu_cmpeq(divisor, 0)), 0);
    }
    return (dividend);
}

int main(unsigned long long speid, addr64 argp, addr64 envp)
{
    int iterations = 32000;
    int i;
    unsigned int time1;
    unsigned int time2;

    // temp vars
    v_uint tmp;
    v_uint tmp2;

    // needed for last spu_add
    v_float zeros = { 0, 0, 0, 0 };

    // lcg: four independent streams, one per vector lane
    v_float mod = { lcg_mod(15), lcg_mod(16), lcg_mod(17), lcg_mod(18) };
    v_float a = { 13821, 47485, 82669, 160461 };
    v_float b = { 0, 0, 0, 0 };
    v_float x = { 1, 1, 1, 1 };

    spu_write_decrementer(0x7fffffff);
    time1 = spu_read_decrementer();

    for (i = 0; i < iterations; i++) {
        v_float sum = spu_madd(a, x, b);

        // convert from float to uint, because mod_v (the modulus
        // func) needs unsigned ints
        tmp = spu_convtu(sum, 0);
        tmp2 = spu_convtu(mod, 0);
        sum = spu_convtf(mod_v(tmp, tmp2), 0);
        x = spu_add(sum, zeros);
    }

    time2 = spu_read_decrementer();

    // the decrementer counts down, so elapsed ticks = time1 - time2
    printf("lcg = %u\n", time1 - time2);
    return 0;
}
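
For comparison, the recurrence used in the post, x <- (a*x + b) mod m, can be sketched as a scalar routine in plain C. This is only a minimal sketch and not the code that was timed on the Opteron (that code is not shown in the thread); the names lcg_t, lcg_init and lcg_next are hypothetical. It assumes the modulus is a power of two, as in the post, so the reduction becomes a bitmask instead of a division loop. It also uses 64-bit integers for the product, since a single-precision float holds integers exactly only up to 2^24 and a product such as 160461 * x may exceed that.

```c
#include <stdint.h>

/* Hypothetical helper type: one LCG stream with a power-of-two modulus. */
typedef struct {
    uint64_t a;     /* multiplier */
    uint64_t b;     /* increment */
    uint64_t mask;  /* modulus - 1; valid only for a power-of-two modulus */
    uint64_t x;     /* current state */
} lcg_t;

/* Set up a stream with modulus 2^log2_mod (assumes 0 < log2_mod < 64). */
static lcg_t lcg_init(uint64_t a, uint64_t b, unsigned log2_mod, uint64_t seed)
{
    lcg_t g;
    g.a = a;
    g.b = b;
    g.mask = (UINT64_C(1) << log2_mod) - 1;
    g.x = seed & g.mask;
    return g;
}

/* Advance the stream: x <- (a*x + b) mod 2^k, with the mod done as a mask. */
static uint64_t lcg_next(lcg_t *g)
{
    g->x = (g->a * g->x + g->b) & g->mask;
    return g->x;
}
```

A caller would create a stream with something like lcg_init(13821, 0, 16, 1) and call lcg_next in a loop. Whether the mask trick carries over to the SPU version depends on keeping the state in integer vectors rather than floats, which is an assumption this sketch does not test.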

pomeron · June 2nd, 2006, 7:57 am

Would you mind telling us about the hardware and setup you used? Cell seems to be a nice project to play around with.

Cheers
Michael

tibbar · June 3rd, 2006, 2:15 pm

I guess when the PS3 comes out, they will make good Linux machines.

nd · June 5th, 2006, 3:26 pm

>> Would you mind telling us about the hardware and setup you used?
>> Cell seems to be a nice project to play around with.
>> Cheers
>> Michael

Michael,

We have a 2-blade Cell system (1 Cell per blade) that lives in an IBM BladeCenter. Each blade runs a copy of Fedora Core 4 with a few patches (which I think are now in Fedora Core 5). We use a slightly modified version of the gcc toolchain, so all the normal stuff (g++, objdump, etc.) works. There is a change in the way executables are built: the PPU and SPU object code are linked together during the compile.

On a side note: I think these multicore systems are here to stay. Learning how to program these things is going to be important. I don't think normal multithreading models are going to work; the arithmetic intensity is just too small.

nd