Tuesday, 28 July 2015

GSoC week 9; Special multiplication and s reduction

In the last week I tried a couple of interesting things, which well speed up the test.

I add config functions, which try to reduce $s$ value. If $s$ have prime power factor $q^k$ for which  $s/q^k > n^{1/2}$ then we can choose biggest $q$ and set $s = s/q^k$. This is repeated until it is no longer possible (_jacobi_config_reduce_s()). It gave a nice speed boost for numbers $n$ which are much smaller than the computed number $s$. This optimization enabled us to slightly outperform PARI/GP and "smooth" time of computation (no spikes).
up to 300 digits
Next following article, I have implemented the better $s$ reduction function (_jacobi_config_reduce_s2()) which is the approximation of knapsack problem solution. We choose weight of each $q$ equal to $w(q) = \sum\limits_{p | q-1}{\phi^2(p^j)}$, where $\phi$ denotes Euler's function.  Then we find a prime power factor $q^k$ of $s$ with max $w(q)$ and set $s = s/q^k$. By doing this as long as possible, we will get a good $s$ value. It also gave a small speed boost. For better explanations and links you can look at chapter 4 of "Primality Testing and Jacobi Sums". As I see it is not implemented in PARI/GP aprcl.

Another thing which gave interesting results - I remove $t = 166320 = 2^4*3^3*5*7*11$ and always use $t = 720720 = 2^4*3^2*5*7*11*13$. If we use it together with $s$ reduction function we have ~0.5 second speed up for 300 digits prime (5.6 sec instead of 6.1). These is because we do not have a special sqr/mul functions for p = 13 or p = 27, but for p = 27 we have poly of length 17 and for p = 13 length = 12. After using $s$ reduction function in both cases we have $s$ of similar size, but use faster multiplication/squaring. It works for all numbers which are used t = 166320. Also in this place there is space for optimization, because now in final divison step we need to do 720720 multiplications and divisions, and it takes ~15% of all time.

The next thing to look at in the future is special functions sqr/mul for all $p$ values. I look at p = 13 and write result for squaring and I don't think that's a good idea to work with it manually. It look like some type of cse task and I think I can find some algorithm for this later...

I also add special multiplication functions for p = 3, 4, 5, 7, 8, 9, 11 and 16. It gave just a couple of percent increase, but almost for free :)

So, now test works 0.115 sec for 100 and 5.6 sec for 300 digits numbers (I think I have never posted my processor type - Core2Duo T6670 2.2GHz processor, so I do not have avx instructuions set, which may be can be used in gmp).

This week I want to fix all warnings I have, do some refactoring and add more documentation. I also want to add something similar to profile folder with primes of different sizes (from https://primes.utm.edu/lists/small/ , for example) and outputs. I have it for me, but it was written not very carefully and not committed.

Tuesday, 21 July 2015

GSoC week 8; More squaring functions

This week not much has changed, but I tried a couple of things that gave nothing.

After using Montgomery form for general multiplication and squaring I try to use it in special sqr functions (like unity_zp_sqr7). For this I add some modular functions for fmpz in Montgomery form (add, sub, mul: inlined in aprcl.h file) and add unity_zp_mont_sqr7() function. Using it instead of unity_zp_sqr7 and simple form works slower. Maybe, I implement something not very accurately.

I also add unity_zp_sqr11() function and it gave a good speed up (~10% for 300 digits number; we can test 300 digits $n$ for 7.3 sec). Functions like this give a better performance, but them require a lot time to code and debug. So far I have used the Supplement to Implementation of a New Primality Test, but the better way is to think about generalization.
from 50 to 100 digits

from 50 to 300 digits


On the other hand if I add similar functions for $p^k = 13, 27, 32$ this will be enough for the numbers up to ~1000 of digits. I also need to have a better look at $R$ and $s$ selection.

I think that I need to rewrite unity_zp_sqr and unity_zp_mul functions to reduce modulo $n$ after cyclotomic poly reduction, so we have less number of fmpz_mod function calls. 
Only 4 weeks left until soft GSoC deadline, so I have to start thinking about future code integration (as first step I want to create new branch and remove files related with Gauss sum test).

Valgrind callgrind call graphs for 100 and 300 digits numbers.

Tuesday, 14 July 2015

GSoC week 7; Montgomery modular stuff

This week I add squaring for p = 16 and it give good speed up for 100 digit prime (actually, for ~100 digit number used my implementations of squaring). After this we can prove 100 digit number primality for 0.13 sec and 300 digit number for 8.3 sec.

The next thing I was working on is Montgomery modular multiplication and reduction.
Firstly, we need to convert our unity_zp structure (recall that it is just a fmpz_mod_poly, mod = n) into Montgomery. For this we need to select $r > n$ such that $r = 2^k$ and compute $f * r$ there $f$ - unity_zp; so, element $x$ in Montgomery form can be represented as $x = a * r$, there $a$ from simple form. Because we do not want to use fmpz_mod or division functions we can store our unity_zp_mont data into fmpz_poly.  It was implemented in unity_zp_mont_to() and init functions.

Secondly, we need to implement basic operations. Addition and subtraction are the same as ordinary modular addition and subtraction: $x + y = a*r + b*r = (a + b)r = z$ there $x$, $y$ and $z$ in Montgomery form. Multiplications works a little harder: $x * y = (a*r) * (b*r) = (a*b)*r^2$, so we need to divide by $r$. To do it fast exists special REDC function, which uses the fact that $r = 2^k$ and we can use binary shift operations instead of division.

Using this form in exponentiation function gives small speed up (8 sec instead of 8.3 for 300 digits number). I think, if I implement REDC function in lower level (GMP have some implementations, but them not have public interface) this will give even a small increase in speed. Moreover, I not use REDC for small p = 4, 5, 7, 9 and 16, so Montgomery stuff not used for small numbers.

Sunday, 5 July 2015

GSoC week 6; Midterm and more optimizations

This week I refactor powering algorithms and add sliding window exponentiation implementation. It gave speed boost in couple of percent.

The next thing I do is add special squaring functions for small p values: powers of 2 (for 4 and 8); 3; 5; 7 (for details see the "Implementation of a New Primality Test" supplement). Selection of function going on unity_zp_sqr_inplace() and we need to have allocated memory for fmpz_t array (t).

It gives good speed boost (~10-15%), and for "small" numbers (near 100 digits) we catch up the PARI/GP implementation and I'm confident that we will soon become faster. For bigger values PARI/GP is still faster. Below the graphs with timings up to 100 and 300 digits:
50-100 digits

50-300 digits


I also looked at $(n - 1)$ test to replace computation in $\mathbb{Z}[\zeta_r]/(n)$ by $\mathbb{Z}/n\mathbb{Z}$, but I don't see that it will give a good speed gain.
Now I want to try reduce s value for fixed R (as in (4.) of "Primality Testing and Jacobi Sums"). After we compute s we can try to remove some factors without which we still have $s^2 > n$.

In total, over the past six weeks has been implemented APR-CL algorithm with good performance. Almost all of code is tested and documented. So now I need to finish special cases of multiplication and squaring and then try to optimize in general. Also we need to look closely at different initial values of R because the values from PARI/GP impair the performance (look at 70 digit numbers in first plot, for example).