In the last week I tried a couple of interesting things, which well speed up the test.
Next following article, I have implemented the better $s$ reduction function (_jacobi_config_reduce_s2()) which is the approximation of knapsack problem solution. We choose weight of each $q$ equal to $w(q) = \sum\limits_{p | q-1}{\phi^2(p^j)}$, where $\phi$ denotes Euler's function. Then we find a prime power factor $q^k$ of $s$ with max $w(q)$ and set $s = s/q^k$. By doing this as long as possible, we will get a good $s$ value. It also gave a small speed boost. For better explanations and links you can look at chapter 4 of "Primality Testing and Jacobi Sums". As I see it is not implemented in PARI/GP aprcl.
I add config functions, which try to reduce $s$ value. If $s$ have prime power factor $q^k$ for which $s/q^k > n^{1/2}$ then we can choose biggest $q$ and set $s = s/q^k$. This is repeated until it is no longer possible (_jacobi_config_reduce_s()). It gave a nice speed boost for numbers $n$ which are much smaller than the computed number $s$. This optimization enabled us to slightly outperform PARI/GP and "smooth" time of computation (no spikes).
up to 300 digits |
Another thing which gave interesting results - I remove $t = 166320 = 2^4*3^3*5*7*11$ and always use $t = 720720 = 2^4*3^2*5*7*11*13$. If we use it together with $s$ reduction function we have ~0.5 second speed up for 300 digits prime (5.6 sec instead of 6.1). These is because we do not have a special sqr/mul functions for p = 13 or p = 27, but for p = 27 we have poly of length 17 and for p = 13 length = 12. After using $s$ reduction function in both cases we have $s$ of similar size, but use faster multiplication/squaring. It works for all numbers which are used t = 166320. Also in this place there is space for optimization, because now in final divison step we need to do 720720 multiplications and divisions, and it takes ~15% of all time.
The next thing to look at in the future is special functions sqr/mul for all $p$ values. I look at p = 13 and write result for squaring and I don't think that's a good idea to work with it manually. It look like some type of cse task and I think I can find some algorithm for this later...
I also add special multiplication functions for p = 3, 4, 5, 7, 8, 9, 11 and 16. It gave just a couple of percent increase, but almost for free :)
So, now test works 0.115 sec for 100 and 5.6 sec for 300 digits numbers (I think I have never posted my processor type - Core2Duo T6670 2.2GHz processor, so I do not have avx instructuions set, which may be can be used in gmp).
This week I want to fix all warnings I have, do some refactoring and add more documentation. I also want to add something similar to profile folder with primes of different sizes (from https://primes.utm.edu/lists/small/ , for example) and outputs. I have it for me, but it was written not very carefully and not committed.