Fast and elegant way to sum primes in a gigantic range

The problem is taken from Project Euler (Problem 10), which asks for the sum of all the prime numbers below 2 million.

Traditional Approach

Project Euler has many problems like this that look ridiculously easy in theory but are practically impossible to solve with the old-school brute-force approach.

Even after applying some well-known techniques to shrink the problem space, the computation still takes a long time (too long for me to stick around and wait for it to finish). I tried the sieve of Eratosthenes, and the performance isn't satisfying, to say the least.

solve() ->
    sum_all_primes_under(2000000).

sum_all_primes_under(To) ->
    sieve(lists:seq(2, To), To).

sieve([], _) -> 0;
sieve([H|Rest] = List, Range) ->
    %% Max is the largest number still in the list
    [Max|_] = lists:reverse(List),
    case H > math:sqrt(Max) of
        true ->
            %% every remaining number is prime, so just sum them
            lists:sum(List);
        false ->
            io:format("~p ~n", [H]),
            H + sieve(lists:filter(fun(X) -> X rem H =/= 0 end, Rest), Range)
    end.

One thing to note is that when executing the above program, only one core of the CPU is effectively utilized.

[Screenshot: CPU usage while running the sequential version, only one core busy]

Concurrency Oriented Programming

In the concurrency oriented programming (COP, a term coined by Erlang’s inventor Joe Armstrong) model, you think in terms of separate entities and the responsibility each of them carries. Think of them as individual human beings. They don’t have shared brains; one cannot influence another by simply modifying his or her brain tissue (under normal circumstances, anyway). As human beings, our brains are disconnected from each other, *but* we share our knowledge through communication. We can change other people’s memories by sending them messages, and vice versa. This is essentially what COP is: the entities are processes, and each has its own memory that no other process can modify. Processes communicate with each other by sending and receiving messages. This model is particularly good for scaling: because the entities share nothing, it’s easy to add more of them without worrying about whether they get along.
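
As a small illustration of the model (not part of the original solution; the counter example is mine), here is a process whose only state is an integer. The only way another process can read or change that state is by sending it a message:

-module(counter).
-export([start/0, loop/1]).

%% Spawn a process that owns a single integer as its private state.
start() -> spawn(?MODULE, loop, [0]).

%% The state lives only in the argument N; other processes can
%% influence it solely by sending messages.
loop(N) ->
    receive
        {add, M} -> loop(N + M);
        {get, From} -> From ! {self(), N}, loop(N)
    end.

Sending {add, 5} to the process changes its state; sending {get, self()} and waiting for the reply reads it back.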

Back to our problem. With the traditional approach, a single process iterates through the 2 million numbers, determines whether each one is prime, and adds it to the tally if it is. Think of it as one person doing all the work there is.

With the new model, here are the actors in our system:

Worker: A worker is responsible for gathering the prime numbers in a given (sufficiently small) range, adding them up, and reporting the sum to the tallier.

Tallier: The tallier waits for the results from the workers and sums them up.

Manager (Solver): The manager is responsible for dividing up the task, hiring workers to finish it, and telling the tallier which workers have been given a task so that the tallier knows which workers to wait for.

Worker code:

worker() ->
    receive
        {TallierPid, L} ->
            Sum = lists:sum(lists:filter(fun(X)->mylib:is_prime(X) end, L)),
            TallierPid ! {self(), Sum}
    end.

The worker code is quite simple. After it's created (hired by the manager), it waits for the manager to send over {TallierPid, L}, with L being the list of integers to work with. It then filters the list down to the prime numbers, calls lists:sum/1 on the result, which gives the sum of the primes in L, and sends that sum to the tallier.
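
mylib:is_prime/1 itself isn't listed in this post; as discussed in the comments below, it simply does trial division up to sqrt(N). A minimal sketch of what it might look like (this exact implementation is my assumption, not the author's code):

%% Sketch of a trial-division primality test in the spirit of mylib:is_prime/1.
is_prime(N) when N < 2 -> false;
is_prime(2) -> true;
is_prime(N) when N rem 2 =:= 0 -> false;
is_prime(N) ->
    Limit = trunc(math:sqrt(N)),
    not lists:any(fun(D) -> N rem D =:= 0 end, lists:seq(3, Limit, 2)).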

This doesn't look much different from what we would write in the traditional way. True, but the key difference here is that we're not going to ask one worker to solve the entire problem. Instead, the manager divides the task into pieces small enough that it takes almost no time for a single worker to finish his or her share.

Manager (Solver) code:

solve() ->
    TallierPid = spawn(euler, tallier, [self(), [], 0, false]),
    solver(TallierPid, 2, 2000000),
    %% worker assignment has finished
    TallierPid ! { self(), true },

    receive
        {TallierPid, Result} ->
            io:format("Final tally: ~p ~n", [Result])
    end.
...

This is only half of the manager code. First it spawns a tallier. We’ll look at the tallier code later.
solver(TallierPid, 2, 2000000) is the worker assignment part, which is shown directly after this section.

After the worker assignment, the manager simply informs the tallier that assignment is done, then sits there waiting for the tallier to report the final result.

...
solver(TallierPid, From, To) ->
    case (To - From) of
        SmallRange when SmallRange =< 1000 ->
            List = lists:seq(From, To),
            WorkerPid = spawn(fun worker/0),
            %% register the worker with tallier
            TallierPid ! { self(), WorkerPid },
            %% send the problem for the worker to work out
            WorkerPid ! { TallierPid, List };
        _ ->
            Mid = mylib:floor((To + From) div 2),
            solver(TallierPid, From, Mid),
            solver(TallierPid, Mid+1, To)
    end.

This code defines the problem domain, splits the work, hires workers, and assigns the split-up work to them. It's a classic divide-and-conquer structure; however, we have to be careful not to make the range too small, as the number of processes may (or in this case, will) exceed the default system limit. In my tests, my computer can sum the primes in a list of 1000 elements in a heartbeat, so I chose 1000 as the upper bound on the size of the chunk each worker works on.
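
If you're curious about that limit on your own machine, the running VM will tell you, and it can be raised with the +P emulator flag when starting erl. A quick check from the shell:

%% How many processes this node allows, and how many are alive right now.
MaxProcs = erlang:system_info(process_limit),
Alive = erlang:system_info(process_count),
io:format("~p of ~p processes in use~n", [Alive, MaxProcs]).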

Finally, we have the tallier.

Tallier code:

tallier(SolverPid, Workers, X, AssignFinished) ->

    receive
        {SolverPid, true} ->
            %% This message is from the solver
            %% It tells the tallier that
            %% worker assignment has finished
            tallier(SolverPid, Workers, X, true);
        {SolverPid, WorkerPid} ->
            %% This message is from the solver
            %% Register WorkerPid
            tallier(SolverPid, [WorkerPid | Workers], X, AssignFinished);
        {WorkerPid, Sum} ->
            Tally = X + Sum,
            %% Remove worker from Workers list
            NewWorkerList = Workers -- [WorkerPid],
            case AssignFinished of
                true ->
                    case NewWorkerList of
                        [] ->
                            SolverPid ! { self(), Tally };
                        _ -> tallier(SolverPid, NewWorkerList, Tally, AssignFinished)
                    end;
                _ -> tallier(SolverPid, NewWorkerList, Tally, AssignFinished)
            end
    end.

The tallier keeps track of 4 “variables”:
1) The solver’s PID – used to recognize messages from the solver and to report the final tally back to it.
2) A list of the workers’ PIDs whose results have not been reported yet.
3) X – the running tally so far.
4) AssignFinished – a flag indicating that all workers have been assigned.

The tallier has to handle three types of messages:
1) A message from the manager saying that the job has been divided up and all the workers have been dispatched. From then on, the tallier knows that when the Workers list becomes empty, the computation is finished.
2) A message from the manager registering a worker with the tallier. The tallier adds the worker’s PID to the Workers list and waits for that worker to send back its result.
3) A message from a worker carrying the worker’s PID and its result. The tallier removes that PID from the waiting pool and adds the result to the tally. When the waiting pool is empty and AssignFinished is true, the job is done, so the tallier reports the final tally to the manager (solver).

There we go!  Running the program on my work computer (a quad-core) gives me the correct result in less than 5 seconds, and it takes about 8 seconds on my laptop (a Core 2 Duo). None of this is even remotely imaginable with the traditional approach.
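
For reference, assuming the code above lives in a module named euler (as the spawn call suggests), a run can be timed from the Erlang shell with timer:tc/3, which returns {Microseconds, Result}:

1> c(euler).
2> timer:tc(euler, solve, []).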

Also, as you may have guessed, all cores on the CPU are fully utilized.

[Screenshot: CPU usage while running the concurrent version, all cores busy]
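
The speedup relies on the VM running one scheduler thread per core (the default for SMP builds), which is what lets the workers spread across all the cores. You can check how many schedulers your node has from the shell:

%% Number of scheduler threads currently online (normally one per core).
1> erlang:system_info(schedulers_online).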

Programming in Erlang is fun!

10 Comments

  1. A very nice example of map-reduce, thanks for sharing!

  2. Nice solution. It is weird how slow the sieve in Erlang is. My colleague has a Java sieve using a byte array and bit shifting that takes 0.5 seconds to calculate problem 10. I like the fact that you highlight the message-passing/actor simplicity in Erlang. I have a (quite behind at the moment) blog of solutions to Project Euler in Erlang. My problem 10 seems to use both CPUs since I upgraded to R13B. It takes under 3 seconds on a dual-core laptop, which I’m pretty proud of… but it doesn’t demonstrate Erlang’s COP advantages like yours does.

    %%
    %% test for primality
    %%
    is_prime2(N) when N =:= 2 -> true;
    is_prime2(N) when N rem 2 =:= 0 -> false;
    is_prime2(N) ->
        Limit = round(math:sqrt(N) + 1),
        is_prime_tester(N, Limit, 3).

    is_prime_tester(N, Limit, CandidateFactor) when CandidateFactor < Limit ->
        case N rem CandidateFactor =:= 0 of
            true -> false;
            false -> is_prime_tester(N, Limit, CandidateFactor + 2)
        end;
    is_prime_tester(_N, _Limit, _CandidateFactor) ->
        true.

    prob10_alt(N) ->
        Primes = [A || A <- lists:seq(3, N, 2), is_prime2(A)],
        lists:sum([2|Primes]).

    Thanks for sharing your code. I look forward to more posts on Erlang.

  3. This all hinges on the efficacy of the function my_lib:is_prime/1 of course, could you list that too?

    With a decent prime check, just running lists:filter over the 2,000,000 numbers works in under 3 seconds, so chances are your multiprocess code is actually slower than the alternative.

    😀

    • jchiu1106 / June 3, 2009

      Hi Russell,

      My is_prime method is quite simple: it basically tests every number from 2 up to sqrt(N).

      Regarding the sieve of Eratosthenes, my colleague pointed out that the complexity of my implementation is actually n*sqrt(n) rather than n*log(n), because lists in Erlang are linked lists. I think using an array and bit tricks could speed it up considerably.

      Thanks for taking the time to write these comments, and for the encouragement!

  4. Nathan Kidd / June 9, 2009

    (Not having taken the time to grok erlang) I’m trying to reconcile
    “the computation still takes a long time … only one core of the CPU is effectively utilized”
    and
    “result in … 8 seconds … all cores on the CPU are fully utilized”

    Unless by “long time” you mean < 16 seconds (since there's some message-passing overhead), I don't see how dividing the job up to fully use e.g. 2 cores gains you anything better than O(n) processing power (i.e. the time split in half). Did I miss something?

    • jchiu1106 / June 9, 2009

      The algorithm divides the problem set in half on every recursion; disregarding the size of the final problem set each worker solves (1000 in my case), the depth of the recursion is still O(log(n)). Asymptotically this is no different from any other divide-and-conquer. But with a sequential algorithm the work eventually has to be done by the process that created it, whereas in this model a worker process is spawned and starts executing as soon as the problem-set size drops below the threshold, so the calculation happens in different processes at the same time as the splitting. That, I suspect, is the main reason for the speed boost.

      Regarding the overhead of process creation and message passing: Erlang processes are very lightweight and inter-process communication is extremely efficient, because they are not OS-level processes but processes managed by the Erlang virtual machine. I'm not an Erlang expert, so my characterization here may not be accurate.

  5. very very thank you

  6. Johan / October 6, 2009

    Well, looking at your sieve/2 I don’t know what you’re doing. Take a look at it again and try to fix it.

    Then your second solution is not very good, since it checks the primality of a number N by testing whether it is divisible by any number from 2 to sqrt(N). But if we have already checked that it is not divisible by 2, do we really need to check whether it is divisible by 4? The idea behind the sieve implementation is that we gradually learn about more primes, filter out the non-primes, and only need the primes we have found so far to check the remaining numbers (if we filter with 2 we filter out 4, so the next filter is with 3, then 5 (not 4), then 7 (not 6), etc.).

    How to take advantage of this and still run in parallel is tricky. Or rather, there is a simple parallelization, but it involves a lot of message passing. It’s quite fun to build, though.

  7. Best Core processor, I love that. Thanks for sharing.

  8. I solved this task on my Core i5 in a single thread in less than ~2 seconds. Checking if a number is prime against only the odd numbers in the range 2..sqrt(n) works fine for 2 million. When it comes to tens of millions, this approach gets slow.

