Saturday, June 11, 2011

Parallelizing Function Calls: An Example Using the Ellipse Fit

Problem Statement

Can Parallel Python improve the performance of the ellipse fit algorithm? Under what conditions does Parallel Python offer a performance advantage?

Discussion

Breaking an algorithm into pieces which are executed in parallel on multiple CPUs can speed up execution time. One way to estimate the best theoretical improvement is Amdahl's law, which divides an algorithm into a portion that can be parallelized and a portion that is inherently serial, and from those fractions gives an upper bound on the benefit of parallelization.
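As a quick illustration, Amdahl's law is easy to evaluate directly. The sketch below (assuming, purely for the example, that 90% of the algorithm can be parallelized) shows how quickly the theoretical speedup saturates as CPUs are added:

def amdahl_speedup(p, n):
    # p = fraction of the work that can be parallelized
    # n = number of CPUs
    return 1.0 / ((1.0 - p) + p / n)

for n in [1, 2, 4, 8, 16]:
    print 'cpus = %2i  best-case speedup = %.2f' % (n, amdahl_speedup(0.9, n))
# with 90% of the work parallelized, 16 CPUs give only about a 6.4x speedup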

In practice, parallelization carries overhead in getting things running on multiple CPUs, and in Python there are several additional issues to consider. The first is that the C implementation of Python (CPython) does not natively support true parallel execution. This is due to issues deep in the interpreter (search on "Python GIL", the Global Interpreter Lock, for more information), so any library that supports parallelization needs to work around the GIL. Implementations like Jython and IronPython do not suffer from this limitation.
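The effect of the GIL is easy to demonstrate: time a CPU-bound function run on several threads versus several processes. The sketch below is illustrative only (the workload is arbitrary); on CPython the threaded version shows little or no speedup, while the multiprocessing version, which uses separate interpreter processes much as Parallel Python does, scales with the number of CPUs.

import time
import threading
import multiprocessing

def burn(n):
    # CPU-bound busy work; in CPython this holds the GIL the whole time
    total = 0
    for i in xrange(n):
        total += i * i
    return total

if __name__ == '__main__':
    n = 5000000

    # four threads: on CPython these take turns holding the GIL,
    # so the total time is roughly the same as running serially
    t0 = time.time()
    threads = [threading.Thread(target=burn, args=(n,)) for i in range(4)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    print 'threads:   %.2f sec' % (time.time() - t0)

    # four processes: each has its own interpreter and its own GIL
    t0 = time.time()
    pool = multiprocessing.Pool(4)
    pool.map(burn, [n] * 4)
    pool.close()
    print 'processes: %.2f sec' % (time.time() - t0)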

One easy-to-use library is Parallel Python. It allows a program to establish a set of local and remote job servers, which are passed functions along with all of the information needed to call those functions successfully. The relative ease of use is offset by two time costs: one when the servers are set up, and one each time the functions, parameters, and everything else are transferred to them. The experiments here applied this library to the ellipse fitting problem and compared the execution time to the other solutions.
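Before applying it to the ellipse fit, it is worth seeing the library at its simplest. The pattern below, with a placeholder workload, is the one used throughout this post: create a Server, submit a function together with its arguments and the names of the modules it needs, then call the returned job object to collect the result.

import pp

def sum_of_squares(n):
    # placeholder workload; any function whose source pp can extract will do
    return sum(i * i for i in range(n))

job_server = pp.Server()    # one worker process per detected CPU
print 'started %i workers' % job_server.get_ncpus()

# submit(function, args, depfuncs, modules) transfers the function and
# its arguments to a worker process
jobs = [job_server.submit(sum_of_squares, (n,), (), ())
        for n in (1000, 2000, 3000)]

# calling a job object blocks until its result is available
print [job() for job in jobs]

job_server.destroy()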

Testing

To test the use of Parallel Python, the objective class developed previously was reused. An instance of this class behaves like a function by supporting the __call__ method. More importantly, the __init__ and __del__ methods are overridden to create and destroy the Parallel Python job servers. To use these scripts, install them in the same directory as the scripts from here.
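In outline, the lifecycle looks like this (a minimal skeleton; the full implementation follows below):

import pp

class Objective(object):

    def __init__(self, parameters):
        # store the problem data and start the worker processes once,
        # so that repeated calls do not pay the startup cost again
        self._p = parameters['point_list']
        self.job_server = pp.Server(parameters.get('ncpus', 'autodetect'))

    def __call__(self, x):
        # split the calculation into pieces, submit them to the job
        # server, and combine the partial results (see the listing below)
        pass

    def __del__(self):
        # tear the worker processes down along with the instance
        self.job_server.destroy()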

The first script implements the objective function and parallelizes it using Parallel Python. All of the calculations are performed using vectorized math. The objective function is implemented in the Objective class. When __init__ is called, the parameters used by the objective function are stored with the class instance, the number of parallel processes is determined, and the Parallel Python job server is started; this launches a new instance of Python for each process that will be used. When the __call__ method is invoked by a call to the class instance, the calculation is broken into pieces and dispatched to the job server. When the objective function is no longer needed and the garbage collector invokes the __del__ method, the servers are destroyed.

objective_vectorized_parallel.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations which are
   parallelized using parallel python.

'''

import math

from numpy import *
from numpy import linalg as LA
import objective_scalar
import pp
    
class Objective(objective_scalar.Objective):
    
    def __init__(self,parameters):
        objective_scalar.Objective.__init__(self,parameters)
        self.ncpus = parameters.get('ncpus','autodetect')
        self.job_server = pp.Server(self.ncpus)
        # because autodetect may have been used, use get_ncpus to 
        # get physical number of cpus
        self.ncpus = self.job_server.get_ncpus()
        self.job_server.set_ncpus(ncpus=self.ncpus)

    def __call__(self,x):
        '''
        Calculate the objective cost in the optimization problem using
           vectorized equations.
        '''
        
        point_list = self._p
        foci1 = array([x[1],x[2]])
        foci2 = array([x[3],x[4]])
        a     = x[0]
        n = float(len(point_list))
        _lambda = 0.1
        
        # this inner function is shipped to the worker processes; it must
        # refer to numpy by its module name because pp imports the modules
        # named in the submit() call (here "numpy") into each worker
        def solve_sub_problem(point_list,foci1,foci2,a):
        
            pt_diff1 = point_list - foci1
            pt_diff2 = point_list - foci2
         
            x_f1_diff = pt_diff1[:,0]
            x_f2_diff = pt_diff2[:,0]
            y_f1_diff = pt_diff1[:,1]
            y_f2_diff = pt_diff2[:,1]
         
            x_f1_diff_sq = numpy.power(x_f1_diff,2)   
            x_f2_diff_sq = numpy.power(x_f2_diff,2)   
            y_f1_diff_sq = numpy.power(y_f1_diff,2)   
            y_f2_diff_sq = numpy.power(y_f2_diff,2)
            
            norm_pt_to_f1 = numpy.power(x_f1_diff_sq+y_f1_diff_sq,0.5)
            norm_pt_to_f2 = numpy.power(x_f2_diff_sq+y_f2_diff_sq,0.5)
            
            temp = numpy.power(norm_pt_to_f1+norm_pt_to_f2-2*a,2)
            part_sum = numpy.sum(temp)
            return part_sum
    
        jobs = []
        sigma    = self._sigma
        ahat_max = self._ahat_max
        inc = math.ceil(n/float(self.ncpus))
        endi = 0
        for i in range(0,self.ncpus):
            starti = endi
            endi = int(min(starti+inc,n))
            # make a copy of point list which is smaller
            #   to minimize the time in transferring to the
            #   parallel processes
            local_point_list = array(point_list)[starti:endi,:]
            jobs.append(self.job_server.submit(solve_sub_problem,
                (local_point_list,foci1,foci2,a),
                (),("numpy",)))
        # calling a job object blocks until its partial sum is available
        total = sum([job() for job in jobs])/n
        # penalty term that grows rapidly once a exceeds ahat_max
        total += _lambda*ahat_max*sigma*exp((a/ahat_max)**4)
        
        return total
        
    def __del__(self):
        self.job_server.destroy()

if __name__=='__main__':
    import time
    from random import seed

    # local modules
    import ellipse
    
    ####################################################################

    # setup test conditions
    num_reps = 100    
    num_pts = 256
    precision = 'float'
    seed(1234567890)      # set the random generator to get repeatable results
    
    # setup the reference ellipse
    
    # define the foci locations
    foci1_ref = array([2,-1])
    foci2_ref = array([-2,1])
    # define distance from foci to ellipse circumference
    a_ref = 2.5
    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list ,
                   "ncpus"      : 'autodetect'}


    # test the function
    t0 = time.time()
    my_objective = Objective(parameters)
    x0 = my_objective.x0
    t1 = time.time()
    for i in range(0,num_reps):
        y  = my_objective(x0)    
    t2 = time.time()
    
    print ''
    print 'Initialization took %f sec' % (t1-t0)
    print 'Using %i cpus' % my_objective.ncpus
    print 'Execution took %f sec' % (t2-t1)
    print 'Executed %i times.' % (num_reps)
    print ''

    ref_objective = objective_scalar.Objective(parameters)    
    # compare x0 calculation
    print ''
    print ('Difference between x0 calculations = %f' 
            % LA.norm(array(ref_objective.x0)-array(my_objective.x0)))
    print ('Difference between objective calcs = %f' 
            % (ref_objective(x0)-my_objective(x0)))
    print ''

The second script measures the execution times of this new objective implementation against the previous scalar and vectorized implementations. One important detail in this script is forcing the garbage collector to run after each test. Without this, each newly created objective class starts additional Python instances while the old ones are left running. This could have been avoided with a more careful class design, but for this testing the simplest solution was to force the garbage collector to handle it.
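That script is not reproduced here, but its timing harness was structured roughly as follows (the objective_vectorized module name is an assumption, chosen to match the naming convention above):

import gc
import time

import objective_scalar
import objective_vectorized            # assumed module name
import objective_vectorized_parallel

def time_objective(module, parameters, num_reps):
    my_objective = module.Objective(parameters)
    x0 = my_objective.x0
    t0 = time.time()
    for i in range(0, num_reps):
        my_objective(x0)
    elapsed = time.time() - t0
    del my_objective
    # force collection so __del__ runs now; otherwise the parallel
    # objective's worker processes linger while the next test starts
    # its own, skewing the measurements
    gc.collect()
    return elapsed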

Results

The first test measured the time for a single call to the objective function. The scalar implementation was the slowest and the simple vectorized implementation was the fastest. Surprisingly, the parallelized (and vectorized) implementation was consistently between the scalar and vectorized solutions.

[Figure: vectorized parallel objective call times]

The second test embedded the parallelized version of the objective function in the ellipse fit problem. Here the results were mixed. For small problems, the scalar solution performed best. As the number of points on the ellipse increased, the vectorized version provided the best results. The parallelized version's times, however, appeared to approach the vectorized version's for larger problems.
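As a sketch of what this embedding looks like, the callable objective instance drops directly into a scipy optimizer. The use of scipy.optimize.fmin below is an assumption for illustration, not necessarily the driver used in the actual fit script:

from numpy import array
from scipy.optimize import fmin

import ellipse
import objective_vectorized_parallel

# generate test data exactly as in the __main__ block above
point_list = ellipse.generate_point_list(256, 2.5,
                                         array([2, -1]), array([-2, 1]))
parameters = {'point_list': point_list, 'ncpus': 'autodetect'}

my_objective = objective_vectorized_parallel.Objective(parameters)

# the instance is callable, so it can be passed to the optimizer directly;
# x0 is the initial guess computed by the Objective base class
x_fit = fmin(my_objective, my_objective.x0)
print x_fit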

[Figure: vectorized parallelized fit times]

From these tests, the conclusion was that parallelization using an approach like Parallel Python is not effective for this problem. One reason is the large amount of data transfer needed to set up each function call. In this example, the ratio of computation to data transfer is low enough that, on a system with 8 CPUs, the transfer mechanisms outweigh the benefits of parallelization. If the data transfer were faster, or the computation times longer, parallelizing with Parallel Python might offer an advantage.
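This trade-off can be made concrete by adding a fixed per-call overhead term to the earlier Amdahl estimate. The numbers below are purely illustrative, not measurements from these tests:

# effective speedup when each parallel call also pays a fixed
# transfer/dispatch overhead t_overhead (in seconds)
def effective_speedup(t_serial, ncpus, t_overhead):
    return t_serial / (t_serial / ncpus + t_overhead)

# a 10 ms computation with 8 ms of per-call overhead barely breaks even...
print '%.2fx' % effective_speedup(0.010, 8, 0.008)   # ~1.08x
# ...while a 1 s computation with the same overhead scales almost fully
print '%.2fx' % effective_speedup(1.0, 8, 0.008)     # ~7.52x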


Further Testing/Development

  • Evaluate the impact of imported modules
  • Evaluate ways to share read-only resources among processes efficiently

This work is licensed under a Creative Commons Attribution By license.
