Showing posts with label Ellipse. Show all posts
Showing posts with label Ellipse. Show all posts

Friday, June 17, 2011

An example of improving performance using PyOpenCL: Further improvements to the Ellipse Fit Program

openCL objective call times

Figure 1

Problem Statement:

Can using OpenCL with Python significantly improve the performance of the Ellipse Fit problem? This problem was introduced here and improved on here using vectorization, and here using parallelization.

Discussion:

OpenCL is an open, royalty free, cross platform standard for parallel programming on heterogeneous systems. PyOpenCL is a wrapper for OpenCL which allows blending an OpenCL program into a Python script. PyOpenCL requires some setup to incorporate into a script. Any OpenCL program must be configured to work with a Python script. In general, a text representation of the OpenCL function (work task) is created. The device which will execute the program needs to be selected. The function needs to have input and output buffers defined to move data between the main program and the device which will execute the work task. Before the work task is called, the input buffers must be filled with data. The text representation of the work task must be compiled before execution. Finally, the function is called and it loads data into the output buffers when it completes.

One advantage of OpenCL is the ability to move a program among different devices for execution. One of the disadvantages is that there are multiple tuning aspects to tune which can affect performance. The best configuration for one device is usually not optimal for other devices.

Testing Setup:

To recreate these experiments, recreate the test setup here, then add the objective function from here. Then add the new function objective_openCL.py. Finally, overwrite the main.py script with the one here.

objective_openCL.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations.

'''

from numpy import *
import pyopencl as cl
from string import Template

# local imports
import objective_scalar


class Objective(objective_scalar.Objective):

    def __init__(self,parameters):
        objective_scalar.Objective.__init__(self,parameters)
   
        precision_dict = {
            'float':float32,
            'double':float64}    
            
        self.platform_name      = parameters.get('platform_name',None)
        self.platform_precision = parameters.get('platform_precision','double') 
        self.partial_sum_loc    = parameters.get('partial_sum_loc','__global')           
            
        self.numpy_prec = precision_dict[self.platform_precision]
        openCL_prec = self.platform_precision
    
        point_list = self._p    
    
        x,y  = zip(*point_list)
        self.x  = array(x,dtype=self.numpy_prec)
        self.y  = array(y,dtype=self.numpy_prec)
        self.f1 = array([0,0],copy=True,dtype=self.numpy_prec)
        self.f2 = array([0,0],copy=True,dtype=self.numpy_prec)
        self.a  = array([0],copy=True,dtype=self.numpy_prec)
        
        self.num_elements = len(x)
        
        platforms = cl.get_platforms()
        if len(platforms)>1 and not self.platform_name is None:
            # select the the first device in list
            for platform in platforms:
                if self.platform_name in platform.name:
                    # use first device for context
                    for device in platform.get_devices():
                        self.ctx = cl.Context([device])
                        self.device = device
                        break
                    break
        else:
            self.ctx = cl.create_some_context()    
            
        if 'NVIDIA' in self.platform_name:
            compiler_options = '-cl-mad-enable -cl-fast-relaxed-math'
        else:
            compiler_options = ''
        
        #self.ctx = cl.create_some_context()
        self.queue = cl.CommandQueue(self.ctx)
        
        # define the input buffers and transfer data into them
        mf = cl.mem_flags
        self.x_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.x)
        self.y_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.y)
        self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1)
        self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2)
        self.a_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)
    
        # define the output buffer
        self.partial_sum_buf = cl.Buffer(self.ctx, mf.READ_WRITE, self.x.nbytes)
        self.full_sum_buf    = cl.Buffer(self.ctx, mf.READ_ONLY, array([0],dtype=self.numpy_prec).nbytes) 

    
        # prep the container to the output
        self.partial_sum = empty(self.x.shape,dtype=self.numpy_prec)
        self.full_sum    = empty(array([0],dtype=self.numpy_prec).shape,dtype=self.numpy_prec)
    
        # define local memory if needed
        if not self.partial_sum_loc=='__global':
            self.local_partial_sum = cl.LocalMemory(self.x.nbytes)            
    
        code_pattern= Template("""
            #pragma OPENCL EXTENSION cl_khr_fp64: enable
            __kernel void partial_sum(
            __global const ${precision} *x,
            __global const ${precision} *y, 
            __global const ${precision} *f1,
            __global const ${precision} *f2,
            __global const ${precision} *a,
            ${partial_sum_loc} ${precision} *partial_sum)
            {
              int gid = get_global_id(0);
              
              ${precision} x_f1_diff = x[gid]-f1[0];
              ${precision} x_f2_diff = x[gid]-f2[0];
              ${precision} y_f1_diff = y[gid]-f1[1];
              ${precision} y_f2_diff = y[gid]-f2[1];
              
              ${precision} x_f1_diff_sq = x_f1_diff*x_f1_diff; 
              ${precision} x_f2_diff_sq = x_f2_diff*x_f2_diff;
              ${precision} y_f1_diff_sq = y_f1_diff*y_f1_diff;
              ${precision} y_f2_diff_sq = y_f2_diff*y_f2_diff;
              
              ${precision} norm_pt_to_f1 = sqrt(x_f1_diff_sq+y_f1_diff_sq);
              ${precision} norm_pt_to_f2 = sqrt(x_f2_diff_sq+y_f2_diff_sq);
        
              ${precision} t1 = norm_pt_to_f1+norm_pt_to_f2-2*a[0];
        
              partial_sum[gid]=t1*t1;
            }
            
            __kernel void full_sum(
            ${partial_sum_loc} ${precision} *partial_sum,
            __global ${precision} *full_sum)
            {
              long i;
              ${precision} sum=0;
              
              for(i=0; i<${num_elements}; i++) {
                  sum += partial_sum[i];
                  }
              full_sum[0] = sum;
            }            
            
            __kernel void full_sum2(
            __global ${precision} *partial_sum)
            {
              int gid = get_global_id(0);
              long i;
              long mid_point = (${num_elements}+1)/2;
              if (gid>mid_point) {
                  return;
              } else {
                  partial_sum[gid]+=partial_sum[gid+mid_point];
              }
            }             
        """)
        substitutions = {'precision':openCL_prec,
                         'num_elements':self.num_elements,
                         'partial_sum_loc':self.partial_sum_loc}
        prg = code_pattern.substitute(substitutions)
    
          
    
        # define the function to be executed
        self.prg = cl.Program(self.ctx, prg).build(options=compiler_options)


    def __call__(self,x):
        '''
        Calculate the objective cost in the optimization problem.
        '''
            
        a      = x[0]
        self.a  = array([x[0]],copy=False,dtype=self.numpy_prec)
        self.f1 = array([x[1],x[2]],copy=False,dtype=self.numpy_prec)
        self.f2 = array([x[3],x[4]],copy=False,dtype=self.numpy_prec)
        
        # trigger transfer of values to the GPU buffers
        # only those values which changed before this evaluation need to be updated
        mf = cl.mem_flags        
        self.f1_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f1)
        self.f2_buf = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.f2)
        self.a_buf  = cl.Buffer(self.ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=self.a)
      
        # execute the partial_sum function
        if self.partial_sum_loc == '__global':
            self.prg.partial_sum(self.queue, self.x.shape, None,
                        self.x_buf,self.y_buf,self.f1_buf,self.f2_buf,self.a_buf,self.partial_sum_buf)
            # trigger the copy of the results into the partial_sum container after completion
            cl.enqueue_read_buffer(self.queue, self.partial_sum_buf, self.partial_sum).wait()
            # execute the full_sum function
            self.prg.full_sum(self.queue, (1,), None,
                            self.partial_sum_buf,self.full_sum_buf)
        else:
            self.prg.partial_sum(self.queue, self.x.shape, (8,),
                        self.x_buf,self.y_buf,self.f1_buf,self.f2_buf,
                        self.a_buf,self.local_partial_sum)
            self.prg.full_sum(self.queue, (1,), None,
                        self.local_partial_sum,self.full_sum_buf)
                        
    
        # trigger the copy of the results into the partial_sum container after completion
        cl.enqueue_read_buffer(self.queue, self.full_sum_buf, self.full_sum).wait()

        # compare sums
        #print (sum(self.full_sum)-sum(self.partial_sum))
        #print (sum(self.full_sum),sum(self.partial_sum))

        # finish up the calculations 
        point_list = self._p
        n = float(len(point_list))
        _lambda =0.1
        #full_sum = sum(self.partial_sum)
        total = (self.full_sum/float(n) + 
            _lambda*self._ahat_max*self._sigma*exp((a/self._ahat_max)**4))
        
        return total


    def _del_(self):
        pass


if __name__=='__main__':
    import time
    from random import uniform,normalvariate,seed
    # local modules
    import ellipse
    
    ####################################################################

    # setup test conditions
    num_reps = 1   
    num_pts = 1e6
    precision = 'float'
    seed(1234567890)      # set the random generator to get repeatable results
    
    # setup the reference ellipse
    
    # define the foci locations
    foci1_ref = array([2,-1])
    foci2_ref = array([-2,1])
    # define distance from foci to ellipse circumference
    a_ref = 2.5
    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list,
                   "platform_name" : 'NVIDIA',
                   "platform_precisions" : precision}
    
    # test the function on an NVIDIA card
    t0 = time.time()
    my_objective = Objective(parameters)  
    x = my_objective.x0

    t1 = time.time()
    for i in range(0,num_reps):
        y  = my_objective(x)    
    t2 = time.time()
    
    print ''
    print 'NVIDIA initialization took %f sec' % (t1-t0)
    print 'NVIDIA execution took %f sec' % (t2-t1)
    print 'Executed %i times on %s' % (num_reps,my_objective.device.name)
    print ''
    
    # test the function on an intel CPU
    
    parameters = { "point_list" : point_list,
                   "platform_name" : 'Intel',
                   "platform_precisions" : precision}
    
    
    t0 = time.time()
    my_objective = Objective(parameters)
    x = my_objective.x0

    t1 = time.time()
    for i in range(0,num_reps):
        y  = my_objective(x)    
    t2 = time.time()    
    
    print ''
    print 'Intel initialization took %f sec' % (t1-t0)
    print 'Intel execution took %f sec' % (t2-t1)
    print 'Executed %i times on %s' % (num_reps,my_objective.device.name)
    print ''
    

 

main.py

'''
This module tests the performance of different ways of
   forming the ellipse fit objective function and
   its impact on the ellipse fit optimization
   problem.
'''

__author__ = 'Ed Tate'
__email__  = 'edtate<at>gmail-dot-com'
__website__ = 'exnumerus.blogspot.com'
__license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/'''


import time
from random import uniform, normalvariate, seed
from numpy import *
import pylab as plt
from openopt import NLP
import gc

# local modules
import ellipse
import objective_scalar
import objective_vectorized
import objective_vectorized_parallel
import objective_openCL

####################################################################

# setup test conditions
num_reps = 2

seed(1234567890)      # set the random generator to get repeatable results

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# define distance from foci to ellipse circumference
a_ref = 2.5

######################################################
# setup the OpenCL Objective classes for CPU and GPU verions of the 
#     OpenCL version of the objective function

class objective_openCL_GPU_Objective(objective_openCL.Objective):
    def __init__(self,parameters):
        parameters["platform_name"]="NVIDIA"
        objective_openCL.Objective.__init__(self,parameters)
        
class objective_openCL_CPU_Objective(objective_openCL.Objective):
    def __init__(self,parameters):
        parameters["platform_name"]="Intel"
        objective_openCL.Objective.__init__(self,parameters)

#####################################################
#
objective_functions = [objective_scalar.Objective,
                       objective_vectorized.Objective,
                       objective_vectorized_parallel.Objective,
                       objective_openCL_GPU_Objective,
                       objective_openCL_CPU_Objective]
                      
legends = ['scalar','vectorized','vectorized parallel',
           'OpenCL on NVIDIA GPU',
           'OpenCL on Intel i7 CPU']

######################################################
# measure the timing to setup and call the objective function

num_pts_list = [10,100,250,500,1000,2500,5000,10000,50000,1e5,1e6]
#num_pts_list = [10,100,250]


setup_times = []
call_times  = []
for num_pts in num_pts_list:

    seed(1234567890)      # set the random generator to get repeatable results    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list }
    
    test_setup_times = []
    test_call_times = []
    for objective_function in objective_functions:
        t0 = time.time()
        my_objective = objective_function(parameters)
        x0 = my_objective.x0
        t1 = time.time()
        for i in range(0,num_reps):
            y  = my_objective(x0)    
        t2 = time.time()
        test_setup_times.append(t1-t0)
        test_call_times.append(t2-t1)
        
        del my_objective
        
    setup_times.append(test_setup_times)
    call_times.append(test_call_times)
    print 'Evaluated %i points' % num_pts

# plot the results
plt.figure()
plt.loglog(num_pts_list,call_times,lw=2)
plt.legend(legends,loc='upper left')
plt.xlabel('Number of points')
plt.ylabel('Execution Time [sec]')
plt.title('Times for %i calls to objective function' % num_reps)
plt.grid(True)


#############################################################
# test the performance when used in an optimizer

num_pts_list = [10,100,250,500,1000,2500,5000]
#num_pts_list = [10,100,250]

setup_times = []
call_times  = []
for num_pts in num_pts_list:
    seed(1234567890)      # set the random generator to get repeatable results    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list }
    
    test_setup_times = []
    test_call_times = []
    for objective_function in objective_functions:
        t0 = time.time()
        my_objective = objective_function(parameters)
        x0 = my_objective.x0
        p_vectorized = NLP(my_objective, x0, iprint = -1)
        t1 = time.time()
        opt_result = p_vectorized.solve('ralg')
        t2 = time.time()
        test_setup_times.append(t1-t0)
        test_call_times.append(t2-t1)
        
        print '%s : %s' % (opt_result.xf,objective_function)
        
        del my_objective
        gc.collect()   # force release of resources held by the objective
        
    setup_times.append(test_setup_times)
    call_times.append(test_call_times)
    print 'Evaluated %i points' % num_pts
    print

# plot the results
plt.figure()
plt.loglog(num_pts_list,call_times,lw=2)
plt.legend(legends,loc='upper left')
plt.xlabel('Number of points')
plt.ylabel('Execution Time [sec]')
plt.title('Times to solve the ellipse fit problem')
plt.grid(True)

plt.show()

Testing results:

   Figure 1 shows the difference in time to call the objective function for the various methods of implementing the objective function. From this chart, the worst performing method is to implement the objective function as a scalar. The best performing implementation is to implement the function using OpenCL on the CPU. This variation in performance is not as stark when the objective function is used in the Ellipse Fit problem. In fact, because the ellipse fit problem measures data transfer times and computation times, the ordering of the performance changes. When just measuring the time to evaluate the objective function, the increasing order of performance is scalar, parallelized & vectorized, vectorized, OpenCL on GPU, and the best performance is OpenCL on the CPU. Furthermore, as the problem scale changed, the relative performance ratios stay relatively constant.When the ellipse fit problem is solved, the ordering changes and the relative slopes change. One reason for this is the overhead associated with the optimization routines. Another reason is the increased time which includes time to transfer data into the GPU. Another interesting result is that the execution times do not change much as the problem size increases. This is probably due to the efficiency of the computation and the efficiency of block transfers of data.

openCL fit call times

Figure 2

One other thing to note is that these different methods of computation have different resolutions in the floating point numbers. Because of these differences in the resolution, the optimization problem finds slightly different answers. The console output shows the differences in the results of the optimization. In general, for the larger problems, the difference are about one part in 10,0000.

Console Output:

Evaluated 10 points
Evaluated 100 points
Evaluated 250 points
Evaluated 500 points
Evaluated 1000 points
Evaluated 2500 points
Evaluated 5000 points
Evaluated 10000 points
Evaluated 50000 points
Evaluated 100000 points
Evaluated 1000000 points

 

[ 2.53977781 -1.9791083   1.11795369  1.90068376 -0.97398384] : <class 'objective_scalar.Objective'>
[ 2.53978396 -1.97911311  1.11797418  1.90068553 -0.97399285] : <class 'objective_vectorized.Objective'>
[ 2.53980864 -1.9791325   1.11805566  1.90069183 -0.97402764] : <class 'objective_vectorized_parallel.Objective'>
[ 2.53972748 -1.97906577  1.1177896   1.90067428 -0.9739137 ] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.53977167 -1.97910332  1.11793354  1.90068234 -0.97397521] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 10 points

 

[ 2.566147   -1.95135142  1.0549304   2.16680656 -1.10160299] : <class 'objective_scalar.Objective'>
[ 2.56614697 -1.95134842  1.05492776  2.16681334 -1.10160238] : <class 'objective_vectorized.Objective'>
[ 2.56614691 -1.95134982  1.05492914  2.16680969 -1.1016027 ] : <class 'objective_vectorized_parallel.Objective'>
[ 2.56614661 -1.95134401  1.05492394  2.16682259 -1.10160128] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.56614672 -1.95134937  1.05492891  2.16680991 -1.10160258] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 100 points

 

[ 2.56330383 -2.0896741   1.06288154  2.04214077 -1.05030978] : <class 'objective_scalar.Objective'>
[ 2.5633038  -2.08967399  1.06288158  2.04214065 -1.05031   ] : <class 'objective_vectorized.Objective'>
[ 2.56330231 -2.08967161  1.06288221  2.04213788 -1.05031417] : <class 'objective_vectorized_parallel.Objective'>
[ 2.56330174 -2.08967069  1.0628825   2.04213675 -1.05031584] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.56330119 -2.08966978  1.0628827   2.04213573 -1.05031742] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 250 points

 

[ 2.55213105 -2.06212704  1.05749405  2.02108216 -1.05369719] : <class 'objective_scalar.Objective'>
[ 2.55390449 -2.0623902   1.05823288  2.02404738 -1.05682491] : <class 'objective_vectorized.Objective'>
[ 2.54907192 -2.05769363  1.05956421  2.01575473 -1.05475975] : <class 'objective_vectorized_parallel.Objective'>
[ 2.55212491 -2.06209759  1.05747762  2.02105559 -1.05370312] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.5518505  -2.06217932  1.05803857  2.02239691 -1.05392966] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 500 points

 

[ 2.61796899 -2.14539556  1.11170214  2.12027043 -1.09253186] : <class 'objective_scalar.Objective'>
[ 2.61708437 -2.14479931  1.10835605  2.12087692 -1.090721  ] : <class 'objective_vectorized.Objective'>
[ 2.61761133 -2.14669529  1.1089428   2.12139812 -1.09022015] : <class 'objective_vectorized_parallel.Objective'>
[ 2.61753131 -2.14662822  1.10869732  2.12152618 -1.09010436] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.61763094 -2.14672541  1.10895434  2.12136288 -1.09017078] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 1000 points

 

[ 2.60471429 -2.11563461  1.06998647  2.13374471 -1.0697982 ] : <class 'objective_scalar.Objective'>
[ 2.60416974 -2.11432462  1.07083334  2.13291023 -1.07026969] : <class 'objective_vectorized.Objective'>
[ 2.60416139 -2.1136674   1.0722599   2.13226116 -1.07119697] : <class 'objective_vectorized_parallel.Objective'>
[ 2.60527276 -2.11598152  1.07174066  2.13404717 -1.07158798] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.60428855 -2.11418258  1.07163536  2.13309981 -1.07148272] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 2500 points

 

[ 2.65236805 -2.17113736  1.11832436  2.18090649 -1.10477807] : <class 'objective_scalar.Objective'>
[ 2.65236757 -2.17113597  1.11832359  2.18090679 -1.10477991] : <class 'objective_vectorized.Objective'>
[ 2.65236655 -2.1711387   1.1183202   2.18090519 -1.10477049] : <class 'objective_vectorized_parallel.Objective'>
[ 2.65124389 -2.17148832  1.11786432  2.17945815 -1.10311976] : <class '__main__.objective_openCL_GPU_Objective'>
[ 2.65124383 -2.17148872  1.11786446  2.17945781 -1.10311962] : <class '__main__.objective_openCL_CPU_Objective'>
Evaluated 5000 points

 

 

Conclusions:

  • Using PyOpenCl is significantly faster then Numpy in this problem when using the CPU or the GPU to perform the calculation. The OpenCL version of the objective function solved on the GPU was over 10 times faster than the Numpy version of the objective function. The OpenCL version solved on the CPU was almost 100 times faster then the Numpy version.
  • The OpenCL version of the objective function was able to be redirected from the the GPU to the CPU using a a single line of code.
  • Using the native CPU with OpenCL is significantly faster then using the GPU for this problem. For this problem, the performance difference is related to the fact that the computational load is roughly proportional the quantity of data which is transferred into the device. When the OpenCL program executes on the CPU, the data transfer times are very low. When the OpenCL program runs on the GPU, global memory is used, which requires a data transfer from the CPU’s memory space to the GPU memory space. This is usually relatively slow.
  • Because of the differences in the numeric representation of the floating point values, the OpenCL calculations introduce differences in the calculations compared to the scalar, the Numpy, and parallelized Numpy representations.

 

Hardware Configuration:

  • Pythonxy 2.6.6
  • PyOpenCL 0.92
  • Sony Vaio VPCF1
    • Intel i7  Q740 @ 1.73 GHz
    • NVIDIA GeForce GT 425M @ 1.12 GHz

 

References:

  • Khronos Group OpenCL Standard
  • NVIDIA OpenCL manuals
  • Intel OpenCL manuals
  • PyOpenCL Home Page
  • PyOpenCL Documents
This work is licensed under a Creative Commons Attribution By license.

Saturday, June 11, 2011

Parallelizing Function Calls: An Example Using the Ellipse Fit

Problem Statement

Can Parallel Python improve the performance of the ellipse fit algorithm? Under which conditions will Parallel Python offer performance advantages.

Discussion

Breaking an algorithm into pieces which are executed in parallel on multiple CPU’s can speed up execution time. One way to estimate the best theoretical improvement is to use Amdahl’s law. This law estimates the performance improvement by breaking an algorithm in a portion which can be parallelized and a portion which is serial in nature. This is an upper estimate of the benefits of parallelization.

In practical parallelization, there may be overheads associated with getting things running on multiple CPUs. In Python, there are several issues to consider. The first is that that C implementation of Python does not natively support true parallelization. This is associated with issues deep in the interpreter (search on Python GIL for more information). Therefore, any library that supports parallelization needs to work around the issues with the GIL. Implementations like JPython and IronPython do not suffer from these issues.

One easy to use library is Parallel Python. It allows a program to establish a set of local and remote servers which are passed  functions and all of the information for successfully call those functions. The relative ease of use is offset by the fact that when the server’s are setup, there is a time cost. Also, then passing the functions, parameters, and everything else, there is a time cost. The experiments here looked at the use of this library in the ellipse fitting problem and compared the execution time to other solutions.

Testing

To test the use of Parallel Python, the objective class previously developed was reused. An instance of this class behaves like a function by supporting the __call__ method. More importantly, the __init__ method and the __del__ method are overridden to create and destroy the Parallel Python job servers. To use these scripts, install them in the same directory as the scripts from here.

The first script implements the objective function, parallelizes it using Parallel Python. All of the calculations are performed using vectorized math. The objective function is implemented in the Objective class. In this class, when __init__ is called, the parameters used by the objective function are stored with the class instance, the number of parallel processes for execution are determined, and the Parallel Python jobs server is started. This causes a new instance of Python to be started for each process which will be used. When the __call__ method is invoke by a call to the class instance, then the calculation is broken up into pieces and dispatched to the job servers. When the objective function is no longer needed and the garbage collector invokes the __del__ method, the servers are destroyed.

objective_vectorized_parallel.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations which are
   parallelized using parallel python.

'''

from numpy import *
from numpy import linalg as LA
import objective_scalar
import pp
    
class Objective(objective_scalar.Objective):
    
    def __init__(self,parameters):
        objective_scalar.Objective.__init__(self,parameters)
        self.ncpus = parameters.get('ncpus','autodetect')
        self.job_server = pp.Server(self.ncpus)
        # because autodetect may have been used, use get_ncpus to 
        # get physical number of cpus
        self.ncpus = self.job_server.get_ncpus()
        self.job_server.set_ncpus(ncpus=self.ncpus)

    def __call__(self,x):
        '''
        Calculate the objective cost in the optimization problem using
           vectorized equations.
        '''
        
        point_list = self._p
        foci1 = array([x[1],x[2]])
        foci2 = array([x[3],x[4]])
        a     = x[0]
        n = float(len(point_list))
        _lambda = 0.1
        
        def solve_sub_problem(point_list,foci1,foci2,a):
        
            pt_diff1 = point_list - foci1
            pt_diff2 = point_list - foci2
         
            x_f1_diff = pt_diff1[:,0]
            x_f2_diff = pt_diff2[:,0]
            y_f1_diff = pt_diff1[:,1]
            y_f2_diff = pt_diff2[:,1]
         
            x_f1_diff_sq = numpy.power(x_f1_diff,2)   
            x_f2_diff_sq = numpy.power(x_f2_diff,2)   
            y_f1_diff_sq = numpy.power(y_f1_diff,2)   
            y_f2_diff_sq = numpy.power(y_f2_diff,2)
            
            norm_pt_to_f1 = numpy.power(x_f1_diff_sq+y_f1_diff_sq,0.5)
            norm_pt_to_f2 = numpy.power(x_f2_diff_sq+y_f2_diff_sq,0.5)
            
            temp = numpy.power(norm_pt_to_f1+norm_pt_to_f2-2*a,2)
            part_sum = numpy.sum(temp)
            return part_sum
    
        jobs = []
        numpts = n
        sigma    = self._sigma
        ahat_max = self._ahat_max
        inc = math.ceil(n/float(self.ncpus))
        endi = 0
        for i in range(0,self.ncpus):
            starti = endi
            endi = int(min(starti+inc,n))
            # make a copy of point list which is smaller
            #   to minimize the time in transferring to the
            #   parallel processes
            local_point_list = array(point_list)[starti:endi,:]
            jobs.append(self.job_server.submit(solve_sub_problem,
                (local_point_list,foci1,foci2,a),
                (),("numpy",)))
        total = sum([job() for job in jobs])/n    
        total += _lambda*ahat_max*sigma*exp((a/ahat_max)**4)
        
        return total
        
    def __del__(self):
        self.job_server.destroy()

if __name__=='__main__':
    import time
    from random import seed

    # local modules
    import ellipse
    
    ####################################################################

    # setup test conditions
    num_reps = 100    
    num_pts = 256
    precision = 'float'
    seed(1234567890)      # set the random generator to get repeatable results
    
    # setup the reference ellipse
    
    # define the foci locations
    foci1_ref = array([2,-1])
    foci2_ref = array([-2,1])
    # define distance from foci to ellipse circumference
    a_ref = 2.5
    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list ,
                   "ncpus"      : 'autodetect'}


    # test the function
    t0 = time.time()
    my_objective = Objective(parameters)
    x0 = my_objective.x0
    t1 = time.time()
    for i in range(0,num_reps):
        y  = my_objective(x0)    
    t2 = time.time()
    
    print ''
    print 'Initialization took %f sec' % (t1-t0)
    print 'Using %i cpus' % my_objective.ncpus
    print 'Execution took %f sec' % (t2-t1)
    print 'Executed %i times.' % (num_reps)
    print ''

    ref_objective = objective_scalar.Objective(parameters)    
    # compare x0 calculation
    print ''
    print ('Difference between x0 calculations = %f' 
            % LA.norm(array(ref_objective.x0)-array(my_objective.x0)))
    print ('Difference between objective calcs = %f' 
            % (ref_objective(x0)-my_objective(x0)))
    print ''

 

The second script measures the execution times for this new objective implementation versus the previous scalar and vectorized implementations. One important item in this script is forcing the garbage collector to run after each test. Without doing this, when new classes are created, additional Python instances are created while the old instances are left running. This could have been solved through a more clever class design. However, for this testing, the simplest solution was to force the garbage collector to handle the issue.

 

Results

The first test measured the time to call the objective function once. This showed that the scalar solution was the slowest and the simple vectorized solution was the fastest. Surprisingly, parallelizing the problem (and using vectorization) resulted in a solution which was consistently between the scalar and vectorized solution.

 

vectorized parallel objective calls

The second test embedded the parallelized version of the objective function in the ellipse fit problem. In this implementation, the results were mixed. For small problems, the scalar solution performed best. As the number of points on the ellipse increased, the vectorized version provided the best results. However, the parallelized version appeared to converge towards the vectorized version for larger problems.

vectorized parallelized fit times

 

From this test, the conclusion was that parallelization using an approach like Parallel Python is not effective. One of the reasons for this is the large amount of data transfer that needs to happen to setup the function call. In this example, the ratio of computation to data transfer is low enough that the transfer mechanisms make the benefits of parallelization, for a system with 8 cpus, not worth the effort. If the data transfer were faster or the computation times were longer, then parallelizing using Parallel Python might have offered advantages.

 

Test Conditions:

Further Testing/Development

  • Evaluate the impact of imported modules
  • Evaluate ways to share read only resources among processes efficiently

References

This work is licensed under a Creative Commons Attribution By license.

Thursday, March 10, 2011

A Python Script to Fit an Ellipse to Noisy Data

ExampleEllipse

Problem statement

Given a set of noisy data which represents noisy samples from the perimeter of an ellipse, estimate the parameters which describe the underlying ellipse.

Discussion

There are two general ways to fit an ellipse: algebraic and geometric approaches. In an algebraic approach, the parameters for an algebraic description of an ellipse are fit subject to constraints which guarantee the parameters result in an ellipse. In the geometric approach,  characteristics of the ellipse are fit.

The code snippet below uses a method described by Yu, Kulkarni & Poor. The location of the foci and the length of the line segments from the foci to a point on the perimeter of the ellipse are found through an optimization problem. Because the fitting objective is not convex and has a minimum at infinity, a penalty cost is added to prevent the foci from wandering off.

Code

'''
Script to fit an ellipse to a set of points.
- The ellipse is represented by the two foci and the length of a 
     line segment which is drawn from the foci to the 
     point where the ellipse intersects the minor axis.
    
- Fitting algorithm from Yu, Kulkarni & Poor

'''

__author__ = 'Ed Tate'
__email__  = 'edtategmail-dot-com'
__website__ = 'exnumerus.blogspot.com'
__license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/'''

####################################################
# create ellipse with random noise in points
from random import uniform,normalvariate
from math import pi, sin, cos, exp, pi, sqrt
from openopt import NLP
from numpy import *
from numpy import linalg as LA
import matplotlib.pylab as pp

def gen_ellipse_pts(a,foci1,foci2,
                    num_pts=200, angles=None,
                    x_noise = None, y_noise=None):

    '''
       Generate points for an ellipse given
          the foci, and
          the distance to the intersection of the minor axis and ellipse.
    
       Optionally, 
          the number of points can be specified,
          the angles for the points wrt to the centroid of the ellipse, and 
          a noise offset for each point in the x and y axis.
    '''
    c = (1/2.0)*LA.norm(foci1-foci2)
    b = sqrt(a**2-c**2)
    x1 = foci1[0]
    y1 = foci1[1]
    x2 = foci2[0]
    y2 = foci2[1]
    if angles is None:
        t = arange(0,2*pi,2*pi/float(num_pts))
    else:
        t = array(angles)
            
    ellipse_x = (x1+x2)/2 +(x2-x1)/(2*c)*a*cos(t) - (y2-y1)/(2*c)*b*sin(t)
    ellipse_y = (y1+y2)/2 +(y2-y1)/(2*c)*a*cos(t) + (x2-x1)/(2*c)*b*sin(t)
    try:
        # try adding noise to the ellipse points
        ellipse_x = ellipse_x + x_noise
        ellipse_y = ellipse_y + y_noise
    except TypeError:
        pass
    return (ellipse_x,ellipse_y)

####################################################################

# setup the reference ellipse

# define the foci locations
foci1_ref = array([2,-1])
foci2_ref = array([-2,1])
# pick distance from foci to ellipse
a_ref = 2.5

# generate points for reference ellipse without noise
ref_ellipse_x,ref_ellipse_y = gen_ellipse_pts(a_ref,foci1_ref,foci2_ref)

# generate list of noisy samples on the ellipse
num_samples = 1000
angles = [uniform(-pi,pi) for i in range(0,num_samples)]
sigma = 0.2
x_noise = [normalvariate(0,sigma) for t in angles]
y_noise = [normalvariate(0,sigma) for t in angles]
x_list,y_list = gen_ellipse_pts(a_ref,foci1_ref,foci2_ref,
                                angles  = angles,
                                x_noise = x_noise,
                                y_noise = y_noise)

point_list = []
for x,y in zip(x_list,y_list):
    point_list.append(array([x,y]))    

# draw the reference ellipse and the noisy samples    
pp.figure()
pp.plot(x_list,y_list,'.b', alpha=0.5)
pp.plot(ref_ellipse_x,ref_ellipse_y,'g',lw=2)
pp.plot(foci1_ref[0],foci1_ref[1],'o')
pp.plot(foci2_ref[0],foci2_ref[1],'o')

#####################################################

def initialize():
    '''
    Determine the initial value for the optimization problem.
    '''
    # find x mean
    x_mean = array(x_list).mean()
    # find y mean
    y_mean = array(y_list).mean()
    # find point farthest away from mean
    points = array(zip(x_list,y_list))
    center = array([x_mean,y_mean])
    distances = zeros((len(x_list),1))
    for i,point in enumerate(points):
        distances[i,0]=LA.norm(point-center)
    ind = where(distances==distances.max())
    max_pt = points[ind[0],:][0]
    # find point between mean and max point
    foci1 = (max_pt+center)/2.0
    # find point opposite from 
    foci2 = 2*center - max_pt
    return [distances.max(), foci1[0],foci1[1],foci2[0],foci2[1]]


def objective(x):
    '''
    Calculate the objective cost in the optimization problem.
    '''
    foci1 = array([x[1],x[2]])
    foci2 = array([x[3],x[4]])
    a     = x[0]
    n = float(len(point_list))
    _lambda =0.1
    _sigma = sigma
    sum = 0
    for point in point_list:
        sum += ((LA.norm(point-foci1,2)+LA.norm(point-foci2,2)-2*a)**2)/n
    sum += _lambda*ahat_max*_sigma*exp((a/ahat_max)**4)
    return sum

# solve the optimization problem
x0 = initialize()
ahat_max = x0[0]
print x0
p = NLP(objective, x0)
r = p.solve('ralg')
print r.xf

# get the results from the optimization problem
xf = r.xf
# unload the specific values from the result vector
foci1 = array([xf[1],xf[2]])
foci2 = array([xf[3],xf[4]])
a     = xf[0]

# reverse the order of the foci to get closest to ref foci
if LA.norm(foci1-foci1_ref)>LA.norm(foci1-foci2_ref):
    _temp = foci1
    foci1 = foci2
    foci2 = _temp

####################################################
# plot the fitted ellipse foci
pp.plot([foci1[0]],[foci1[1]],'xk')
pp.plot([foci2[0]],[foci2[1]],'xk')

# plot a line between the fitted ellipse foci and the reference foci
pp.plot([foci1[0],foci1_ref[0]],[foci1[1],foci1_ref[1]],'m-')
pp.plot([foci2[0],foci2_ref[0]],[foci2[1],foci2_ref[1]],'m-')

# plot fitted ellipse
(ellipse_x,ellipse_y) = gen_ellipse_pts(a,foci1,foci2,num_pts=1000)  
pp.plot(ellipse_x,ellipse_y,'r-',lw=3,alpha=0.5)

# scale the axes for a square display
x_max = max(x_list)
x_min = min(x_list)
y_max = max(y_list)
y_min = min(y_list)

box_max = max([x_max,y_max])
box_min = min([x_min,y_min])
pp.axis([box_min, box_max, box_min, box_max])

pp.show()


References


Testing Configuration

This work is licensed under a Creative Commons Attribution By license.