
Friday, February 24, 2012

Lesson Learned: Managing Large Numbers of Plots, with an Example in Python

[Example plot image: 28056200-5ef7-11e1-8741-08002700a056.png, tagged along its right edge with its data hash and UUID]

Problem Statement:

A project generates hundreds, thousands, or more graphs over its life. These graphs are copied and pasted into e-mails, PowerPoint slides, etc., and the plots become divorced from the documents they were originally distributed with. Invariably, at some point in the project, a plot is brought back with the question: what assumptions were used to generate this graph? With only the graph available, this question can be difficult or impossible to answer.

To complicate matters, the plots are generated using legacy code, and modifying all of the existing code base would be a substantial endeavor.

How can this situation be improved?

Discussion:

There are two problems here. First, a given graph is not traceable to its origin. This can be remedied in several ways. If the source data is well controlled and can be described using a short phrase, then adding that phrase somewhere on the chart is helpful. If the source data is constantly changing or requires too much information to describe with a short phrase, then something else is needed. A hash of the input data can help identify and verify the source data set used to generate the graph. A universally unique ID (UUID) can be used to give a graph a unique name. If the source data, assumptions, etc. are stored using that same UUID, then when a graph is brought back for review, all of the necessary parts can be found.
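
As a minimal sketch of the record-keeping half of this idea (the function name and the JSON sidecar convention below are hypothetical, invented for illustration; they are not part of the plotting code later in this post), the assumptions can be written to a file named by the same UUID:

import hashlib
import json
import uuid

def register_dataset(x, y, assumptions):
    # hash the data, mint a UUID, and save the metadata beside the plot
    m = hashlib.md5()
    m.update(str(zip(x, y)))
    this_uuid = str(uuid.uuid1())
    metadata = {'uuid': this_uuid,
                'data_hash': m.hexdigest(),
                'assumptions': assumptions}
    f = open(this_uuid + '.json', 'w')
    json.dump(metadata, f, indent=2)
    f.close()
    return this_uuid

plot_uuid = register_dataset([1, 2, 3], [4, 5, 6],
                             {'source': 'test rig A', 'units': 'm/s'})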

The second problem is handling legacy code. There are at least three choices.

  • The first is probably the easiest: a function to add the hash and UUID could be created and inserted at the appropriate location in each of the major pieces of plotting code. This is problematic because the scripts contain several interfaces and actions that could make this work poorly. Also, every plotting routine would need to be modified.
  • A second choice is to wrap each plotting routine in a function, then pass that function and its data to a wrapper function which adds the hash and UUID as the last thing done by the plotting routine (see the sketch after this list). If the plotting functions already exist, then this can be done without changing any of the plotting code.
  • A third choice is to create a decorator which wraps plotting routines, adding the hash and UUID. This has the same issues as using a wrapper call and requires changes to the plotting routines' source. However, the changes consist only of an import statement and application of the decorator at the correct location.
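
The second choice might look like the following hedged sketch (the function names here are hypothetical; the rest of this post uses the decorator instead):

from matplotlib.pylab import plot, grid, figtext
import hashlib
import uuid

def tagged_plot(plot_fn, x, y):
    # call the unmodified legacy plotting routine first
    plot_fn(x, y)
    # then tag the figure with a data hash and a UUID
    m = hashlib.md5()
    m.update(str(zip(x, y)))
    this_uuid = str(uuid.uuid1())
    figtext(1, 0.5, 'hash=%s,UUID=%s' % (m.hexdigest(), this_uuid),
            rotation='vertical', horizontalalignment='right',
            verticalalignment='center', size='x-small')
    return this_uuid

def legacy_plot(x, y):
    plot(x, y, 'o')
    grid(True)

plot_uuid = tagged_plot(legacy_plot, [1, 2, 3], [4, 5, 6])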

For this problem, the decorator solution is used. The code that follows implements a decorator that creates a hash of the plot data and a UUID, which are added to the right side of the plot. This way, no matter where the plot goes, there is a high likelihood that its pedigree can be preserved.

'''
Script to demonstrate the use of decorators to add a unique identifier to
   a plot. The identifier includes a hash of the input data, to help see if
   versions of a plot really have different data, and a UUID to uniquely
   identify this plot independent of when or where it was generated.
'''

__author__  = 'Ed Tate'
__email__   = 'edtate<AT>gmail-dot-com'
__website__ = 'exnumerus.blogspot.com'
__license__ = 'Creative Commons Attribute By - http://creativecommons.org/licenses/by/3.0/us/'

from matplotlib.pylab import *
import random
import hashlib   # replaces the deprecated md5 module
import uuid

def identifiable_plot(fn):
    def newfun(*args,**kwargs):
        # do before decorated function call
        fn(*args,**kwargs)
        # do after decorated function call
        # create the tag string from a hash of the data and a
        #    universally unique ID
        x = args[0]
        y = args[1]
        xy = zip(x,y)
        m = hashlib.md5()
        m.update(str(xy))
        this_uuid = str(uuid.uuid1())
        this_tag = 'hash=' + m.hexdigest() + ',' + 'UUID=' + this_uuid
        # write the tag to the figure for future reference
        figtext(1, 0.5, this_tag, rotation='vertical',
                horizontalalignment='right',
                verticalalignment='center',
                size = 'x-small',
                )
        return this_uuid

    return newfun

###############################
    
@identifiable_plot
def my_plot(x,y):
    plot(x,y,'o')
    grid(True)
    
###############################
    
x = [random.random() for i in range(100)]
y = [random.random() for i in range(100)]
    
plot_uuid = my_plot(x,y)
savefig(plot_uuid+'.png')

show()

 


Test Configuration:
  • Windows 7
  • PythonXY 2.7.2.1

This work is licensed under a Creative Commons Attribution By license.

Using HTML to View Large Sets of Plots - An Example in Python



This example doesn't work inside Blogger because of its limitations. However, if you run the example yourself, you will be able to select graphs in the generated HTML page.

Problem Statement

You have a program which generates lots of similar plots that end users would like to compare and explore. The end users may not be able to install any code. You can't set up a web server to navigate the data set. You cannot install any new programs on their Windows desktops. How do you provide a solution?

Discussion

You can assume that any modern computer has at least a copy of Firefox, Safari, or Internet Explorer. Since these browsers support JavaScript (except under the most restrictive security settings), you can build a very lightweight data viewer using a few simple methods. The most important design decision when generating the plots is to name them so they are easy to recreate from selections an end user might make.

Example

The following snippet of Python code generates 9 graphs that plot random numbers on two axes using different colors and markers. There are three choices of color and three choices of marker. After generating the plots and saving them, the script creates an HTML file which simplifies navigation of the images. A user can open the HTML page and select a graph by changing the form selections at the top of the page.

There are a couple of key concepts that help make this work:
  • The plot file names can be created from selections using JavaScript. In this example there are three colors and three different markers, and every plot file name is formed by concatenating the color and marker descriptions.
  • When a user changes their choice of color or marker, a JavaScript function rebuilds the plot file name and causes the browser to reload the image by changing the img source.
  • The Python script uses templates to set up the bulk of the HTML page, then substitutes in the specific options for the user after the plots have been generated.

import pylab as plt
from random import random
from string import Template

colors  = {'Red_Plot':'r',
           'Blue_Plot':'b',
           'Green_Plot':'g',
           }
markers = {'Circle_Plot':'o',
           'Square_Plot':'s',
           'Diamond_Plot':'d',
           } 

plt.figure()
for c_key in colors.keys():
    for m_key in markers.keys():
        plt.clf()
        plot_name = c_key + ',' + m_key + '.png'
        x = [random() for i in range(0,100)]
        y = [random() for i in range(0,100)]
        color = colors[c_key]
        marker = markers[m_key]
        plt.plot(x,y,color+marker,markersize=15)
        plt.savefig(plot_name)
        
HTML_template = Template('''<head>
   <script language="JavaScript"><!--
      function sel_plot() {
         // only do this if the browser supports images
         if(document.images) {
            // get plot color name
            var e=document.getElementById("color_name");
            var c_name = e.options[e.selectedIndex].text;
            // get plot marker name
            var e=document.getElementById("marker_name");
            var m_name = e.options[e.selectedIndex].text;
            // build the filename from these selections
            var plot_filename = c_name + "," + m_name + ".png";
            // cause the correct plot to be loaded
            document["plot"].src = plot_filename;
            }
         }
      // select the plot to display initially after loading document
      window.onload = function() { sel_plot() };
      // silence errors
      window.onerror = null;
   </script>
</head>
<body>
   <center>
      <form name="Plot Select Form" id="plot_select_form">
         <select id="color_name" size="3"
            onchange="sel_plot()">
            $color_options
         </select>
         <select id="marker_name" size="3"
            onchange="sel_plot()">
            $marker_options
         </select>
      </form>
      <img name="plot" src="dummy.png" height="200"
         width="500">
   </center>
</body>
''')
        

# build the option strings
def build_options(opt_dict):
    s = ''
    for i,key in enumerate(opt_dict.keys()):
        if i==0:    # mark the first option as selected by default
            s += '<option selected>'
        else:
            s += '<option>'
        s += key
        s += '</option>\n'
    return s    

color_options = build_options(colors)
marker_options = build_options(markers)

# build the HTML page        
HTML = HTML_template.substitute(color_options=color_options,
                     marker_options=marker_options)
    

# write the HTML page
f = open('example.html','wb')
f.write(HTML)
f.close()

Test Configuration

  • PythonXY 2.7.2.1
  • IE 9


This work is licensed under a Creative Commons Attribution By license.

Saturday, June 11, 2011

Parallelizing Function Calls: An Example Using the Ellipse Fit

Problem Statement

Can Parallel Python improve the performance of the ellipse fit algorithm? Under which conditions will Parallel Python offer performance advantages?

Discussion

Breaking an algorithm into pieces which execute in parallel on multiple CPUs can speed up execution time. One way to estimate the best theoretical improvement is Amdahl's law, which estimates the speedup by splitting an algorithm into a portion which can be parallelized and a portion which is serial in nature. This is an upper bound on the benefits of parallelization.
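
As a quick illustration of Amdahl's law (the 90% parallel fraction below is an assumed value for illustration, not a measurement from the ellipse fit), the speedup for a parallel fraction P on N CPUs is 1/((1-P) + P/N):

def amdahl_speedup(parallel_fraction, num_cpus):
    # upper bound on speedup: the serial part never gets faster
    return 1.0 / ((1.0 - parallel_fraction)
                  + parallel_fraction / float(num_cpus))

for n in [1, 2, 4, 8]:
    print 'N=%i cpus -> speedup <= %.2f' % (n, amdahl_speedup(0.9, n))
# even with 90% of the work parallelized, 8 cpus give at most ~4.7x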

In practical parallelization, there are overheads associated with getting things running on multiple CPUs. In Python, there are several issues to consider. The first is that the C implementation of Python does not natively support true parallelization; this is due to issues deep in the interpreter (search on Python GIL for more information). Therefore, any library that supports parallelization needs to work around the GIL. Implementations like Jython and IronPython do not suffer from this issue.

One easy-to-use library is Parallel Python. It allows a program to establish a set of local and remote servers, which are passed functions and all of the information needed to successfully call those functions. The relative ease of use is offset by a time cost when the servers are set up, and another when the functions, parameters, and everything else are passed. The experiments here looked at the use of this library in the ellipse fitting problem and compared the execution time to other solutions.
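
A minimal Parallel Python round trip looks roughly like the sketch below (a simplified example, not the ellipse-fit code); timing the two marked sections separately exposes the server start-up and job-dispatch overheads discussed above:

import time
import pp

def work(n):
    total = 0
    for i in xrange(n):
        total += i * i
    return total

t0 = time.time()
job_server = pp.Server()          # starts one worker process per cpu
t1 = time.time()
job = job_server.submit(work, (1000000,))  # ships the function and its arguments
result = job()                    # blocks until the worker returns
t2 = time.time()

print 'server startup:  %f sec' % (t1 - t0)
print 'dispatch + call: %f sec' % (t2 - t1)
job_server.destroy()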

Testing

To test the use of Parallel Python, the objective class previously developed was reused. An instance of this class behaves like a function by supporting the __call__ method. More importantly, the __init__ and __del__ methods are overridden to create and destroy the Parallel Python job servers. To use these scripts, install them in the same directory as the scripts from here.

The first script implements the objective function and parallelizes it using Parallel Python. All of the calculations are performed using vectorized math. The objective function is implemented in the Objective class. In this class, when __init__ is called, the parameters used by the objective function are stored with the class instance, the number of parallel processes is determined, and the Parallel Python job server is started. This causes a new instance of Python to be started for each process which will be used. When the __call__ method is invoked by a call to the class instance, the calculation is broken up into pieces and dispatched to the job servers. When the objective function is no longer needed and the garbage collector invokes the __del__ method, the servers are destroyed.

objective_vectorized_parallel.py

'''
This module contains an objective function for the ellipse fitting problem.
The objective is coded using vectorized operations which are
   parallelized using parallel python.

'''

from numpy import *
from numpy import linalg as LA
import math       # needed for math.ceil() in __call__
import objective_scalar
import pp
    
class Objective(objective_scalar.Objective):
    
    def __init__(self,parameters):
        objective_scalar.Objective.__init__(self,parameters)
        self.ncpus = parameters.get('ncpus','autodetect')
        self.job_server = pp.Server(self.ncpus)
        # because autodetect may have been used, use get_ncpus to 
        # get physical number of cpus
        self.ncpus = self.job_server.get_ncpus()
        self.job_server.set_ncpus(ncpus=self.ncpus)

    def __call__(self,x):
        '''
        Calculate the objective cost in the optimization problem using
           vectorized equations.
        '''
        
        point_list = self._p
        foci1 = array([x[1],x[2]])
        foci2 = array([x[3],x[4]])
        a     = x[0]
        n = float(len(point_list))
        _lambda = 0.1
        
        def solve_sub_problem(point_list,foci1,foci2,a):
        
            pt_diff1 = point_list - foci1
            pt_diff2 = point_list - foci2
         
            x_f1_diff = pt_diff1[:,0]
            x_f2_diff = pt_diff2[:,0]
            y_f1_diff = pt_diff1[:,1]
            y_f2_diff = pt_diff2[:,1]
         
            x_f1_diff_sq = numpy.power(x_f1_diff,2)   
            x_f2_diff_sq = numpy.power(x_f2_diff,2)   
            y_f1_diff_sq = numpy.power(y_f1_diff,2)   
            y_f2_diff_sq = numpy.power(y_f2_diff,2)
            
            norm_pt_to_f1 = numpy.power(x_f1_diff_sq+y_f1_diff_sq,0.5)
            norm_pt_to_f2 = numpy.power(x_f2_diff_sq+y_f2_diff_sq,0.5)
            
            temp = numpy.power(norm_pt_to_f1+norm_pt_to_f2-2*a,2)
            part_sum = numpy.sum(temp)
            return part_sum
    
        jobs = []
        numpts = n
        sigma    = self._sigma
        ahat_max = self._ahat_max
        inc = math.ceil(n/float(self.ncpus))
        endi = 0
        for i in range(0,self.ncpus):
            starti = endi
            endi = int(min(starti+inc,n))
            # make a copy of point list which is smaller
            #   to minimize the time in transferring to the
            #   parallel processes
            local_point_list = array(point_list)[starti:endi,:]
            jobs.append(self.job_server.submit(solve_sub_problem,
                (local_point_list,foci1,foci2,a),
                (),("numpy",)))
        total = sum([job() for job in jobs])/n    
        total += _lambda*ahat_max*sigma*exp((a/ahat_max)**4)
        
        return total
        
    def __del__(self):
        self.job_server.destroy()

if __name__=='__main__':
    import time
    from random import seed

    # local modules
    import ellipse
    
    ####################################################################

    # setup test conditions
    num_reps = 100    
    num_pts = 256
    precision = 'float'
    seed(1234567890)      # set the random generator to get repeatable results
    
    # setup the reference ellipse
    
    # define the foci locations
    foci1_ref = array([2,-1])
    foci2_ref = array([-2,1])
    # define distance from foci to ellipse circumference
    a_ref = 2.5
    
    point_list = ellipse.generate_point_list(num_pts,a_ref,foci1_ref,foci2_ref)
    
    parameters = { "point_list" : point_list ,
                   "ncpus"      : 'autodetect'}


    # test the function
    t0 = time.time()
    my_objective = Objective(parameters)
    x0 = my_objective.x0
    t1 = time.time()
    for i in range(0,num_reps):
        y  = my_objective(x0)    
    t2 = time.time()
    
    print ''
    print 'Initialization took %f sec' % (t1-t0)
    print 'Using %i cpus' % my_objective.ncpus
    print 'Execution took %f sec' % (t2-t1)
    print 'Executed %i times.' % (num_reps)
    print ''

    ref_objective = objective_scalar.Objective(parameters)    
    # compare x0 calculation
    print ''
    print ('Difference between x0 calculations = %f' 
            % LA.norm(array(ref_objective.x0)-array(my_objective.x0)))
    print ('Difference between objective calcs = %f' 
            % (ref_objective(x0)-my_objective(x0)))
    print ''

 

The second script measures the execution times for this new objective implementation versus the previous scalar and vectorized implementations. One important item in this script is forcing the garbage collector to run after each test. Without doing this, each newly created class instance starts additional Python processes while the old ones are left running. This could have been solved through a more clever class design; however, for this testing, the simplest solution was to force the garbage collector to handle the issue.
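
The clean-up between tests followed roughly this pattern (a self-contained sketch; the Worker class is a hypothetical stand-in for the Objective class above):

import gc

class Worker(object):
    '''Stand-in for Objective; __del__ releases the job servers.'''
    def __del__(self):
        print 'servers destroyed'

w = Worker()    # e.g. Objective(parameters): starts worker processes
# ... run one timing test ...
del w           # drop the last reference
gc.collect()    # collect anything left in cycles before the next test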

 

Results

The first test measured the time to call the objective function once. This showed that the scalar solution was the slowest and the simple vectorized solution was the fastest. Surprisingly, parallelizing the problem (and using vectorization) resulted in a solution which was consistently between the scalar and vectorized solutions.

 

[Figure: vectorized parallel objective call times]

The second test embedded the parallelized version of the objective function in the ellipse fit problem. In this implementation, the results were mixed. For small problems, the scalar solution performed best. As the number of points on the ellipse increased, the vectorized version provided the best results. However, the parallelized version appeared to converge towards the vectorized version for larger problems.

[Figure: vectorized parallelized fit times]

 

From this test, the conclusion was that parallelization using an approach like Parallel Python is not effective here. One of the reasons is the large amount of data transfer needed to set up each function call. In this example, the ratio of computation to data transfer is low enough that, on a system with 8 CPUs, the transfer mechanisms eat up the benefits of parallelization. If the data transfer were faster or the computation times longer, then parallelizing with Parallel Python might have offered advantages.

 

Test Conditions:

Further Testing/Development

  • Evaluate the impact of imported modules
  • Evaluate ways to share read only resources among processes efficiently


This work is licensed under a Creative Commons Attribution By license.

Saturday, July 31, 2010

How to start Open Office (with UNO services) from an external copy of Python

Problem Statement:

Demonstrate how to start Open Office from an external installation of Python so the pyuno library can be used to talk to the UNO interfaces in Open Office.

Target Application:

The purpose of this code snippet is to test a feature which automates Open Office from an external copy of Python (not the one shipped with Open Office).

Discussion:

There are several ways to start a process external to Python. One of the easiest is to use the os module's system function, which passes a command to a shell just as if it were typed at the command line. This function starts a shell, passes a string to that shell, and then returns execution to Python once the new shell terminates.

To start Open Office so that pyuno can automate it, some command line arguments must be passed.  The following snippet illustrates how to use os.system(…) to start Open Office so pyuno can talk to it. While os.system is an easy way to start a process, it blocks execution until Open Office is closed.

import os
workingDir = 'C:\\Program Files (x86)\\OpenOffice.org 3\\program'
cmd = "\"\"" + workingDir + '\\soffice\" '+'"-accept=socket,host=localhost,port=8100;urp;StarOffice.NamingService"'
# blocking call 
os.system(cmd) 
# this call block execution until open office terminated 
# do something after Open Office closes 

To allow a Python script to continue to execute while Open Office is running, the subprocess module offers more powerful ways to launch and control the Open Office process. The following code snippet illustrates how to start Open Office as a listener for UNO calls using subprocess.Popen(…) .

import subprocess
workingDir = 'C:\\Program Files (x86)\\OpenOffice.org 3\\program'
cmd  = workingDir + '\\soffice.exe'
opts = '"-accept=socket,host=localhost,port=8100;urp;StarOffice.NamingService"'
OOprocess = subprocess.Popen([cmd,opts])   # this call allows continued execution
# do more stuff while Open Office runs

An important difference between subprocess.Popen(…) and os.system(…) is that subprocess.Popen(…) eliminates the need to add double quotes around path names with spaces on Windows. In fact, if you add the double quotes you will get "WindowsError: [Error 5] Access is denied" because the path and file name are not properly interpreted.
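
Once the Popen handle is available, the script can also check on or stop Open Office itself. A hedged sketch of that pattern (Popen.terminate() requires Python 2.6 or later):

import subprocess

workingDir = 'C:\\Program Files (x86)\\OpenOffice.org 3\\program'
OOprocess = subprocess.Popen([workingDir + '\\soffice.exe'])
# ... automate Open Office via pyuno here ...
if OOprocess.poll() is None:     # None means the process is still running
    OOprocess.terminate()        # ask Windows to end the process
print 'exit code:', OOprocess.wait()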

Test Environment:

  • PythonXY 2.6.5.1
  • Open Office 3.2.0
  • Windows 7


Saturday, June 5, 2010

How to launch and terminate Matlab from Python

The recommended way to automate Matlab is to use its COM API. However, there are situations where COM is not desired. In this case, the following Python script will launch Matlab, then find its process ID to provide the ability to terminate the process if needed. This is necessary because occasionally, when Matlab is launched from the command line, it spawns a new process which is independent of the launching shell. Therefore, to monitor and terminate the Matlab process (say, if a script goes awry), the Matlab process and its ID need to be found in the Windows process list. Once the process ID is known, it can be treated like any other process by Python.

import os
import wmi

c = wmi.WMI()

os.popen('matlab')

print 'Matlab launched'
raw_input('Press Enter to continue...')
print 'Looking for Matlab in process list'
found = False
for process in c.Win32_Process(name='matlab.exe'):
    print process.ProcessId, process.Name
    found = True
    raw_input('Press Enter to terminate Matlab...')
    process.Terminate()
if not found:
    print '!!! Matlab not found...'

 

This code was tested with Matlab 2007b and Python 2.5.1 with PyWin32 and WMI on Windows XP.