Monday, October 27, 2014

Getting version info for Python modules (for use in a super-installer)

The application that I am developing uses a lot of different 3rd-party Python modules. The installer (a Windows program) for this application installs these Python modules if they don't already exist on the target system. So the first thing the installer does is find out what version (if any) of these Python modules are currently installed. Below I refer to the application installer as a "super-installer" since it invokes the installers for these 3rd-party modules.

The semi-standard way for a Python module to provide version info is to have an attribute '__version__'.  For example:
>>> import numpy
>>> numpy.__version__
'1.8.2'
But some modules have an attribute 'version'.  For example:
>>> import tornado
>>> tornado.version
'3.1.1'
And some other modules have 'version' as a sub-attribute of 'release'. For example:
>>> import pyreadline
>>> pyreadline.release.version
'1.7.1'
Unfortunately, a few 3rd-party modules don't provide any runtime-queryable version info at all. For those, my super-installer writes the version info into a file in the module's folder after the module's installer has finished. The file name used for this was chosen to be a non-standard one (including the name of the company) so that it won't collide with any version mechanism introduced in future Python or module revisions. The code for reading the version of a Python module looks like this:
import importlib
import os

def getPythonModuleVersion(moduleName):
    """
    Return a string with the version info for the specified
    Python module.
    Return None if the module does not exist (import fails).
    Return 'UnknownVersion' if no version info is available.
    """
    try:
        moduleObj = importlib.import_module(moduleName)
        if moduleObj is None:
            return None
    except Exception:
        return None

    versionStr = getattr(moduleObj, '__version__', None)
    if versionStr is not None:
        return versionStr

    # '.version' works for 'tornado'
    versionStr = getattr(moduleObj, 'version', None)
    if versionStr is not None:
        return versionStr

    # '.release.version' works for 'pyreadline'
    releaseObj = getattr(moduleObj, 'release', None)
    if releaseObj is not None:
        versionStr = getattr(releaseObj, 'version', None)
        if versionStr is not None:
            return versionStr

    # Final try: see if there is a version file in the folder
    # of the module (this is not a standard mechanism but our
    # installer creates such a file if there is no other
    # version info.)
    versionFilePath = getVersionFilePath(moduleObj)
    if (versionFilePath is not None
        and os.path.isfile(versionFilePath)):
        contents = readContentsOfFile(versionFilePath)
        versionStr = contents.strip()
        if versionStr != '':
            return versionStr

    return 'UnknownVersion'
The function 'getVersionFilePath' looks like this:
def getVersionFilePath(moduleObj):
    """
    Return the path to the version file
    (in the module's folder).
    Note: this file does not necessarily exist.
    Return None if can't get the module's folder.
    """
    moduleFilePath = getattr(moduleObj, '__file__', None)
    if moduleFilePath is None:
        return None
    moduleFolderPath = os.path.dirname(moduleFilePath)
    # VersionFileName is a name chosen to not collide
    versionFilePath = os.path.join(moduleFolderPath,
                                   VersionFileName)
    return versionFilePath
The Python script (run from the super-installer) that checks if a particular Python module is installed and gets the version of the currently installed module looks like this:
    if len(sys.argv) < 2:
        sys.exit("Must supply the name of a Python module")
    moduleName = sys.argv[1]
    versionStrToAddIfNone = None
    if len(sys.argv) > 2:
        versionStrToAddIfNone = sys.argv[2]

    versionStr = getPythonModuleVersion(moduleName)

    # If 'versionStrToAddIfNone' is not None,
    # handle the case where module supplies no version info:
    if (versionStr == 'UnknownVersion'
        and versionStrToAddIfNone is not None):
        status = addVersionFile(moduleName,
                                versionStrToAddIfNone)
        if status:
            versionStr = versionStrToAddIfNone

    if versionStr is not None:
        print versionStr
    else:
        print "NoSuchModule"
The super-installer runs the above script with just one command-line argument (the module name) to find out the currently installed version of each 3rd-party Python module. If the current version is older than what is needed by the application, the installer for the new version of the Python module is run, and then the above script is run with a second command-line argument specifying the version of the module that was just installed. If the module didn't supply any version info, a version file is created via the function 'addVersionFile' which looks like this:
def addVersionFile(moduleName, versionStr):
    """
    Create a version file in the folder of the module.
    Return True iff successful.
    """
    try:
        moduleObj = importlib.import_module(moduleName)
        if moduleObj is None:
            return False
    except Exception:
        return False
    versionFilePath = getVersionFilePath(moduleObj)
    if versionFilePath is not None:
        writeStringToFile(versionFilePath, versionStr)
        return True
    return False
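The helper functions 'readContentsOfFile' and 'writeStringToFile' used above are just thin wrappers around file I/O; minimal versions (the real ones may add error handling) look something like this:
def readContentsOfFile(filePath):
    # Minimal version; the original helper may differ (e.g. error handling).
    with open(filePath, 'r') as fileObj:
        return fileObj.read()

def writeStringToFile(filePath, contents):
    # Minimal version; the original helper may differ (e.g. error handling).
    with open(filePath, 'w') as fileObj:
        fileObj.write(contents)
And, stepping back, here is roughly what the two-pass protocol described above looks like from the super-installer's side, written as a Python sketch (the real super-installer is a Windows program; the script name 'getModuleVersion.py' and the interpreter path are placeholders):
import subprocess

PYTHON_EXE = r'C:\Python27\python.exe'    # placeholder interpreter path
VERSION_SCRIPT = 'getModuleVersion.py'    # placeholder name for the script above

def queryInstalledVersion(moduleName):
    # First pass: one argument, just ask what is currently installed.
    return subprocess.check_output([PYTHON_EXE, VERSION_SCRIPT, moduleName]).strip()

def recordInstalledVersion(moduleName, installedVersion):
    # Second pass (after running the module's installer): pass the version that
    # was just installed so a version file can be written if the module itself
    # provides no version info.
    return subprocess.check_output([PYTHON_EXE, VERSION_SCRIPT,
                                    moduleName, installedVersion]).strip()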
By the way, the use of 'pkg_resources' to obtain version info is problematic if binary Windows installers have been used to install a Python module. For example, look at this:
>>> import pkg_resources
>>> pkg_resources.get_distribution('numpy').version
'1.7.1'
>>> import numpy
>>> numpy.__version__
'1.8.2'

Sunday, October 16, 2011

Getting Process Info from iOS 5's LowMemory Diagnostic Reports

In iOS 5 (the latest version of Apple's OS for iPhones, iPads, etc), you can look at the diagnostic reports that it sends (if you have okayed this) to Apple via the "Settings" app.
The diagnostic reports are in the section:
General >> About >> Diagnostics & Usage >> Diagnostics & Usage Data

I wrote a Python script that extracts info from the LowMemory reports. It prints out the names of the processes and the amount (in MB) of memory that each was using.
Here's the script: processLowMemReport.py
(It is published under the MIT licence.)
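To give a rough idea of how the extraction can work, here is a minimal sketch (not the published script). It assumes that each line in the report's "Processes" section contains a process name, a UUID in angle brackets, and a resident page count in 4 KB pages; the exact report layout may differ:
import re
import sys

def extractProcessMemoryUse(reportFilePath):
    """Return a sorted list of (processName, megabytes) tuples."""
    results = []
    inProcessSection = False
    with open(reportFilePath) as reportFile:
        for line in reportFile:
            line = line.strip()
            if line.startswith('Processes'):
                inProcessSection = True
                continue
            if not inProcessSection or not line:
                continue
            # Assumed line shape: "MobileMail <0123...abcd> 2973 (active)"
            match = re.match(r'(\S+)\s+<[0-9a-fA-F-]+>\s+(\d+)', line)
            if match:
                pages = int(match.group(2))
                # resident pages are 4 KB each
                results.append((match.group(1), pages * 4096 / (1024.0 * 1024)))
    return sorted(results)

if __name__ == '__main__':
    for processName, megabytes in extractProcessMemoryUse(sys.argv[1]):
        print("%s %.3f" % (processName, megabytes))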

Sample output (from my iPhone 3GS):

1PasswordPro 7.230
AppleIDAuthAgent 0.996
BTServer 1.023
CommCenterClassi 2.082
Medieval_iphone_ 36.859
MobileMail 11.609
MobilePhone 4.941
Music~iphone 10.059
Pastebot 4.453
Preferences 3.215
SCHelper 0.633
SpringBoard 29.078
Tweetbot 7.816
UserEventAgent 1.664
absinthed.N88 0.531
accountsd 1.043
aggregated 1.137
aosnotifyd 1.918
apsd 0.965
awdd 0.762
configd 1.531
dataaccessd 10.750
fairplayd.N88 0.633
fseventsd 1.141
geod 1.285
iapd 1.254
imagent 1.992
installd 4.457
launchd 0.480
locationd 3.000
lockdownd 1.312
mDNSResponder 0.754
mediaremoted 0.801
mediaserverd 4.160
misd 0.168
networkd 0.582
notifyd 0.898
powerd 0.543
scrabble 15.738
securityd 0.848
syslogd 0.445
ubd 1.273
wifid 1.117
yap.TV 3.730

Wednesday, May 4, 2011

Enums in Python: a more flexible and powerful implementation

There is no standard implementation of enums (enumerated values) in Python. But enums are very useful and exist in most other programming languages. Hence several "recipes" for enum implementations have been proposed. As is clear from the differences between these implementations (and the comments that others have made about them), there are many different ideas about what the characteristics of an enum should be.

I needed enums for a Python project where the enums would be part of an API that I was defining for use by end-users. My requirements for an enum implementation were:

  1. Allow definition of a named class representing the set of allowable values. (I needed a class, not an instance, since I wanted to be able to specify the enum class as the parameter type for API functions that expected a class.) The enum classes should be derived from a common superclass (so I could test if a type was an enum type).
  2. The allowable values would be represented by instances of the enum class that would be automatically created at definition time. Each instance would have a name and a value (i.e. would be a name/value pair).
  3. The values should not be restricted to integers (e.g. the values could be floats), but they should be immutable objects (e.g. not lists).
  4. Each enum class can be queried to get the list of allowable names and values.
  5. Each enum class can be queried to find the name/value pair instance given the name.
  6. It shouldn't be possible to modify the value of an existing name/value pair.
  7. Allow definition of alternative names for existing enum values (e.g. so that a name can be changed in a future revision of the API while maintaining backward compatibility for users' source code), but these alternative names should not show up when the list of allowable names is queried.
None of the existing enum implementations fulfilled the above requirements, so I wrote my own. Here's the source code for my implementation: enum.py
(It is published under the MIT licence.)

To define an enum class, you use the 'Enum' function from the above 'enum' module. This function expects its first argument to be a string specifying the name of the enum class you are defining. The subsequent arguments should be keyword arguments specifying the names and values for this enum class. The 'Enum' function returns a class object (which is why its name is capitalized). For example, to define an enum class named 'Colours', you could do this:
from enum import Enum

>>> Colours = Enum('Colours', red=1, green=2, blue=3)

After that, there would be instances Colours.red, Colours.green, Colours.blue representing the possible colour values. To get the value of the 'red' instance, you would use 'Colours.red.value' (which is 1). To get the 'red' instance given a colour string in the variable 'col', you would use 'Colours.get(col)' or 'Colours(col)'. (If that string variable contained anything other than "red", "green" or "blue", you would get an exception.)

Some more examples of use will show the other facilities of this enum implementation:

from enum import Enum

>>> Colours = Enum('Colours', red=1, green=2, blue=3)
>>> Colours.green.name
'green'

>>> Colours.green.value
2

>>> Colours.names()
['red', 'green', 'blue']

>>> Colours.get('blue').value
3

>>> Colours.getByValue(3).name
'blue'

>>> Colours('blue').value
3

>>> Colours.addAlias('bleu', 'blue')
>>> Colours.names()
['red', 'green', 'blue']

>>> Colours.bleu.value
3

>>> [x.name for x in Colours.objects()]
['red', 'green', 'blue']

>>> listOfNames = ['red', 'blue']
>>> [x.name for x in Colours.listOfNamesToListOfEnums(listOfNames)]
['red', 'blue']

>>> commaSepStr = "red, blue"
>>> [x.name for x in Colours.commaSepStrToEnumList(commaSepStr)]
['red', 'blue']


>>> Numbers = Enum('Numbers', pi=3.1415926, e=2.71828)
>>> isinstance(Numbers, type)
True

>>> Numbers.names()
['e', 'pi']

>>> round(Numbers.e.value, 3)
2.718

>>> x = Numbers.pi
>>> x is Numbers('pi')
True

>>> Numbers('phi')
Traceback (most recent call last):
    ...
ValueError: 'phi' is not a valid enum name (Must be one of: ['e', 'pi'])

>>> Numbers.e.value = 42
Traceback (most recent call last):
    ...
TypeError: can't modify an enum
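To give an idea of how such a factory can be built, here is a minimal sketch. This is not the published enum.py linked above; it covers only a subset of the requirements (no aliases, and none of the list/string helpers), and all of the internal names are my own choices:
class _EnumMeta(type):
    """Metaclass so that e.g. Colours('blue') returns an existing member."""
    def __call__(cls, name):
        return cls.get(name)

class EnumBase(object):
    """Common superclass, so code can test whether a type is an enum type."""
    __metaclass__ = _EnumMeta    # Python 2 class syntax, matching the era of this post

    def __setattr__(self, attrName, attrValue):
        raise TypeError("can't modify an enum")

    @classmethod
    def objects(cls):
        # Report members in order of their values (one possible choice).
        return sorted(cls._members.values(), key=lambda m: m.value)

    @classmethod
    def names(cls):
        return [m.name for m in cls.objects()]

    @classmethod
    def get(cls, name):
        try:
            return cls._members[name]
        except KeyError:
            raise ValueError("%r is not a valid enum name (Must be one of: %s)"
                             % (name, cls.names()))

def Enum(className, **namesAndValues):
    enumClass = _EnumMeta(className, (EnumBase,), {})
    members = {}
    for name, value in namesAndValues.items():
        member = object.__new__(enumClass)       # bypass _EnumMeta.__call__
        object.__setattr__(member, 'name', name)     # bypass the read-only __setattr__
        object.__setattr__(member, 'value', value)
        members[name] = member
        setattr(enumClass, name, member)         # e.g. Colours.red
    enumClass._members = members
    return enumClass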

Thursday, April 28, 2011

Where did the time go? (Instrumenting C++/CLI and Python code)

The Problem:
How to speed up your program?
It's working ("Make it right, then make it fast") but it's a bit slower than you hoped. The first step in making it faster is to find out which parts are taking the most time.
I'm not (necessarily) talking about CPU time here - what matters in most cases is how long the user needs to wait until the program finishes its task. The bottlenecks might not be CPU time - e.g. the program might be spending a lot of time doing I/O or waiting for some other program that it interacts with, etc. So what I usually want to know is which parts of the program take up the most wall clock time. There are various sophisticated tools available to do this, but often they won't work for your specific situation (e.g. when your program has different interacting components written in different programming languages). And often you don't want to invest a lot of effort in acquiring and learning to use a sophisticated tool. Something simple is often best. And something that allows you to successively refine your efforts to "zoom in" on the bottlenecks.

The Solution:
A solution that has worked well for me is a utility class that makes it easy to instrument my code, bracketing sections of interest and generating reports at certain points. The reports refer to the sections of the program by the names that I assigned to them when I instrumented the code and show how much time elapsed in each section.

I've implemented a utility class called "TimeSpent" in Python and in C++/CLI for .NET. I supply the code below but here's an example of how to use it in Python (the usage in C++ is very similar). Suppose your Python program looks like this:
#!/usr/bin/env python
from myStuff import (func1, func2, func3, func4)

x = 1
func2('hello')
func1(x)
func3()
x = func4()
func1(x)
func2('goodbye')

To find out how much each part of this program contributes to the overall execution time, we instrument the code by creating an instance of the 'TimeSpent' class and adding calls to the 'start' and 'stop' methods around each section of interest:
#!/usr/bin/env python
from myStuff import (func1, func2, func3, func4)
from timeSpent import TimeSpent

x = 1
timeSpent = TimeSpent()
timeSpent.start("A")
func2('hello')
func1(x)
timeSpent.stop("A")
func3()
timeSpent.start("B")
x = func4()
timeSpent.stop("B")
timeSpent.start("A")
func1(x)
func2('goodbye')
timeSpent.stop("A")
print timeSpent.generateReportAndReset()

Each call to the 'start' method starts a stopwatch (creating it automatically if one with the specified name doesn't exist) and the corresponding 'stop' method stops that stopwatch. The accumulated times for all of the stopwatches are reported via the 'generateReportAndReset' method. Above I have named the stopwatches "A" and "B" and have used stopwatch "A" to measure the two separate sections where 'func1' and 'func2' are called.
Here's the output from the instrumented version of the program:
hello
goodbye
Elapsed seconds: 0.7
Breakdown of time spent (in milliseconds):
A: 629 (count: 2 avg: 314.6)
B: 95 (count: 1 avg: 95.0)
Note that the count for "A" is 2 - this reflects the fact that stopwatch "A" was used twice. The time of 629 milliseconds is the sum of the times taken for the two sections of code where stopwatch "A" was used.
Running the instrumented program again shows a longer execution time:
hello
goodbye
Elapsed seconds: 1.2
Breakdown of time spent (in milliseconds):
A: 1070 (count: 2 avg: 534.8)
B: 103 (count: 1 avg: 102.7)
And further runs would show that the "A" time is sometimes around 600 ms but other times around 1100 ms. To investigate this, we change the stopwatch name to "A1" for the first section and "A2" for the second section. Here are some outputs from this version of the program:
hello
goodbye
Elapsed seconds: 0.7
Breakdown of time spent (in milliseconds):
A1: 314 (count: 1 avg: 314.2)
A2: 293 (count: 1 avg: 292.9)
B: 97 (count: 1 avg: 96.7)
hello
goodbye
Elapsed seconds: 1.3
Breakdown of time spent (in milliseconds):
A1: 310 (count: 1 avg: 310.0)
A2: 843 (count: 1 avg: 842.7)
B: 105 (count: 1 avg: 104.6)
So it seems that the variation in overall execution time is due to variations in section "A2". To test the hypothesis that the variation is caused by different values of 'x', let's incorporate the value of 'x' into the stopwatch names:
#!/usr/bin/env python
from myStuff import (func1, func2, func3, func4)
from timeSpent import TimeSpent

x = 1
timeSpent = TimeSpent()
timeSpent.start("A (x=%d)" % x)
func2('hello')
func1(x)
timeSpent.stop("A (x=%d)" % x)
func3()
timeSpent.start("B")
x = func4()
timeSpent.stop("B")
timeSpent.start("A (x=%d)" % x)
func1(x)
func2('goodbye')
timeSpent.stop("A (x=%d)" % x)
print timeSpent.generateReportAndReset()
Here are some outputs from this version of the program:
hello
goodbye
Elapsed seconds: 0.7
Breakdown of time spent (in milliseconds):
A (x=1): 600 (count: 2 avg: 300.1)
B: 98 (count: 1 avg: 97.8)
hello
goodbye
Elapsed seconds: 1.2
Breakdown of time spent (in milliseconds):
A (x=-1): 783 (count: 1 avg: 783.3)
A (x=1): 284 (count: 1 avg: 283.8)
B: 108 (count: 1 avg: 108.3)
The above shows that when x is -1, the second "A" section takes a lot longer than when x is 1.

The ability to use arbitrary strings to name the stopwatches is one of the things that make this utility class user-friendly (here the user is the programmer). Preventing or detecting common errors is another aspect of user (programmer) friendliness. The usage of the TimeSpent class relies on the names matching up in the 'start' and 'stop' methods. What happens if the programmer slips up? Here's an example with two slip-ups: the call to stop("8") should have been stop("B"), and the call to start("A") on the second-to-last line should have been stop("A"):
#!/usr/bin/env python
from myStuff import (func1, func2, func3, func4)
from timeSpent import TimeSpent

x = 1
timeSpent = TimeSpent()
timeSpent.start("A")
func2('hello')
func1(x)
timeSpent.stop("A")
func3()
timeSpent.start("B")
x = func4()
timeSpent.stop("8")
timeSpent.start("A")
func1(x)
func2('goodbye')
timeSpent.start("A")
print timeSpent.generateReportAndReset()
The output from running the above version of the program makes it clear to the programmer that there were some mistakes:
hello
WARNING: unknown stopwatch name passed to 'stop' (8)
goodbye
WARNING: start(A) called when stopwatch is already running
Elapsed seconds: 0.7
Breakdown of time spent (in milliseconds):
A: 599 (count: 2 avg: 299.4)
B: 390 (count: 1 avg: 390.5)

There is no limit to how many stopwatches can be running at once. If you want to "zoom in" on a section of code, just bracket the sub-sections with more stopwatches (named appropriately) and the report will show the time for the whole section as well as the breakdown by sub-section.

If you want to track the time spent in code that is in many different files, it is convenient to use the 'getGlobalInstance' method of the 'TimeSpent' class - it gives you access to a singleton instance that you can use anywhere in your program.
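To make the mechanics concrete, a minimal class along these lines could look like the sketch below. This is not the published timeSpent.py; details such as how still-running stopwatches are reported and the exact report wording are guesses:
import time

class TimeSpent(object):
    _globalInstance = None

    def __init__(self):
        self.reset()

    def reset(self):
        self._startTimes = {}   # stopwatch name -> wall-clock start time
        self._totalSecs = {}    # stopwatch name -> accumulated elapsed seconds
        self._counts = {}       # stopwatch name -> number of start/stop pairs
        self._createdAt = time.time()

    def start(self, name):
        if name in self._startTimes:
            print("WARNING: start(%s) called when stopwatch is already running" % name)
            return
        self._startTimes[name] = time.time()

    def stop(self, name):
        if name not in self._startTimes:
            print("WARNING: unknown stopwatch name passed to 'stop' (%s)" % name)
            return
        elapsed = time.time() - self._startTimes.pop(name)
        self._totalSecs[name] = self._totalSecs.get(name, 0.0) + elapsed
        self._counts[name] = self._counts.get(name, 0) + 1

    def generateReportAndReset(self):
        lines = ["Elapsed seconds: %.1f" % (time.time() - self._createdAt),
                 "Breakdown of time spent (in milliseconds):"]
        for name in sorted(self._totalSecs):
            totalMillis = self._totalSecs[name] * 1000
            count = self._counts[name]
            lines.append("%s: %d (count: %d avg: %.1f)"
                         % (name, totalMillis, count, totalMillis / count))
        self.reset()
        return "\n".join(lines)

    @classmethod
    def getGlobalInstance(cls):
        # Singleton instance, handy when instrumenting code spread across files.
        if cls._globalInstance is None:
            cls._globalInstance = cls()
        return cls._globalInstance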

By the way, here's the 'myStuff' module I used in the above test program:
#!/usr/bin/env python
from __future__ import division
import time
import random

def wasteTime(numMillis, stdDev=None):
    if stdDev is None:
        stdDev = numMillis / 15
    time.sleep(random.gauss(numMillis, stdDev) / 1000)

def func1(x):
    if x > 0:
        wasteTime(100)
    else:
        wasteTime(600)

def func2(msg):
    wasteTime(200)
    print msg

def func3():
    wasteTime(10)

def func4():
    wasteTime(100)
    if random.random() < 0.5:
        return 1
    else:
        return -1


Source code:
The source code for the TimeSpent class (in Python and in C++/CLI (.NET)) is published under the MIT licence and is available from the following links:

Thursday, April 21, 2011

Minimize Nesting of If Statements

The essay "Code in Motion" by Laura Wingerd and Christopher Seiwald of Perforce Software (from the book "Beautiful Code") emphasizes the importance of readability of source code for avoiding bugs. One of their style rules is to minimize the nesting of 'if' statements.
Here's a graph showing an increase in bugs possibly correlated with an increase in nesting levels in the Perforce "DiffMerge" program source code:

Monday, February 2, 2009

Puzzling over 'du' (disk usage)

Something (with 'du') that had me puzzled for a while:

I was looking at how much space was taken up by my MacPorts installation (on OS X 10.5.6).
I used the following alias to give me a breakdown by directory:
% alias sudodiskspace='sudo du -x -h -d 1'

% sudodiskspace /opt/local
9.1M /opt/local/bin
284K /opt/local/etc
12M /opt/local/include
83M /opt/local/lib
32K /opt/local/libexec
69M /opt/local/Library
68K /opt/local/sbin
107M /opt/local/share
175M /opt/local/var
455M /opt/local


% sudodiskspace /opt/local/var
1.0M /opt/local/var/cache
447M /opt/local/var/macports
448M /opt/local/var


That seemed strange - the first run reported that the /opt/local/var directory was taking up 175 MB, but the second run reported 448 MB.
I.e. 'du' seemed to be misreporting the space taken by sub-directories.

I thought something must be wrong and even resorted to looking at the source code for 'du' and I was considering building a version of 'du' that I could add debugging statements to.
But I decided to first look closer at what was in the /opt/local/var directory and so I did:
% ls -lR /opt/local/var | more
and scanning through the output, I started to notice that most of the files were listed as having 2 hard-links. Hmmm - that's unusual.
Sure enough, the files under /opt/local/var also occur (via hard-links) under /opt/local/include, /opt/local/lib, /opt/local/share, ...
So 'du' is giving me an accurate breakdown after all. But this points out that the breakdown from 'du' will be somewhat arbitrarily distributed across the sub-directories if they are sharing files via hard-links. (It depends on the order of traversal of the directories.)
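As a footnote (a sketch of my own, not something from the original investigation), one way to get a total that counts each hard-linked file only once is to track inodes while walking the tree:
import os

def sizeCountingEachInodeOnce(topDir):
    """Total size in bytes, counting multiply-hard-linked files only once."""
    seenInodes = set()
    totalBytes = 0
    for dirPath, dirNames, fileNames in os.walk(topDir):
        for fileName in fileNames:
            st = os.lstat(os.path.join(dirPath, fileName))
            key = (st.st_dev, st.st_ino)
            if key in seenInodes:
                continue    # already counted via another hard-link
            seenInodes.add(key)
            totalBytes += st.st_size
    return totalBytes

print("%.1f MB" % (sizeCountingEachInodeOnce('/opt/local') / (1024.0 * 1024)))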

Saturday, January 31, 2009

Installing Pygame on Leopard (OS X 10.5)

I was disappointed that the binary installers for Pygame available at pygame.org expect that you have installed a version of Python from python.org when Leopard (OS X 10.5) already comes with a perfectly good version of Python. I didn't want to install a whole separate version of Python just for Pygame.

So I decided to install Pygame from source. I followed the instructions on the "MacCompile" wiki at python.org.

Step 1:
I downloaded the latest versions of the SDL frameworks binaries for OS X (sometimes a bit more recent than the versions indicated on the above page):
SDL.framework 1.2.13
SDL_ttf.framework 2.0.9
SDL_mixer.framework 1.2.8
SDL_image.framework 1.2.7
and I copied the above framework folders from the DMGs to the /Library/Frameworks folder.

Step 2:
I downloaded the source for the Numeric module (24.2) and then ran the command:
sudo python setup.py install
which installed some header files under /System/Library/Frameworks/Python.framework/Versions/2.5/include/python2.5/Numeric
and installed the Numeric module in /Library/Python/2.5/site-packages

Step 3:
I downloaded the combo installer for libpng & libjpeg from ethan.tira-thompson.com
and then renamed the libjpeg.dylib & libpng.dylib as recommended by the above instructions.

Step 4:
I downloaded the PyGame 1.8.1 source
and then (in the directory pygame-1.8.1release) ran the following command:
python config.py
which gave the following output:

Using Darwin configuration...
Hunting dependencies...
Framework SDL found
Framework SDL_ttf found
Framework SDL_image found
Framework SDL_mixer found
Framework smpeg not found
PNG : found
JPEG : found
SCRAP : not found


Then I ran the following commands:
python setup.py build
sudo python setup.py install



As mentioned above, the Numeric module puts some header files into a "Numeric" sub-directory under /System/Library/Frameworks/Python.framework. That's not too bad, but I don't see why they couldn't have gone under /Library instead. The pygame stuff seems to have gone under /Library/Python/2.5/site-packages.

I tested the installation by running some of my Pygame programs - and it all seems to work fine.
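For a quicker smoke test than running a full program, a minimal snippet along these lines (my own, not part of the instructions above) should print the version and briefly open a window:
import pygame

print(pygame.ver)                  # the installed Pygame version
pygame.init()
screen = pygame.display.set_mode((320, 240))
pygame.time.wait(1000)             # keep the window up for a second
pygame.quit()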