Week 4: A conformance week

cwltool project has two types of tests: unit tests and conformance tests. This week, I was finally able to get all tests passing in Python 3 runtime! I was quickly able to hack cwltest module to run in Python 3, thanks to 2to3 utility. I then ran the testing script and for each failing test, debugged using PyCharm’s debugger. This workflow proved to be very effective and having a debugger+good logging in the project is really really helpful. I can’t stress it enough.

From here on, I need to do a fair bit of cleanup and document my changes so that the code is maintainable in the future. Definitely getting close to release of new version of cwltool which supports both Python 2.7 and Python 3.

I would like to highlight a few interesting things from my task of debugging conformance tests.

1. Sorting heterogeneous lists

I came across this problem during sorting dicts. In shorts, if you need to sort such a list,

a = a = [[-1000, 'py'], [-1, 'args'], [0,0], [0, 'datum'], [1, 'wow']]
print(a.sort())

Ouput in py2:

[[-1000, 'py'], [-1, 'args'], [0, 0], [0, 'datum'], [1, 'wow']]

Output in py3:

TypeError  Traceback (most recent call last)
<ipython-input-2-3aec292202c5> in <module>()
      1 a = [[-1000, 'py'], [-1, 'args'], [0,0], [0, 'datum'], [1, 'wow']]
----> 2 print(a.sort())
TypeError: unorderable types: str() < int()

So Python 3 doesn’t support ordering of different types, thus can’t sort lists containing hetrogenous types by default. This is a custom comparator cmp function I came up with, which re-creates python2 like sorting nature (to the extent we require):

from itertools import zip_longest
def cmp_hack(a, b):
    # iterate through both list till max of size of both lists
    for i,j in zip_longest(a,b):
        if i == j:
            continue
        # in case 1st list is smaller
        # should come first in sorting
        if i == None:
            return 1
        # if 1st list is longer,
        # it should come later in sort
        elif j == None:
            return -1
        # if either of the list contains str element
        # at any index, both should be str before comparing
        if isinstance(i, str) or isinstance(j, str):
            return 1 if str(i) > str(j) else -1
        # int comparison otherwise
        return 1 if i > j else -1
    # if both lists are equal
    return 0

This function also takes care of the case when the lists to be compared are not of same size. Since, cmp is not supported anymore on Python 3, cmp_to_key function came in handy which is already provided in functools stlib from Python version >= 3.2

This is what the final working code looks like:

from functools import cmp_to_key
a.sort(key=cmp_to_key(cmp_hack))

2. Taking care of binary strings before passing to json.dumps()

Given a nested dict like this:

import json
a = {'r': b'try', 'd' : {'low': b'key', 'f' : \
                          {'wpw': b'lol', 'self': [{'wow': b'again'}]}}}
json.dumps(a)

Output in Py2:

'{"r": "try", "d": {"low": "key", "f": {"self": [{"wow": "again"}], "wpw": "lol"}}}'

Output in Py3:

TypeError: b'key' is not JSON serializable

This error comes up because of clear distinction between binary and unicode strings in Python 3.
I wrote a function which recursively replaces all binary strings in a dict of nested dicts and lists into corresponding decoded unicode strings. Works alike in py2 and py3.

def bytes2str_in_dicts(a):
    # if input is dict, recursively call for each value
    if isinstance(a, dict):
        for k, v in dict.items(a):
            a[k] = bytes2str_in_dicts(v)
        return a

    # if list, iterate through list and fn call
    # for all its elements
    if isinstance(a, list):
        for idx, value in enumerate(a):
            a[idx] = bytes2str_in_dicts(value)
            return a

    # if value is bytes, return decoded string,
    elif isinstance(a, bytes):
        return a.decode('utf-8')

    # simply return elements itself
    else:
        return a

3. Special apache avro mention:

Earlier this week I was still stuck at avro-python3 package errors. One of my mentors, Anton helped me fix multiple failing tests, which were due to a bug in the avro-python3 code. Moving on from that, I got stuck at another set of errors, which too gave traceback to avro-python3 package, not sure if the error was in our codebase or avro’s this time around.

As a result, I decided to migrate existing Python 2 implementation of avro package using 2to3and some grunt work. A few hours later, I was able to port it to Python 3. I tried using this package in my code and to my surprise worked well without any changes in cwltool codebase. In the coming week I need to come up with a concrete solution to the avro problem. The ultimate goal is to have robust codebase which is maintainable in the long run.

Feel free to check my progress at: cwltool/pull/442

To Do next week

decide on apache avro
cleanup and push all progress to github
revisit and fix mypy errors after this week’s changes