Week 4: A conformance week
cwltool project has two types of tests: unit tests and conformance tests. This week, I was finally able to get all tests passing in Python 3 runtime! I was quickly able to hack cwltest module to run in Python 3, thanks to 2to3 utility. I then ran the testing script and for each failing test, debugged using PyCharm’s debugger. This workflow proved to be very effective and having a debugger+good logging in the project is really really helpful. I can’t stress it enough.
From here on, I need to do a fair bit of cleanup and document my changes so that the code is maintainable in the future. Definitely getting close to release of new version of cwltool which supports both Python 2.7 and Python 3.
I would like to highlight a few interesting things from my task of debugging conformance tests.
1. Sorting heterogeneous lists
I came across this problem during sorting dicts. In shorts, if you need to sort such a list,
a = a = [[-1000, 'py'], [-1, 'args'], [0,0], [0, 'datum'], [1, 'wow']]
print(a.sort())
Ouput in py2:
[[-1000, 'py'], [-1, 'args'], [0, 0], [0, 'datum'], [1, 'wow']]
Output in py3:
TypeError Traceback (most recent call last)
<ipython-input-2-3aec292202c5> in <module>()
1 a = [[-1000, 'py'], [-1, 'args'], [0,0], [0, 'datum'], [1, 'wow']]
----> 2 print(a.sort())
TypeError: unorderable types: str() < int()
So Python 3 doesn’t support ordering of different types, thus can’t sort lists containing hetrogenous types by default.
This is a custom comparator cmp
function I came up with, which re-creates python2 like sorting nature (to the extent we require):
from itertools import zip_longest
def cmp_hack(a, b):
# iterate through both list till max of size of both lists
for i,j in zip_longest(a,b):
if i == j:
continue
# in case 1st list is smaller
# should come first in sorting
if i == None:
return 1
# if 1st list is longer,
# it should come later in sort
elif j == None:
return -1
# if either of the list contains str element
# at any index, both should be str before comparing
if isinstance(i, str) or isinstance(j, str):
return 1 if str(i) > str(j) else -1
# int comparison otherwise
return 1 if i > j else -1
# if both lists are equal
return 0
This function also takes care of the case when the lists to be compared are not of same size.
Since, cmp
is not supported anymore on Python 3, cmp_to_key
function came in handy which is already provided in functools
stlib from Python version >= 3.2
This is what the final working code looks like:
from functools import cmp_to_key
a.sort(key=cmp_to_key(cmp_hack))
2. Taking care of binary strings before passing to json.dumps()
Given a nested dict like this:
import json
a = {'r': b'try', 'd' : {'low': b'key', 'f' : \
{'wpw': b'lol', 'self': [{'wow': b'again'}]}}}
json.dumps(a)
Output in Py2:
'{"r": "try", "d": {"low": "key", "f": {"self": [{"wow": "again"}], "wpw": "lol"}}}'
Output in Py3:
TypeError: b'key' is not JSON serializable
This error comes up because of clear distinction between binary and unicode strings in Python 3.
I wrote a function which recursively replaces all binary strings in a dict of nested dicts and lists into corresponding decoded unicode strings. Works alike in py2 and py3.
def bytes2str_in_dicts(a):
# if input is dict, recursively call for each value
if isinstance(a, dict):
for k, v in dict.items(a):
a[k] = bytes2str_in_dicts(v)
return a
# if list, iterate through list and fn call
# for all its elements
if isinstance(a, list):
for idx, value in enumerate(a):
a[idx] = bytes2str_in_dicts(value)
return a
# if value is bytes, return decoded string,
elif isinstance(a, bytes):
return a.decode('utf-8')
# simply return elements itself
else:
return a
3. Special apache avro mention:
Earlier this week I was still stuck at avro-python3 package errors. One of my mentors, Anton helped me fix multiple failing tests, which were due to a bug in the avro-python3 code. Moving on from that, I got stuck at another set of errors, which too gave traceback to avro-python3 package, not sure if the error was in our codebase or avro’s this time around.
As a result, I decided to migrate existing Python 2 implementation of avro package using 2to3
and some grunt work. A few hours later, I was able to port it to Python 3. I tried using this package in my code and to my surprise worked well without any changes in cwltool codebase.
In the coming week I need to come up with a concrete solution to the avro problem. The ultimate goal is to have robust codebase which is maintainable in the long run.
Feel free to check my progress at: cwltool/pull/442
To Do next week
- decide on apache avro
- cleanup and push all progress to github
- revisit and fix mypy errors after this week’s changes