Python: Passing data without passing it

Posted by Moser on 11 Feb 2021

In this post I will show you how to pass data to places where you cannot pass it as arguments.

As an example, let’s say we have a function that we cannot change and that calls our code (the action function below). Because we cannot change function_i_cannot_change, we cannot make it pass dynamic_value through to our code. Also note that we cannot pass anything other than a number as the arg parameter, because of its implementation. (Yes, in this simple example there are alternative solutions; see the note at the end.)

from typing import Callable

def function_i_cannot_change(callback: Callable[[int], int], arg: int) -> int:
    arg += 1
    # lots of code
    result = callback(arg)
    # more code
    return result


def action(arg):
    # hmm, how do I get `dynamic_value` into this method?
    return arg + dynamic_value


def usage():
    dynamic_value = 100  # a local variable; action() cannot see it
    function_i_cannot_change(action, 1)  # raises NameError inside action()

We can solve this by passing the data through a global variable.

DYNAMIC_GLOBAL = 1

def action(arg):
    return arg + DYNAMIC_GLOBAL

def usage():
    global DYNAMIC_GLOBAL 
    DYNAMIC_GLOBAL = 100
    function_i_cannot_change(action, 1)  # returns 102

Yes, you’re right… Using globals is bad. It makes your code hard to follow and breaks thread-safety.

To address the threading issue, we can use a threading.local instead:

import threading

DYNAMIC_TL = threading.local()
DYNAMIC_TL.value = 1  # note: this default only exists in the thread that sets it

def action(arg):
    return arg + DYNAMIC_TL.value

def usage():
    DYNAMIC_TL.value = 100
    function_i_cannot_change(action, 1)  # returns 102

This solves the threading issue, but the consuming code still has to know about the thread local, and we have to trust all consumers to be well-behaved: both in what they set on DYNAMIC_TL.value and in resetting it to the original value afterwards.

Let’s use a context manager to make the usage safer and clearer for consumers.

import threading
import contextlib

DYNAMIC_TL = threading.local()
DYNAMIC_TL.value = 1

@contextlib.contextmanager
def override_dynamic_value(value):
    # we can do assertions on the value here
    old_value = DYNAMIC_TL.value
    DYNAMIC_TL.value = value
    try:
        yield
    finally:
        # restore the old value even if the wrapped code raises
        DYNAMIC_TL.value = old_value

def get_dynamic_value():
    return DYNAMIC_TL.value


def action(arg):
    return arg + get_dynamic_value()


def usage():
    with override_dynamic_value(100):
        function_i_cannot_change(action, 1)  # returns 102

I also added get_dynamic_value to hide the thread local from the implementation of action.

This sounds a bit crazy. Does anybody really do this?

Yes. Have you ever wondered how flask.request works?

Can I really use this?

Well, I would say: yes, if there is no simpler alternative. The pattern comes with the general problems of global variables:

a) They are available everywhere, so people will start using them in more and more places. This leads to unintended coupling.

b) The dataflow is not obvious. You will get surprising and hard-to-debug errors.

For me, this means that I only use the pattern in a very explicit way that makes sure it is used responsibly: I would move the thread local and its accessors to their own module and make the thread local itself private.
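
A minimal sketch of such a module (the module name, the names and the default handling via getattr are my own choices):

# dynamic_value.py (hypothetical module name)
import contextlib
import threading

_DYNAMIC_TL = threading.local()  # private: consumers only use the functions below
_DEFAULT = 1

def get_dynamic_value():
    # getattr with a default also covers threads that never set the value
    return getattr(_DYNAMIC_TL, "value", _DEFAULT)

@contextlib.contextmanager
def override_dynamic_value(value):
    old_value = get_dynamic_value()
    _DYNAMIC_TL.value = value
    try:
        yield
    finally:
        _DYNAMIC_TL.value = old_value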

I often find myself using this technique in tests, when I need to get 3rd party code to cooperate. In production code I would try very hard to find an alternative.

Note: Alternative solutions for the simple example

The simple example has a couple of simple, preferable solutions.

For example, there is partial application of functions:

import functools

def action(dynamic_value, arg):
    return arg + dynamic_value

def usage():
    function_i_cannot_change(functools.partial(action, 100), 1)  # returns 102

Or a function that returns a function:

def action_factory(dynamic_value):
    def action(arg):
        return arg + dynamic_value
    return action

def usage():
    function_i_cannot_change(action_factory(100), 1)  # returns 102

Or a class:

class MyAction:
    def __init__(self, dynamic_value):
        self.dynamic_value = dynamic_value

    def action(self, arg):
        return arg + self.dynamic_value

def usage():
    function_i_cannot_change(MyAction(100).action, 1)  # returns 102

Setup Pylint pre-commit hook

Posted by Moser on 28 Oct 2020

A short guide on the Pylint hook for pre-commit.

1) Install pre-commit

In a Python project you should add the pre-commit package to your development requirements (requirements-dev.txt, Pipfile or whatever you are using).

You can also install it directly via pip:

pip install pre-commit

Check the installation by running

pre-commit --version

2) Add a config file

Create a file called .pre-commit-config.yaml in the root directory of your git repository and add the following content:

repos:
  - repo: https://github.com/pycqa/pylint
    rev: pylint-2.6.0
    hooks:
      - id: pylint

3) Install the hook

The following command installs a small script into .git/hooks that calls the pre-commit tool.

pre-commit install

4) Try to commit a file with an issue

Prepare a broken file:

echo "import foobaz" > test.py && git add test.py

When you try to commit this, pylint will fail and stop you from doing the commit.

$ git commit -m "Add test.py"
[INFO] Initializing environment for https://github.com/pycqa/pylint.
[INFO] Installing environment for https://github.com/pycqa/pylint.
[INFO] Once installed this environment will be reused.
[INFO] This may take a few minutes...
pylint...................................................................Failed
- hook id: pylint
- exit code: 22

************* Module test
test.py:1:0: C0114: Missing module docstring (missing-module-docstring)
test.py:1:0: E0401: Unable to import 'foobaz' (import-error)
test.py:1:0: W0611: Unused import foobaz (unused-import)

---------------------------------------------------------------------
Your code has been rated at -60.00/10 (previous run: 2.22/10, -62.22)

$ git status --short
A  test.py

When pre-commit runs for the first time, it installs pylint into a virtualenv in a special folder under ~/.cache/pre-commit. This installation is reused on subsequent runs.

Disable the hook for a commit

Sometimes you may want to commit a file even though it has an issue. You can disable the pylint hook by setting an environment variable, e.g. like this:

SKIP=pylint git commit ...

If you want to disable all hooks while committing, use the --no-verify/-n option:

git commit -n ...

More hooks

There are dozens of tools that can be run in a pre-commit hook. In my Python projects I usually have pylint, black and isort activated. Check out the full list.
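
Such a config could look like this (the rev values are examples from around the time of writing; pin them to whichever releases you actually use):

repos:
  - repo: https://github.com/pycqa/pylint
    rev: pylint-2.6.0
    hooks:
      - id: pylint
  - repo: https://github.com/psf/black
    rev: 20.8b1
    hooks:
      - id: black
  - repo: https://github.com/pycqa/isort
    rev: 5.6.4
    hooks:
      - id: isort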

Run Pylint automatically on your project

I created a product that runs Pylint and other linters on your Github projects automatically: Please have a look at PyCodeQual.

SQLAlchemy: Prevent implicit cross join (cartesian product)

Posted by Moser on 02 Jan 2020

I really like to create SQL queries using SQLAlchemy’s explicit and declarative API. When I use it instead of raw strings, roughly half of the errors I tend to introduce are caught before the query is even sent.

Here is an example:

import sqlalchemy as sa

metadata = sa.MetaData()
a = sa.Table(
    "a",
    metadata,
    sa.Column("a_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("name", sa.String),
)
b = sa.Table(
    "b",
    metadata,
    sa.Column("b_id", sa.Integer, primary_key=True, autoincrement=True),
    sa.Column("a_id", sa.Integer, sa.ForeignKey(a.c.a_id)),
)

def create_select(additional_filters):
    return sa.select([a], whereclause=sa.and_(*additional_filters))

print(create_select([a.c.name == 'Foo']))
# SELECT a.a_id, a.name
# FROM a
# WHERE a.name = :name_1

A huge advantage over dealing with string queries is that you can create different parts of the query on their own and combine them. In the example, I create the SELECT statement in a central place and allow callers to pass in parts of the WHERE clause.

Unfortunately, this pattern has a dangerous property: it will implicitly add a CROSS JOIN (or cartesian product) when you pass a filter expression that contains a column from a table that is not already part of the select statement.

print(create_select([a.c.name == 'Foo', b.c.b_id == 1]))
# SELECT a.a_id, a.name
# FROM a, b
# WHERE a.name = :name_1 AND b.b_id = :b_id_1

On small tables this will just create an unexpected result set, but when the involved tables are large this query might well exhaust the DB server’s resources. The concrete problem can be fixed by adding a join to all the tables that should be allowed in the filters:

def create_select_corrected(additional_filters):
    return sa.select([a], from_obj=a.join(b), whereclause=sa.and_(*additional_filters))

print(create_select_corrected([a.c.name == 'Foo', b.c.b_id == 1]))
# SELECT a.a_id, a.name
# FROM a JOIN b ON a.a_id = b.a_id
# WHERE a.name = :name_1 AND b.b_id = :b_id_1

A more abstract problem is that we can create queries that will lead to unexpected results without noticing. A good but pretty expensive solution would be to validate the parts that are passed in. Depending on how variable the input can be, I would go for this solution. But in other cases it’s just me using my own query construction logic in new ways, so I prefer a cheaper solution that makes me aware of the problem.
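
For reference, a rough sketch of such validation (validate_filters and its details are my own invention, built on SQLAlchemy’s visitors module):

from sqlalchemy.sql import visitors

def validate_filters(filters, allowed_tables):
    # walk each expression tree and reject columns from unexpected tables
    for expr in filters:
        for element in visitors.iterate(expr, {}):
            if isinstance(element, sa.Column) and element.table not in allowed_tables:
                raise ValueError("Filter references disallowed table {!r}: {}".format(
                    element.table.name, element))

create_select could then call validate_filters(additional_filters, {a, b}) before building the statement.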

Update 2020-05-26: SQLAlchemy 1.4

Starting with version 1.4 SQLAlchemy provides a better and safer way to discover the problem. It will issue a warning whenever the compiled SQL contains a cartesian product.
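
If you prefer a hard failure over a warning, you can escalate it with the standard warnings module (the message pattern below is an assumption about the warning text):

import warnings
from sqlalchemy.exc import SAWarning

# turn SQLAlchemy 1.4's cartesian product warning into an error
warnings.filterwarnings("error", message=".*cartesian product.*", category=SAWarning)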

Old solution (still necessary for SQLAlchemy < 1.4)

Thankfully, SQLAlchemy’s events allow us to be notified when a query is about to be executed. We can create a listener that raises an exception when we try to run a problematic query.

def before_execute(conn, clauseelement, multiparams, params):
    # raise if a SELECT draws from more than one FROM element
    if (
        isinstance(clauseelement, sa.sql.selectable.Select)
        and len(clauseelement.froms) > 1
    ):
        raise RuntimeError("Cross join detected:\n{}".format(clauseelement))

# `engine` is the sa.create_engine(...) instance used to run the queries
sa.event.listen(engine, "before_execute", before_execute)

It’s not perfect, because it does not check subqueries or CTEs, but it gives us a line of defense. I am also thinking about adding an assertion that checks for the problem to the tests that exercise the query construction logic.

Here is the complete example code: sqlalchemy_implicit_cross_join.py.

pyplot: Histogram on X/Y axis

Posted by Moser on 23 Mar 2018

Last week a colleague asked me for help on a figure he wanted to plot using pyplot. He had a couple of line plots and an intersecting line, and wanted to show the histogram of the intersection points on both the X and Y axes.

I found a neat way of accomplishing such a plot using a GridSpec.

Here is the code, using a regular scatter plot as the central element; but as every part of the figure is filled separately, you can swap any of them out. You have to make sure, though, that the histogram plots and the main plot share the same X and Y axes respectively.
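
A minimal sketch of such a figure (the random example data and the 4:1 ratios are my own choices):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.gridspec import GridSpec

x = np.random.normal(size=500)
y = 0.5 * x + np.random.normal(scale=0.5, size=500)

fig = plt.figure(figsize=(6, 6))
gs = GridSpec(2, 2, width_ratios=[4, 1], height_ratios=[1, 4])

ax_main = fig.add_subplot(gs[1, 0])
ax_histx = fig.add_subplot(gs[0, 0], sharex=ax_main)  # histogram above
ax_histy = fig.add_subplot(gs[1, 1], sharey=ax_main)  # histogram to the right

ax_main.scatter(x, y)
ax_main.set_axis_off()  # the main plot's own axes looked redundant

ax_histx.hist(x, bins=30)
ax_histy.hist(y, bins=30, orientation="horizontal")

fig.tight_layout()  # removes the gaps between the subplots
plt.show()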

I remove the axes from the main plot (set_axis_off()) because they looked redundant. tight_layout is used to remove the gaps between the subplots. You can tweak width_ratios and height_ratios to change how much space is allocated to the histograms and the main plot.

PostgreSQL: JSON aggregation

Posted by Moser on 17 Mar 2018

On a client project I have to search a structure of tables and return results including some child objects: a classical instance of the N+1 query problem, if I were using an ORM. I decided not to use the ORM for this feature because it will be one of the hottest paths of the application and I wanted more control over the queries (rather complex logic with LIKE and sorting). But the filtering/sorting is not the topic today, so I will leave it out of the examples.

For illustration, let’s assume the following schema:

CREATE TABLE parents (
  parent_id INTEGER PRIMARY KEY,
  name VARCHAR
);

CREATE TABLE children (
  child_id INTEGER PRIMARY KEY,
  name VARCHAR,
  birthdate DATE,
  parent_id INTEGER,
  FOREIGN KEY (parent_id) REFERENCES parents
);

Without any further thinking I would do a query like this, plus a bit of code that constructs parent objects with their respective children.

SELECT * FROM parents JOIN children USING (parent_id) ORDER BY parent_id;
 parent_id | name  | child_id |  name   | birthdate  
-----------+-------+----------+---------+------------
         1 | Jim   |        1 | Tamara  | 2017-02-01
         1 | Jim   |        3 | Tom     | 2005-10-01
         1 | Jim   |        5 | Tonja   | 2011-07-17
         2 | Jenny |        2 | Tim     | 2000-11-02
         2 | Jenny |        4 | Theresa | 2017-04-30

Just before I started writing the logic to pull apart the result and put it into its object structure, I thought: “It would be nice to let the database put together all the children of one parent into an array.”

I already knew array_agg, which aggregates all values into an array. After some reading I discovered json_build_object, which takes a sequence of keys and values and creates a JSON object from them.

So my final query looked like this:

SELECT
    parent_id
  , p.name
  , array_agg(json_build_object(
      'child_id', c.child_id
    , 'name', c.name
    , 'birthdate', c.birthdate
  )) as children
FROM parents p
JOIN children c USING (parent_id)
GROUP BY 1, 2;
 parent_id | name  |                                                                                                             children                                                                                                              
-----------+-------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
         1 | Jim   | {"{\"child_id\" : 1, \"name\" : \"Tamara\", \"birthdate\" : \"2017-02-01\"}","{\"child_id\" : 3, \"name\" : \"Tom\", \"birthdate\" : \"2005-10-01\"}","{\"child_id\" : 5, \"name\" : \"Tonja\", \"birthdate\" : \"2011-07-17\"}"}
         2 | Jenny | {"{\"child_id\" : 2, \"name\" : \"Tim\", \"birthdate\" : \"2000-11-02\"}","{\"child_id\" : 4, \"name\" : \"Theresa\", \"birthdate\" : \"2017-04-30\"}"}

When the query is executed with SQLAlchemy, the children column in the resulting rows is already correctly typed as a list of dictionaries.
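
For example (SQLAlchemy 1.x API; the connection string is a made-up placeholder):

import sqlalchemy as sa

engine = sa.create_engine("postgresql:///familydb")  # hypothetical database

query = sa.text("""
    SELECT parent_id, p.name,
           array_agg(json_build_object(
               'child_id', c.child_id, 'name', c.name, 'birthdate', c.birthdate
           )) AS children
    FROM parents p
    JOIN children c USING (parent_id)
    GROUP BY 1, 2
""")

with engine.connect() as conn:
    for row in conn.execute(query):
        # psycopg2 decodes the json values, so row.children is a list of dicts
        print(row.name, [child["name"] for child in row.children])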

The explain analyze output for both queries (on a trivially small test set) shows that the aggregated version is 50-100% slower (150-200μs vs. 250-350μs), but I guess that will rarely be a real problem because, at least in my case, the execution time of the query is dominated by filtering/sorting the parent rows.

If you would like to play with the example yourself, get the SQL file here.

Python: bare except statement - why it is bad.

Posted by Moser on 20 Sep 2017

The bare except statement is something I often see in the main loop of a script that should run indefinitely. Pylint warns you about this problem (W0702: No exception type(s) specified (bare-except)), but in my opinion we should start treating this as an error, like flake8 does (E722 do not use bare 'except').

Consider the following code snippet:

import sys
import time

def risky_business():
  time.sleep(1)

while True:
  try:
    risky_business()
  except:
    print("Problems?", sys.exc_info())

This will work nicely to keep your program running whatever happens inside your risky_business function. But let’s try to stop the program by pressing CTRL-C.

$ python a.py 
^C('Problems?', (<type 'exceptions.KeyboardInterrupt'>, KeyboardInterrupt(), <traceback object at 0x7f234f6c9dd0>))

We caught the KeyboardInterrupt exception and just carried on. But we also catch things we should not. Let’s try to exit the program from the business function, maybe because the user asked for it or because we received a signal.

def risky_business():
  time.sleep(1)
  sys.exit(0)
$ python a.py 
('Problems?', (<type 'exceptions.SystemExit'>, SystemExit(0,), <traceback object at 0x7f49b7ab5e18>))
('Problems?', (<type 'exceptions.SystemExit'>, SystemExit(0,), <traceback object at 0x7f49b7ab5ea8>))
...

So why is this happening? Both KeyboardInterrupt and SystemExit are exceptions (derived from BaseException). An except statement without any restriction will catch any exception derived from BaseException. Here is the complete hierarchy of the builtin exceptions.

We can fix our problem by restricting the kind of exception that we want to catch:

def risky_business():
  time.sleep(1)

while True:
  try:
    risky_business()
  except Exception:
    print("Problems?", sys.exc_info())

When we press CTRL-C now, the program exits with the right exception.

$ python a.py
^CTraceback (most recent call last):
  File "a.py", line 10, in <module>
    risky_business()
  File "a.py", line 6, in risky_business
    time.sleep(1)
KeyboardInterrupt

Pylint will still warn you about this line, because catching Exception is still very broad; you should only do this when you have a good reason, e.g. the one mentioned above (keeping a service running even if it could not handle a certain input or request).

So next time you see a bare except statement in your code, take a moment to reconsider whether you really want to catch SystemExit, KeyboardInterrupt (and GeneratorExit, which I did not mention above). If that is the case, you might want to make it explicit so that the next person reading the code does not have to check again.
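
If you really do mean to catch everything, a sketch of making that explicit:

while True:
  try:
    risky_business()
  except BaseException:  # explicit: also catches SystemExit, KeyboardInterrupt, GeneratorExit
    print("Problems?", sys.exc_info())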