McSinyx

Inclusion in Python Environment Algebra

Note

If you do not want to hear me rambling about the shortcomings of Python packaging metadata standards, feel free to skip to section 1.c.

  1. Background
    1. Motivation
    2. Inspiration
    3. Objective
    4. Theory
  2. Implementation
    1. Generation
    2. Conversion
    3. Verification

Background

Motivation

I sometimes praise Python wheels as the best binary packaging format for cross-platform delivery. It truly is, at least to my limited knowledge. Via compatibility tags, the constraints for the target environment can be exclusively specified. Even though PEP 425#compressed-tag-sets was introduced

to allow for compact filenames of bdists[1] that work with more than one compatibility tag triple,

the filenames are not always compact and it is not always possible to specify all environments a module can work on either. For instance, there's no direct answer for one of the FAQ:

What tag do I use if my distribution uses a feature exclusive to the newest version of Python?

I have seen workarounds like cp35.cp36.cp37.cp38.cp39, which accept CPython from version 3.5, but not 3.10, which is the topic of the next Q&A:

Why isn't there a . in the Python version number?

CPython has lasted 20+ years without a 3-digit major release. This should continue for some time. Other implementations may use _ as a delimiter, since both - and . delimit the surrounding filename.

The platform tag was not carefully standardized either, which gives us inconsistent cyphers such as win32, win_amd64 and manylinux2010_x86_64.[2]

Inspiration

A few years later, PEP 508#environment-markers came along after multiple iterations (PEP 345, PEP 426) with a much more human-readable format, which is essentially a Boolean algebra of comparisons. With this, the aforementioned CPython constraint can be correctly expressed as

platform_python_implementation == 'CPython'
and python_version >= '3.5'

Unfortunately, they are only standardized for marking dependencies, usually to watch out for environments where a library is either unnecessary or unusable. But who watches the watchers? While it is possible to specify Python and platform tags when building wheels, the feature is undocumented and almost unknown. Therefore, many, if not most, platform-specific pure-Python wheels are tagged as none platform and all their dependents needs to have the same redundant markers. Furthermore, due to the exclusive nature of compatibility tags, a GNU/Linux-only purelib wheel should be tagged with something like the following abomination:

manylinux_2_17_x86_64.manylinux_2_17_i686.manylinux_2_17_aarch64.manylinux_2_17_armv7l.manylinux_2_17_ppc64.manylinux_2_17_ppc64le.manylinux_2_17_s390x

Now imagine the filename of a wheel that only works on Unix-like systems. uvloop is one and it is among the 500 most-downloaded on PyPI at the time of writing.

Objective

This article is, though, not going to sell you the idea of porting environment markers to top-level WHEEL metadata. I should sit down with the PyPA people and discuss it one day, but the currently defined markers are lacking CPU and ABI information. The reason for this is that Python's standard library does not implement a way to retrieve such information, and for completeness, the operating system's version is not meaningfully available either. If we want to redo this right, I highly recommend looking into Zig's std.Target, which was designed from the start for cross-compilation. Whatever we will come up with, it shall not be done overnight.

Instead, right now I need to solve a more immediate problem. For the last few months, I have been working on IPWHL, a downstream repository for Python distributions. Since we enforce that all dependencies to be satisfied, packages with too strict or too lax environment declaration will break the automated tests, even though effectively the declared packages integrate well with each other. We came to a conclusion that adding information would improve the checks' accuracy and decided to use the floating cheeses as a testbed for top-level environment specifiers.

In short, we are going to merge the METADATA field Requires-Python with other environmental information to a single field environment. Since other information may or may not be present in the filename, during metadata integrity check we can only check if what declared is included by what found from the original wheel.

Theory

Ignoring extra, a Python environment can be represented as a 11-tuple of

(python_version, python_full_version, os_name, sys_platform,
 platform_release, platform_system, platform_version,
 platform_machine, platform_python_implementation,
 implementation_name, implementation_version)

Denote SS as the set of all strings, then the set of all environments ES11E \subset S^{11}. Let

O={,<,,=,,>,,,,,,}O = \{\le, <, \ne, =, \ge, >, \cong, \equiv, \in, \notin, \ni, \notni\}

the set of all marker expressions[3] can be defined as

X={eesOsS}X = \{e \mapsto e \bigcirc s \mid \bigcirc \in O \land s \in S\}

and the set of all environment markers[4] can be recursively defined as

M={c}{(e1,,e11)m(e1,,e11)x(eı)mMxXıI}{(e1,,e11)m(e1,,e11)x(eı)mMxXıI}\begin{aligned} M = \{c_\top\} &\cup \{(e_1, \ldots, e_{11}) \mapsto m(e_1, \ldots, e_{11}) \land x(e_{\imath}) \mid m \in M \land x \in X \land \imath \in I\}\\ &\cup \{(e_1, \ldots, e_{11}) \mapsto m(e_1, \ldots, e_{11}) \lor x(e_{\imath}) \mid m \in M \land x \in X \land \imath \in I\} \end{aligned}

where \top is the top element in Boolean algebra, c:E{}c_\top: E \to \{\top\} is a constant function and I=[1,11]I = [1, 11]. With all Boolean expressions converted into one of their standard forms, we have the following alternative definitions of an environment marker

(e1,,e11)ı=1nȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))(e1,,e11)ı=1nȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))\begin{aligned} (e_1, \ldots, e_{11}) &\mapsto \bigvee_{\imath = 1}^n\bigwedge_{\jmath = 1}^{f(\imath)} x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\\ (e_1, \ldots, e_{11}) &\mapsto \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right) \end{aligned}

where nNn \in \N, f:N1N1f: \N_1 \to \N_1, x:N12Xx: \N_1^2 \to X, k:N12Ik: \N_1^2 \to I and an empty Boolean expression is implicitly cc_\top.

Now it is possible to formalize inclusion for markers as

{m,n}M:(mn)=(eE:m(e)n(e))\forall\,\{m, n\} \subset M: (m \le n) = (\forall\,e \in E: m(e) \le n(e))

Thus, checking if a marker includes another is equivalent to evaluating the following expression, for all (e1,,e11)E(e_1, \ldots, e_{11}) \in E

r=ı=1nȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))ı=1nȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))=ı=1n(ȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))ı=1nȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ)))=ı=1nı=1n(ȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ))ȷ=1f(ı)x(ı,ȷ)(ek(ı,ȷ)))\begin{aligned} r &= \bigvee_{\imath' = 1}^{n'}\bigwedge_{\jmath' = 1}^{f'(\imath')} x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right) \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\\ &= \bigwedge_{\imath' = 1}^{n'}\left(\bigwedge_{\jmath' = 1}^{f'(\imath')} x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right) \le \bigwedge_{\imath = 1}^n\bigvee_{\jmath = 1}^{f(\imath)} x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\right)\\ &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n \left(\bigwedge_{\jmath' = 1}^{f'(\imath')} x'(\imath', \jmath')\left(e_{k'(\imath', \jmath')}\right) \le \bigvee_{\jmath = 1}^{f(\imath)} x(\imath, \jmath)\left(e_{k(\imath, \jmath)}\right)\right) \end{aligned}

where rr is whether the original marker (defined by nn, ff, xx and kk) includes the declared one (defined by nn', ff', xx' and kk'). Let

g(ı,)={ȷȷ[1,f(ı)]k(ı,ȷ)=}g(ı,)={ȷȷ[1,f(ı)]k(ı,ȷ)=}\begin{aligned} g(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f(\imath)] \land k(\imath, \jmath) = \ell\}\\ g'(\imath, \ell) &= \{\jmath \mid \jmath \in [1, f'(\imath)] \land k'(\imath, \jmath) = \ell\} \end{aligned}

and assume e1,,e11e_1, \ldots, e_{11} are orthogonal, we have

r=ı=1nı=1nI(ȷg(ı,)x(ı,ȷ)(e)ȷg(ı,)x(ı,ȷ)(e))=ı=1nı=1nI(ȷg(ı,)x(ı,ȷ)(e)¬ȷg(ı,)x(ı,ȷ)(e)=)\begin{aligned} r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\imath', \jmath')\left(e_{\ell}\right) \le \bigvee_{\jmath \in g(\imath, \ell)} x(\imath, \jmath)\left(e_{\ell}\right)\right)\\ &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\imath', \jmath')\left(e_{\ell}\right) \land \lnot \bigvee_{\jmath \in g(\imath, \ell)} x(\imath, \jmath)\left(e_{\ell}\right) = \bot\right) \end{aligned}

where \bot is the Boolean algebra bottom element. By De Morgan's laws,

r=ı=1nı=1nI(ȷg(ı,)x(ı,ȷ)(e)ȷg(ı,)¬x(ı,ȷ)(e)=)\begin{aligned} r &= \bigwedge_{\imath' = 1}^{n'}\bigwedge_{\imath = 1}^n\bigvee_{\ell \in I} \left(\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\imath', \jmath')\left(e_{\ell}\right) \land \bigwedge_{\jmath \in g(\imath, \ell)} \lnot x(\imath, \jmath)\left(e_{\ell}\right) = \bot\right) \end{aligned}

Essentially, checking if

ȷg(ı,)x(ı,ȷ)(e)ȷg(ı,)¬x(ı,ȷ)(e)\bigwedge_{\jmath' \in g'(\imath', \ell)} x'(\imath', \jmath')\left(e_{\ell}\right) \land \bigwedge_{\jmath \in g(\imath, \ell)} \lnot x(\imath, \jmath)\left(e_{\ell}\right)

is unsatisfiable for all ee_{\ell} is a SAT problem which we can write a solver for. Indeed, the original expression could be formulated as a SAT, but it would be more complex due to the multiple dimensions of environments.

Implementation

Generation

Conversion

Verification

[1] Built distributions, as opposed to sdist
[2] Bonus: try to sort manylinux1, manylinux_2_24 and manylinux2014
[3] Ignoring relations between two variables or two strings
[4] Or environment specifiers when not marking anything