Broken-by-Design/posts/python-string-emptiness.md

469 lines
15 KiB
Markdown
Raw Permalink Normal View History

2022-05-25 10:59:36 +00:00
---
title: "Python String Emptiness Test"
date: 2019-03-25T00:00:00+00:00
draft: false
categories: ["python"]
tags: [ "python", "secureprogramming" ]
2022-05-28 11:05:54 +00:00
lang: en
2022-05-25 10:59:36 +00:00
---
Security specialists have this tendency to focus on the most mundane things,
and overthink them to the point where they may actually find something
interesting to say about them.
Is that a useful use of their time? That question is open to debate ;p
But I believe writing about them IS a useful use of my time on a gray Sunday
morning!
# Python String Emptiness Test
A while back, I entered a lengthy debate about condition expressions in Python
with one of my colleagues during a merge conflict. We both wrote a utility
function expressing the exact same condition, but we had written it in a
different way.
He wrote this:
{{< highlight python >}}
def should_i_do_smth():
# ...
return bool(some_string) and some_int == 0
{{< /highlight >}}
I wrote that:
{{< highlight python >}}
def should_i_do_smth():
# ...
return len(some_string) != 0 and some_int == 0
{{< /highlight >}}
The only use case of that function was in an if condition:
{{< highlight python >}}
if should_i_do_smth():
{{< /highlight >}}
Whose code was merged is not really relevant, but I decided to create a Twitter
poll to probe my followers on the topic of writing emptiness tests of Python
strings.
The poll was:
> According to you, which of the four following lines of #Python code (poll) is
> the most *self-explanatory* and the *safest* to express that the string
> variable "value" is not empty? All answers are valid Python.
>
>* `if value:`
>* `if bool(value):`
>* `if len(value) != 0:`
>* `if value != '':`
## Regarding the poll itself
The wording of the poll is vague on purpose. Depending on the reader, some of
the words have different meanings. That was a deliberate choice. It is very
hard to express formal things in a natural language. So, even if you have clear
instructions, and documentation, and code comments, in the end, what really
matters is what was on the mind of the dev when they wrote that line of code.
People who answered the poll may have read differently:
* Safest: did I mean that the code must be robust (i.e. not crash) or secure
(i.e. act consistently and fail during static analysis)?
* String variable: did I mean `str` or `Optional[str]`, only Unicode string
(`''`) or also bytestring (`b''`)? Was there a test to control the type? Or
static typing analysis?
* Not empty: did I mean that we should refuse the empty string only, or should
we also refuse None?
* Self-explanatory: did I mean that the code should be readable by Python
developers abiding to PEP8 and the Zen, or should the code be readable by most
people, including security auditors or junior developers that may not be
knowledgeable of PEP8 best current practices. Is readable equivalent to
explicit and unambiguous?
Of course, the poll results are heavily biased. Most of my followers are
security specialists, not software engineers, and certainly not full-time
Python developers.
At any rate, I find the results quite interesting, because a majority of people
did NOT follow the PEP-8 recommendation when asked this question.
At the time of writing, with 54 voters, the poll results is:
* `if value:` 38%
* `if bool(value):` 4%
* `if len(value) != 0:` 34%
* `if value != '':` 24%
## Answer analysis
Each of the poll answers is valid Python, but they all also have merits.
### <a name="first"></a>First Answer
`if value:` might be readable by most full-time Python developers because it is
the Python way, recommended by PEP-8. It works because the empty string `''` is
evaluated to False in the context of a condition expression.
To some extent, it is comparable to the following C code :
{{< highlight c >}}
char * value;
... instructions ...
if (value && *value) {
{{< /highlight >}}
This code plays on the fact that a `NULL` pointer values 0, which is evaluated
to `False` in C. It also plays on the fact that if the string is empty, the
first character will be the `NULL` byte (`\0`), which is also evaluated to
`False`. This type of syntax requires your reader to be knowledgeable about the
language. Thus, if they are, it is very readable. If they are not, then the
jury is out.
But the real issue of this syntax is that there is absolutely no type checking
whatsoever. That's not really a problem if you embrace Python's duck typing and
you are well prepared for the consequences of that choice. If you only consider
the poll question as it is formulated, and the code is properly guarded so that
you are certain that the `value` variable is a Unicode string, this approach
bares no risks.
But what if I lied and the value variable is not a string? What if it is a
bytestring? The None value? An integer? An arbitrary object? What if the code
is not properly guarded?
I mean, this can happen easily when you refactor your code or when you meddle
with some code you don't fully understand. Sure, unit tests and static analysis
are valid answers to refactoring problems... but let's be honest here:
correctly written unit tests are not frequent in the startup industry.
And mistakes happen. One of my followers told me a "war story":
{{< highlight python >}}
if i:
{{< /highlight >}}
Yeah, that's it ^_^ A colleague of them wrote this test to guard against the
`None` value... but `i` was an integer. And some day, that integer equaled 0.
If you don't know, 0, in Python, is evaluated as `False` in a condition
expression. Oops.
But let's say that this integer war story is a corner case. What would happen
if the value variable was in fact an instance of a type implementing `__bool__`
or `__len__`?
In Python, truth evaluation on arbitrary object is done by trying to call
`__bool__` that should return the truth value for this object. If `__bool__` is
not implemented, `__len__` is tried. If `__len__` returns 0, then the
evaluation result is `False`, else `True`. Finally, if `__len__` is not
implemented, then it returns `True` for variable that are not `None` nor a
scalar that evaluates to `None`.
{{< highlight python >}}
class Test:
def __bool__(self):
print('>>Bool!')
return True
value = Test()
if value():
print('>>Hey!')
{{< /highlight >}}
```
>>Bool!
>>Hey!
```
----
{{< highlight python >}}
class Test:
def __len__(self):
print('>>Len!')
return 1
value = Test()
if value:
print('>>Hey!')
{{< /highlight >}}
```
>>Len!
>>Hey!
```
----
{{< highlight python >}}
class Test:
def __len__(self):
print('>>Len!')
return 0
value = Test()
if value():
print('>>Hey!')
{{< /highlight >}}
```
>>Len!
```
----
{{< highlight python >}}
class Test: pass
value = Test()
if value:
print('>>Hey!')
{{< /highlight >}}
```
>>Hey!
```
With such a high risk of confusion on the actual type of the object, I would
personally bet that there is a very high probability of having some unstable
code that will assume something about what the `value` variable is capable of,
resulting in the infamous `AttributeError: 'Foo' object has no attribute
'bar'`. Or worse: no crash and a very stupid result.
You understood it, this all boils down to the discussion about typing, and the
fact that the more checks (notably typing checks) are done statically, the more
robust the code. And that is a fact that many languages have caught on. PHP,
Python, Javascript, Perl, almost all interpreted languages (with the exception
of Ruby &#128514;) have adopted a type hinting syntax to help static analyzers
verify the code sanity. In Python, PEP-484 defines the syntax and you can use
MyPy to do the checking.
So, if you hinted that the type of value was `str` before the `if value:`
statement, then your code is probably safe and robust if you use a static
analyzer.
{{< highlight python >}}
def foo(value: str):
{{< /highlight >}}
If you do not have a type hint and you don't use a static analyzer, then your
code is not readable because your intention about what is acceptable is not
explicit. Thus, and your code is not robust/safe because you will probably end
up crashing.
### <a name="second"></a>Second Answer
The second answer is nonsensical and I admittedly did a poor job at
representing the opinion of my colleague when writing this poll. Let's put it
on the fact that I was forced to use their solution.
`if bool(value):` is stupid because we already are in the context of a boolean
evaluation, so the bool casting is redundant. However, their opinion is not
without ground in other contexts.
For instance, in the following code, what would you replace **XXX** with?
{{< highlight python >}}
def foo() -> Tuple[str, int]:
"""some instructions"""
# ...
def bar() -> XXX:
s, i = foo()
return s and i
{{< /highlight >}}
If you answered `bool`, you are wrong. **XXX** must be replaced by `Union[str,
int]`.
```
error: Incompatible return value type (got "Union[str, int]", expected "bool")
```
That's because Python logical operators do not return booleans! In the case of
the `and` operator, it returns the first value that cannot be evaluated to
`True`.
Thus, if you meant the bar function to return a boolean, you need to cast the
whole expression as bool before returning it.
{{< highlight python >}}
def bar() -> bool:
s, i = foo()
return bool(s and i)
{{< /highlight >}}
You can also only cast elements that are not comparisons, because, at least,
comparisons do return booleans:
{{< highlight python >}}
def bar() -> bool:
s, i = foo
return bool(s) and i != 0
{{< /highlight >}}
The fact that Python logical operators return arbitrary values is very
surprising to people that come from some other languages, such as C or even
PHP, where the logical operators return a truth value. For people coming from
very strongly typed languages such as Ocaml or Rust, this is difficult to even
consider, because logical operators are typed and would only accept booleans as
operands in the first place.
{{< highlight bash >}}
$ echo '''#include <stdio.h>
int main(){
printf("> %d\n", 2 && 3);
return 0;
}''' | gcc -o /tmp/condition -x c -
$ /tmp/condition
>1
$ php -r 'echo 2 && 3;'
1
{{< /highlight >}}
Once you consider that you need to explicitly cast to bool all of your values
in a logical expression to output a boolean, the readability and
robustness/safety arguments are identical to those of the first answer.
### <a name="third"></a>Third Answer
The third answer is the first that is not relying on the Python language
specification to express a condition. It explicitly states which particular
characteristic of the content of the `value` variable is studied and must not
be equal to 0 and it returns an unambiguous result: a truth value: True or
False.
Regarding code safety, the result is mixed. It is the only answer out of the
four that will crash the program during runtime if value is the None value.
```
TypeError: object of type 'NoneType' has no len()
```
Depending on the purpose of the program, crashing early may or may not be
desirable. A ex-colleague working on machine learning told me that they would
prefer to have a program crash early because of a type bug like this, than to
have the program run for hours only to crash later with a `AttributeError:
'NoneType' object has no attribute 'foo'`.
Of course, the best option is to test the program, and check the parameters
against the API contract so that this situation never occurs. But what if the
test is broken (for instance, because it was written with the syntax from the
first answer) and it let through a value that was not intended?
Unfortunately, this answer is also susceptible to cause late crashes because of
duck typing. The `len()` function is only a keyword that calls on the `__len__`
implementation of the receiver (that is the object provided as `len()`'s
argument. This means that this test won't guard against objects implementing
`__len__`, just like `str`, but not being a string. The perfect builtin example
would be the bytestring.
{{< highlight python >}}
value = b'bonjour'
if len(value) != 0:
l = a.split('o')
{{< /highlight >}}
```
TypeError: a bytes-like object is required, not 'str'
```
All in all, this answer is relatively explicit in what is tested and what's the
intended result. But even if an early crash might be desirable, a crash is not
certain because duck typing will let invalid objects go through.
### <a name="fourth"></a>Fourth Answer
The fourth answer is pretty straightforward. The meaning is crystal clear: it
evaluates to `True` as long as `value` is not the empty Unicode string. Anybody
gifted with eyes can read this expression and understand what is the intent of
the author.
It won't be abused by the empty bytestring. It won't be abused by the `None`
value. It won't be abused by objects implementing `__str__`.
{{< highlight python >}}
class Test:
def __str__(self) -> str:
return ""
value = Test()
if value != "":
print("Foo")
{{< /highlight >}}
```
Foo
```
Wait, what? It printed `Foo`?! Oh, yeah. Right. All objects of a different type
than str, and even objects of type str if they are not the empty string will
make this condition evaluate to True. They are not equal to the empty string.
That is exactly what is expressed by the condition.
If what you really mean is that the value must be a string and not the empty
string, then, you need to write an even more explicit version of this:
{{< highlight python >}}
if isinstance(value, str) and value != "":
{{< /highlight >}}
(note: using `isinstance()` is prefered to `type(value) is str` just in case
someone got the extra-weird idea to subclass `str`; don't do that.
Unfortunately, `type(value) is str` would have been so much more readable)
## A Word About Efficiency
To my very own surprise, there is a significant difference of performance
between the four answers. The slowest take twice as much time as the fastest
(2.19 times more, to be precise).
It is in the realm of micro-optimizations but since the difference is so large,
I think the result is worth including in this post.
For 10 million iterations, run 100 times to clear out statistical noise, the
average results are:
Answer | Time | Ratio
--------|-------|-------
First | 1.28s | 1
Second | 2.81s | 2.19
Third | 1.92s | 1.5
Fourth | 1.52s | 1.19
# Conclusion
As I stated in the poll itself, all four of the answers are totally valid
Python code. However some answers will make full-time Python developers abiding
by the PEP-8 recommendations cringe. Some even implemented tools that would
whine specifically on one of the answers. That's the case of the third answer
that will make pylint, a Python code linter, complain about the use of `len()`
to test if an sequence is empty (because a `str` is also an character
sequence):
```
Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
```
We also saw during the discussion about the advantages and drawbacks of the
first answer that if one makes a proper use of type hinting and static code
analysis (e.g. mypy), there may be little code safety problems with it; just a
readability issue because the intent of the developers about which use cases
are covered remains obscure.
Overall, my personal belief is that the fourth answer (especially when guarded
with the `isinstance()` call) should be prefered because the developer's intent
is explicit and one gets exactly what they asked for, without any of the magic
offered by the language.
---
Please post your comments as answers on [Twitter](https://twitter.com/X_Cli_Public/status/1110275080221806593) or on the [Fediverse](https://infosec.exchange/@x_cli/101813137227224905)!