468 lines
15 KiB
Markdown
468 lines
15 KiB
Markdown
---
|
|
title: "Python String Emptiness Test"
|
|
date: 2019-03-25T00:00:00+00:00
|
|
draft: false
|
|
categories: ["python"]
|
|
tags: [ "python", "secureprogramming" ]
|
|
lang: en
|
|
---
|
|
|
|
Security specialists have this tendency to focus on the most mundane things,
|
|
and overthink them to the point where they may actually find something
|
|
interesting to say about them.
|
|
|
|
Is that a useful use of their time? That question is open to debate ;p
|
|
|
|
But I believe writing about them IS a useful use of my time on a gray Sunday
|
|
morning!
|
|
|
|
# Python String Emptiness Test
|
|
|
|
A while back, I entered a lengthy debate about condition expressions in Python
|
|
with one of my colleagues during a merge conflict. We both wrote a utility
|
|
function expressing the exact same condition, but we had written it in a
|
|
different way.
|
|
|
|
He wrote this:
|
|
|
|
{{< highlight python >}}
|
|
def should_i_do_smth():
|
|
# ...
|
|
return bool(some_string) and some_int == 0
|
|
|
|
{{< /highlight >}}
|
|
|
|
I wrote that:
|
|
|
|
{{< highlight python >}}
|
|
def should_i_do_smth():
|
|
# ...
|
|
return len(some_string) != 0 and some_int == 0
|
|
{{< /highlight >}}
|
|
|
|
The only use case of that function was in an if condition:
|
|
|
|
{{< highlight python >}}
|
|
if should_i_do_smth():
|
|
|
|
{{< /highlight >}}
|
|
|
|
Whose code was merged is not really relevant, but I decided to create a Twitter
|
|
poll to probe my followers on the topic of writing emptiness tests of Python
|
|
strings.
|
|
|
|
The poll was:
|
|
|
|
> According to you, which of the four following lines of #Python code (poll) is
|
|
> the most *self-explanatory* and the *safest* to express that the string
|
|
> variable "value" is not empty? All answers are valid Python.
|
|
>
|
|
>* `if value:`
|
|
>* `if bool(value):`
|
|
>* `if len(value) != 0:`
|
|
>* `if value != '':`
|
|
|
|
## Regarding the poll itself
|
|
|
|
The wording of the poll is vague on purpose. Depending on the reader, some of
|
|
the words have different meanings. That was a deliberate choice. It is very
|
|
hard to express formal things in a natural language. So, even if you have clear
|
|
instructions, and documentation, and code comments, in the end, what really
|
|
matters is what was on the mind of the dev when they wrote that line of code.
|
|
|
|
People who answered the poll may have read differently:
|
|
|
|
* Safest: did I mean that the code must be robust (i.e. not crash) or secure
|
|
(i.e. act consistently and fail during static analysis)?
|
|
|
|
* String variable: did I mean `str` or `Optional[str]`, only Unicode string
|
|
(`''`) or also bytestring (`b''`)? Was there a test to control the type? Or
|
|
static typing analysis?
|
|
|
|
* Not empty: did I mean that we should refuse the empty string only, or should
|
|
we also refuse None?
|
|
|
|
* Self-explanatory: did I mean that the code should be readable by Python
|
|
developers abiding to PEP8 and the Zen, or should the code be readable by most
|
|
people, including security auditors or junior developers that may not be
|
|
knowledgeable of PEP8 best current practices. Is readable equivalent to
|
|
explicit and unambiguous?
|
|
|
|
Of course, the poll results are heavily biased. Most of my followers are
|
|
security specialists, not software engineers, and certainly not full-time
|
|
Python developers.
|
|
|
|
At any rate, I find the results quite interesting, because a majority of people
|
|
did NOT follow the PEP-8 recommendation when asked this question.
|
|
|
|
At the time of writing, with 54 voters, the poll results is:
|
|
|
|
* `if value:` 38%
|
|
* `if bool(value):` 4%
|
|
* `if len(value) != 0:` 34%
|
|
* `if value != '':` 24%
|
|
|
|
## Answer analysis
|
|
Each of the poll answers is valid Python, but they all also have merits.
|
|
|
|
### <a name="first"></a>First Answer
|
|
|
|
`if value:` might be readable by most full-time Python developers because it is
|
|
the Python way, recommended by PEP-8. It works because the empty string `''` is
|
|
evaluated to False in the context of a condition expression.
|
|
|
|
To some extent, it is comparable to the following C code :
|
|
|
|
{{< highlight c >}}
|
|
char * value;
|
|
... instructions ...
|
|
if (value && *value) {
|
|
{{< /highlight >}}
|
|
|
|
This code plays on the fact that a `NULL` pointer values 0, which is evaluated
|
|
to `False` in C. It also plays on the fact that if the string is empty, the
|
|
first character will be the `NULL` byte (`\0`), which is also evaluated to
|
|
`False`. This type of syntax requires your reader to be knowledgeable about the
|
|
language. Thus, if they are, it is very readable. If they are not, then the
|
|
jury is out.
|
|
|
|
But the real issue of this syntax is that there is absolutely no type checking
|
|
whatsoever. That's not really a problem if you embrace Python's duck typing and
|
|
you are well prepared for the consequences of that choice. If you only consider
|
|
the poll question as it is formulated, and the code is properly guarded so that
|
|
you are certain that the `value` variable is a Unicode string, this approach
|
|
bares no risks.
|
|
|
|
But what if I lied and the value variable is not a string? What if it is a
|
|
bytestring? The None value? An integer? An arbitrary object? What if the code
|
|
is not properly guarded?
|
|
|
|
I mean, this can happen easily when you refactor your code or when you meddle
|
|
with some code you don't fully understand. Sure, unit tests and static analysis
|
|
are valid answers to refactoring problems... but let's be honest here:
|
|
correctly written unit tests are not frequent in the startup industry.
|
|
|
|
And mistakes happen. One of my followers told me a "war story":
|
|
|
|
{{< highlight python >}}
|
|
if i:
|
|
{{< /highlight >}}
|
|
|
|
Yeah, that's it ^_^ A colleague of them wrote this test to guard against the
|
|
`None` value... but `i` was an integer. And some day, that integer equaled 0.
|
|
If you don't know, 0, in Python, is evaluated as `False` in a condition
|
|
expression. Oops.
|
|
|
|
But let's say that this integer war story is a corner case. What would happen
|
|
if the value variable was in fact an instance of a type implementing `__bool__`
|
|
or `__len__`?
|
|
|
|
In Python, truth evaluation on arbitrary object is done by trying to call
|
|
`__bool__` that should return the truth value for this object. If `__bool__` is
|
|
not implemented, `__len__` is tried. If `__len__` returns 0, then the
|
|
evaluation result is `False`, else `True`. Finally, if `__len__` is not
|
|
implemented, then it returns `True` for variable that are not `None` nor a
|
|
scalar that evaluates to `None`.
|
|
|
|
{{< highlight python >}}
|
|
class Test:
|
|
def __bool__(self):
|
|
print('>>Bool!')
|
|
return True
|
|
|
|
value = Test()
|
|
if value():
|
|
print('>>Hey!')
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
>>Bool!
|
|
>>Hey!
|
|
```
|
|
|
|
----
|
|
|
|
{{< highlight python >}}
|
|
class Test:
|
|
def __len__(self):
|
|
print('>>Len!')
|
|
return 1
|
|
|
|
value = Test()
|
|
if value:
|
|
print('>>Hey!')
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
>>Len!
|
|
>>Hey!
|
|
```
|
|
|
|
----
|
|
|
|
{{< highlight python >}}
|
|
class Test:
|
|
def __len__(self):
|
|
print('>>Len!')
|
|
return 0
|
|
|
|
value = Test()
|
|
if value():
|
|
print('>>Hey!')
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
>>Len!
|
|
```
|
|
|
|
----
|
|
|
|
{{< highlight python >}}
|
|
class Test: pass
|
|
|
|
value = Test()
|
|
if value:
|
|
print('>>Hey!')
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
>>Hey!
|
|
```
|
|
|
|
With such a high risk of confusion on the actual type of the object, I would
|
|
personally bet that there is a very high probability of having some unstable
|
|
code that will assume something about what the `value` variable is capable of,
|
|
resulting in the infamous `AttributeError: 'Foo' object has no attribute
|
|
'bar'`. Or worse: no crash and a very stupid result.
|
|
|
|
You understood it, this all boils down to the discussion about typing, and the
|
|
fact that the more checks (notably typing checks) are done statically, the more
|
|
robust the code. And that is a fact that many languages have caught on. PHP,
|
|
Python, Javascript, Perl, almost all interpreted languages (with the exception
|
|
of Ruby 😂) have adopted a type hinting syntax to help static analyzers
|
|
verify the code sanity. In Python, PEP-484 defines the syntax and you can use
|
|
MyPy to do the checking.
|
|
|
|
So, if you hinted that the type of value was `str` before the `if value:`
|
|
statement, then your code is probably safe and robust if you use a static
|
|
analyzer.
|
|
|
|
{{< highlight python >}}
|
|
def foo(value: str):
|
|
{{< /highlight >}}
|
|
|
|
If you do not have a type hint and you don't use a static analyzer, then your
|
|
code is not readable because your intention about what is acceptable is not
|
|
explicit. Thus, and your code is not robust/safe because you will probably end
|
|
up crashing.
|
|
|
|
### <a name="second"></a>Second Answer
|
|
|
|
The second answer is nonsensical and I admittedly did a poor job at
|
|
representing the opinion of my colleague when writing this poll. Let's put it
|
|
on the fact that I was forced to use their solution.
|
|
|
|
`if bool(value):` is stupid because we already are in the context of a boolean
|
|
evaluation, so the bool casting is redundant. However, their opinion is not
|
|
without ground in other contexts.
|
|
|
|
For instance, in the following code, what would you replace **XXX** with?
|
|
|
|
{{< highlight python >}}
|
|
def foo() -> Tuple[str, int]:
|
|
"""some instructions"""
|
|
# ...
|
|
|
|
def bar() -> XXX:
|
|
s, i = foo()
|
|
return s and i
|
|
{{< /highlight >}}
|
|
|
|
If you answered `bool`, you are wrong. **XXX** must be replaced by `Union[str,
|
|
int]`.
|
|
```
|
|
error: Incompatible return value type (got "Union[str, int]", expected "bool")
|
|
```
|
|
|
|
That's because Python logical operators do not return booleans! In the case of
|
|
the `and` operator, it returns the first value that cannot be evaluated to
|
|
`True`.
|
|
|
|
Thus, if you meant the bar function to return a boolean, you need to cast the
|
|
whole expression as bool before returning it.
|
|
|
|
{{< highlight python >}}
|
|
def bar() -> bool:
|
|
s, i = foo()
|
|
return bool(s and i)
|
|
{{< /highlight >}}
|
|
|
|
You can also only cast elements that are not comparisons, because, at least,
|
|
comparisons do return booleans:
|
|
|
|
{{< highlight python >}}
|
|
def bar() -> bool:
|
|
s, i = foo
|
|
return bool(s) and i != 0
|
|
{{< /highlight >}}
|
|
|
|
The fact that Python logical operators return arbitrary values is very
|
|
surprising to people that come from some other languages, such as C or even
|
|
PHP, where the logical operators return a truth value. For people coming from
|
|
very strongly typed languages such as Ocaml or Rust, this is difficult to even
|
|
consider, because logical operators are typed and would only accept booleans as
|
|
operands in the first place.
|
|
|
|
{{< highlight bash >}}
|
|
$ echo '''#include <stdio.h>
|
|
|
|
int main(){
|
|
printf("> %d\n", 2 && 3);
|
|
return 0;
|
|
}''' | gcc -o /tmp/condition -x c -
|
|
|
|
$ /tmp/condition
|
|
>1
|
|
|
|
$ php -r 'echo 2 && 3;'
|
|
1
|
|
{{< /highlight >}}
|
|
|
|
Once you consider that you need to explicitly cast to bool all of your values
|
|
in a logical expression to output a boolean, the readability and
|
|
robustness/safety arguments are identical to those of the first answer.
|
|
|
|
### <a name="third"></a>Third Answer
|
|
|
|
The third answer is the first that is not relying on the Python language
|
|
specification to express a condition. It explicitly states which particular
|
|
characteristic of the content of the `value` variable is studied and must not
|
|
be equal to 0 and it returns an unambiguous result: a truth value: True or
|
|
False.
|
|
|
|
Regarding code safety, the result is mixed. It is the only answer out of the
|
|
four that will crash the program during runtime if value is the None value.
|
|
|
|
```
|
|
TypeError: object of type 'NoneType' has no len()
|
|
```
|
|
|
|
Depending on the purpose of the program, crashing early may or may not be
|
|
desirable. A ex-colleague working on machine learning told me that they would
|
|
prefer to have a program crash early because of a type bug like this, than to
|
|
have the program run for hours only to crash later with a `AttributeError:
|
|
'NoneType' object has no attribute 'foo'`.
|
|
|
|
Of course, the best option is to test the program, and check the parameters
|
|
against the API contract so that this situation never occurs. But what if the
|
|
test is broken (for instance, because it was written with the syntax from the
|
|
first answer) and it let through a value that was not intended?
|
|
|
|
Unfortunately, this answer is also susceptible to cause late crashes because of
|
|
duck typing. The `len()` function is only a keyword that calls on the `__len__`
|
|
implementation of the receiver (that is the object provided as `len()`'s
|
|
argument. This means that this test won't guard against objects implementing
|
|
`__len__`, just like `str`, but not being a string. The perfect builtin example
|
|
would be the bytestring.
|
|
|
|
{{< highlight python >}}
|
|
value = b'bonjour'
|
|
if len(value) != 0:
|
|
l = a.split('o')
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
TypeError: a bytes-like object is required, not 'str'
|
|
```
|
|
|
|
All in all, this answer is relatively explicit in what is tested and what's the
|
|
intended result. But even if an early crash might be desirable, a crash is not
|
|
certain because duck typing will let invalid objects go through.
|
|
|
|
### <a name="fourth"></a>Fourth Answer
|
|
|
|
The fourth answer is pretty straightforward. The meaning is crystal clear: it
|
|
evaluates to `True` as long as `value` is not the empty Unicode string. Anybody
|
|
gifted with eyes can read this expression and understand what is the intent of
|
|
the author.
|
|
|
|
It won't be abused by the empty bytestring. It won't be abused by the `None`
|
|
value. It won't be abused by objects implementing `__str__`.
|
|
|
|
{{< highlight python >}}
|
|
class Test:
|
|
def __str__(self) -> str:
|
|
return ""
|
|
|
|
value = Test()
|
|
if value != "":
|
|
print("Foo")
|
|
{{< /highlight >}}
|
|
|
|
```
|
|
Foo
|
|
```
|
|
|
|
Wait, what? It printed `Foo`?! Oh, yeah. Right. All objects of a different type
|
|
than str, and even objects of type str if they are not the empty string will
|
|
make this condition evaluate to True. They are not equal to the empty string.
|
|
That is exactly what is expressed by the condition.
|
|
|
|
If what you really mean is that the value must be a string and not the empty
|
|
string, then, you need to write an even more explicit version of this:
|
|
|
|
{{< highlight python >}}
|
|
if isinstance(value, str) and value != "":
|
|
{{< /highlight >}}
|
|
|
|
(note: using `isinstance()` is prefered to `type(value) is str` just in case
|
|
someone got the extra-weird idea to subclass `str`; don't do that.
|
|
Unfortunately, `type(value) is str` would have been so much more readable)
|
|
|
|
## A Word About Efficiency
|
|
|
|
To my very own surprise, there is a significant difference of performance
|
|
between the four answers. The slowest take twice as much time as the fastest
|
|
(2.19 times more, to be precise).
|
|
|
|
It is in the realm of micro-optimizations but since the difference is so large,
|
|
I think the result is worth including in this post.
|
|
|
|
For 10 million iterations, run 100 times to clear out statistical noise, the
|
|
average results are:
|
|
|
|
Answer | Time | Ratio
|
|
--------|-------|-------
|
|
First | 1.28s | 1
|
|
Second | 2.81s | 2.19
|
|
Third | 1.92s | 1.5
|
|
Fourth | 1.52s | 1.19
|
|
|
|
# Conclusion
|
|
|
|
As I stated in the poll itself, all four of the answers are totally valid
|
|
Python code. However some answers will make full-time Python developers abiding
|
|
by the PEP-8 recommendations cringe. Some even implemented tools that would
|
|
whine specifically on one of the answers. That's the case of the third answer
|
|
that will make pylint, a Python code linter, complain about the use of `len()`
|
|
to test if an sequence is empty (because a `str` is also an character
|
|
sequence):
|
|
|
|
```
|
|
Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
|
|
```
|
|
|
|
We also saw during the discussion about the advantages and drawbacks of the
|
|
first answer that if one makes a proper use of type hinting and static code
|
|
analysis (e.g. mypy), there may be little code safety problems with it; just a
|
|
readability issue because the intent of the developers about which use cases
|
|
are covered remains obscure.
|
|
|
|
Overall, my personal belief is that the fourth answer (especially when guarded
|
|
with the `isinstance()` call) should be prefered because the developer's intent
|
|
is explicit and one gets exactly what they asked for, without any of the magic
|
|
offered by the language.
|
|
|
|
---
|
|
Please post your comments as answers on [Twitter](https://twitter.com/X_Cli_Public/status/1110275080221806593) or on the [Fediverse](https://infosec.exchange/@x_cli/101813137227224905)!
|
|
|