15 KiB
title | date | draft | categories | tags | |||
---|---|---|---|---|---|---|---|
Python String Emptiness Test | 2019-03-25T00:00:00+00:00 | false |
|
|
Security specialists have this tendency to focus on the most mundane things, and overthink them to the point where they may actually find something interesting to say about them.
Is that a useful use of their time? That question is open to debate ;p
But I believe writing about them IS a useful use of my time on a gray Sunday morning!
Python String Emptiness Test
A while back, I entered a lengthy debate about condition expressions in Python with one of my colleagues during a merge conflict. We both wrote a utility function expressing the exact same condition, but we had written it in a different way.
He wrote this:
{{< highlight python >}} def should_i_do_smth(): # ... return bool(some_string) and some_int == 0
{{< /highlight >}}
I wrote that:
{{< highlight python >}} def should_i_do_smth(): # ... return len(some_string) != 0 and some_int == 0 {{< /highlight >}}
The only use case of that function was in an if condition:
{{< highlight python >}} if should_i_do_smth():
{{< /highlight >}}
Whose code was merged is not really relevant, but I decided to create a Twitter poll to probe my followers on the topic of writing emptiness tests of Python strings.
The poll was:
According to you, which of the four following lines of #Python code (poll) is the most self-explanatory and the safest to express that the string variable "value" is not empty? All answers are valid Python.
if value:
if bool(value):
if len(value) != 0:
if value != '':
Regarding the poll itself
The wording of the poll is vague on purpose. Depending on the reader, some of the words have different meanings. That was a deliberate choice. It is very hard to express formal things in a natural language. So, even if you have clear instructions, and documentation, and code comments, in the end, what really matters is what was on the mind of the dev when they wrote that line of code.
People who answered the poll may have read differently:
-
Safest: did I mean that the code must be robust (i.e. not crash) or secure (i.e. act consistently and fail during static analysis)?
-
String variable: did I mean
str
orOptional[str]
, only Unicode string (''
) or also bytestring (b''
)? Was there a test to control the type? Or static typing analysis? -
Not empty: did I mean that we should refuse the empty string only, or should we also refuse None?
-
Self-explanatory: did I mean that the code should be readable by Python developers abiding to PEP8 and the Zen, or should the code be readable by most people, including security auditors or junior developers that may not be knowledgeable of PEP8 best current practices. Is readable equivalent to explicit and unambiguous?
Of course, the poll results are heavily biased. Most of my followers are security specialists, not software engineers, and certainly not full-time Python developers.
At any rate, I find the results quite interesting, because a majority of people did NOT follow the PEP-8 recommendation when asked this question.
At the time of writing, with 54 voters, the poll results is:
if value:
38%if bool(value):
4%if len(value) != 0:
34%if value != '':
24%
Answer analysis
Each of the poll answers is valid Python, but they all also have merits.
First Answer
if value:
might be readable by most full-time Python developers because it is
the Python way, recommended by PEP-8. It works because the empty string ''
is
evaluated to False in the context of a condition expression.
To some extent, it is comparable to the following C code :
{{< highlight c >}} char * value; ... instructions ... if (value && *value) { {{< /highlight >}}
This code plays on the fact that a NULL
pointer values 0, which is evaluated
to False
in C. It also plays on the fact that if the string is empty, the
first character will be the NULL
byte (\0
), which is also evaluated to
False
. This type of syntax requires your reader to be knowledgeable about the
language. Thus, if they are, it is very readable. If they are not, then the
jury is out.
But the real issue of this syntax is that there is absolutely no type checking
whatsoever. That's not really a problem if you embrace Python's duck typing and
you are well prepared for the consequences of that choice. If you only consider
the poll question as it is formulated, and the code is properly guarded so that
you are certain that the value
variable is a Unicode string, this approach
bares no risks.
But what if I lied and the value variable is not a string? What if it is a bytestring? The None value? An integer? An arbitrary object? What if the code is not properly guarded?
I mean, this can happen easily when you refactor your code or when you meddle with some code you don't fully understand. Sure, unit tests and static analysis are valid answers to refactoring problems... but let's be honest here: correctly written unit tests are not frequent in the startup industry.
And mistakes happen. One of my followers told me a "war story":
{{< highlight python >}} if i: {{< /highlight >}}
Yeah, that's it ^_^ A colleague of them wrote this test to guard against the
None
value... but i
was an integer. And some day, that integer equaled 0.
If you don't know, 0, in Python, is evaluated as False
in a condition
expression. Oops.
But let's say that this integer war story is a corner case. What would happen
if the value variable was in fact an instance of a type implementing __bool__
or __len__
?
In Python, truth evaluation on arbitrary object is done by trying to call
__bool__
that should return the truth value for this object. If __bool__
is
not implemented, __len__
is tried. If __len__
returns 0, then the
evaluation result is False
, else True
. Finally, if __len__
is not
implemented, then it returns True
for variable that are not None
nor a
scalar that evaluates to None
.
{{< highlight python >}} class Test: def bool(self): print('>>Bool!') return True
value = Test() if value(): print('>>Hey!') {{< /highlight >}}
>>Bool!
>>Hey!
{{< highlight python >}} class Test: def len(self): print('>>Len!') return 1
value = Test() if value: print('>>Hey!') {{< /highlight >}}
>>Len!
>>Hey!
{{< highlight python >}} class Test: def len(self): print('>>Len!') return 0
value = Test() if value(): print('>>Hey!') {{< /highlight >}}
>>Len!
{{< highlight python >}} class Test: pass
value = Test() if value: print('>>Hey!') {{< /highlight >}}
>>Hey!
With such a high risk of confusion on the actual type of the object, I would
personally bet that there is a very high probability of having some unstable
code that will assume something about what the value
variable is capable of,
resulting in the infamous AttributeError: 'Foo' object has no attribute 'bar'
. Or worse: no crash and a very stupid result.
You understood it, this all boils down to the discussion about typing, and the fact that the more checks (notably typing checks) are done statically, the more robust the code. And that is a fact that many languages have caught on. PHP, Python, Javascript, Perl, almost all interpreted languages (with the exception of Ruby 😂) have adopted a type hinting syntax to help static analyzers verify the code sanity. In Python, PEP-484 defines the syntax and you can use MyPy to do the checking.
So, if you hinted that the type of value was str
before the if value:
statement, then your code is probably safe and robust if you use a static
analyzer.
{{< highlight python >}} def foo(value: str): {{< /highlight >}}
If you do not have a type hint and you don't use a static analyzer, then your code is not readable because your intention about what is acceptable is not explicit. Thus, and your code is not robust/safe because you will probably end up crashing.
Second Answer
The second answer is nonsensical and I admittedly did a poor job at representing the opinion of my colleague when writing this poll. Let's put it on the fact that I was forced to use their solution.
if bool(value):
is stupid because we already are in the context of a boolean
evaluation, so the bool casting is redundant. However, their opinion is not
without ground in other contexts.
For instance, in the following code, what would you replace XXX with?
{{< highlight python >}} def foo() -> Tuple[str, int]: """some instructions""" # ...
def bar() -> XXX: s, i = foo() return s and i {{< /highlight >}}
If you answered bool
, you are wrong. XXX must be replaced by Union[str, int]
.
error: Incompatible return value type (got "Union[str, int]", expected "bool")
That's because Python logical operators do not return booleans! In the case of
the and
operator, it returns the first value that cannot be evaluated to
True
.
Thus, if you meant the bar function to return a boolean, you need to cast the whole expression as bool before returning it.
{{< highlight python >}} def bar() -> bool: s, i = foo() return bool(s and i) {{< /highlight >}}
You can also only cast elements that are not comparisons, because, at least, comparisons do return booleans:
{{< highlight python >}} def bar() -> bool: s, i = foo return bool(s) and i != 0 {{< /highlight >}}
The fact that Python logical operators return arbitrary values is very surprising to people that come from some other languages, such as C or even PHP, where the logical operators return a truth value. For people coming from very strongly typed languages such as Ocaml or Rust, this is difficult to even consider, because logical operators are typed and would only accept booleans as operands in the first place.
{{< highlight bash >}} $ echo '''#include <stdio.h>
int main(){ printf("> %d\n", 2 && 3); return 0; }''' | gcc -o /tmp/condition -x c -
$ /tmp/condition
1
$ php -r 'echo 2 && 3;' 1 {{< /highlight >}}
Once you consider that you need to explicitly cast to bool all of your values in a logical expression to output a boolean, the readability and robustness/safety arguments are identical to those of the first answer.
Third Answer
The third answer is the first that is not relying on the Python language
specification to express a condition. It explicitly states which particular
characteristic of the content of the value
variable is studied and must not
be equal to 0 and it returns an unambiguous result: a truth value: True or
False.
Regarding code safety, the result is mixed. It is the only answer out of the four that will crash the program during runtime if value is the None value.
TypeError: object of type 'NoneType' has no len()
Depending on the purpose of the program, crashing early may or may not be
desirable. A ex-colleague working on machine learning told me that they would
prefer to have a program crash early because of a type bug like this, than to
have the program run for hours only to crash later with a AttributeError: 'NoneType' object has no attribute 'foo'
.
Of course, the best option is to test the program, and check the parameters against the API contract so that this situation never occurs. But what if the test is broken (for instance, because it was written with the syntax from the first answer) and it let through a value that was not intended?
Unfortunately, this answer is also susceptible to cause late crashes because of
duck typing. The len()
function is only a keyword that calls on the __len__
implementation of the receiver (that is the object provided as len()
's
argument. This means that this test won't guard against objects implementing
__len__
, just like str
, but not being a string. The perfect builtin example
would be the bytestring.
{{< highlight python >}} value = b'bonjour' if len(value) != 0: l = a.split('o') {{< /highlight >}}
TypeError: a bytes-like object is required, not 'str'
All in all, this answer is relatively explicit in what is tested and what's the intended result. But even if an early crash might be desirable, a crash is not certain because duck typing will let invalid objects go through.
Fourth Answer
The fourth answer is pretty straightforward. The meaning is crystal clear: it
evaluates to True
as long as value
is not the empty Unicode string. Anybody
gifted with eyes can read this expression and understand what is the intent of
the author.
It won't be abused by the empty bytestring. It won't be abused by the None
value. It won't be abused by objects implementing __str__
.
{{< highlight python >}} class Test: def str(self) -> str: return ""
value = Test() if value != "": print("Foo") {{< /highlight >}}
Foo
Wait, what? It printed Foo
?! Oh, yeah. Right. All objects of a different type
than str, and even objects of type str if they are not the empty string will
make this condition evaluate to True. They are not equal to the empty string.
That is exactly what is expressed by the condition.
If what you really mean is that the value must be a string and not the empty string, then, you need to write an even more explicit version of this:
{{< highlight python >}} if isinstance(value, str) and value != "": {{< /highlight >}}
(note: using isinstance()
is prefered to type(value) is str
just in case
someone got the extra-weird idea to subclass str
; don't do that.
Unfortunately, type(value) is str
would have been so much more readable)
A Word About Efficiency
To my very own surprise, there is a significant difference of performance between the four answers. The slowest take twice as much time as the fastest (2.19 times more, to be precise).
It is in the realm of micro-optimizations but since the difference is so large, I think the result is worth including in this post.
For 10 million iterations, run 100 times to clear out statistical noise, the average results are:
Answer | Time | Ratio |
---|---|---|
First | 1.28s | 1 |
Second | 2.81s | 2.19 |
Third | 1.92s | 1.5 |
Fourth | 1.52s | 1.19 |
Conclusion
As I stated in the poll itself, all four of the answers are totally valid
Python code. However some answers will make full-time Python developers abiding
by the PEP-8 recommendations cringe. Some even implemented tools that would
whine specifically on one of the answers. That's the case of the third answer
that will make pylint, a Python code linter, complain about the use of len()
to test if an sequence is empty (because a str
is also an character
sequence):
Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)
We also saw during the discussion about the advantages and drawbacks of the first answer that if one makes a proper use of type hinting and static code analysis (e.g. mypy), there may be little code safety problems with it; just a readability issue because the intent of the developers about which use cases are covered remains obscure.
Overall, my personal belief is that the fourth answer (especially when guarded
with the isinstance()
call) should be prefered because the developer's intent
is explicit and one gets exactly what they asked for, without any of the magic
offered by the language.
Please post your comments as answers on Twitter or on the Fediverse!