Broken-by-Design/posts/python-string-emptiness.md

15 KiB

title date draft categories tags lang
Python String Emptiness Test 2019-03-25T00:00:00+00:00 false
python
python
secureprogramming
en

Security specialists have this tendency to focus on the most mundane things, and overthink them to the point where they may actually find something interesting to say about them.

Is that a useful use of their time? That question is open to debate ;p

But I believe writing about them IS a useful use of my time on a gray Sunday morning!

Python String Emptiness Test

A while back, I entered a lengthy debate about condition expressions in Python with one of my colleagues during a merge conflict. We both wrote a utility function expressing the exact same condition, but we had written it in a different way.

He wrote this:

{{< highlight python >}} def should_i_do_smth(): # ... return bool(some_string) and some_int == 0

{{< /highlight >}}

I wrote that:

{{< highlight python >}} def should_i_do_smth(): # ... return len(some_string) != 0 and some_int == 0 {{< /highlight >}}

The only use case of that function was in an if condition:

{{< highlight python >}} if should_i_do_smth():

{{< /highlight >}}

Whose code was merged is not really relevant, but I decided to create a Twitter poll to probe my followers on the topic of writing emptiness tests of Python strings.

The poll was:

According to you, which of the four following lines of #Python code (poll) is the most self-explanatory and the safest to express that the string variable "value" is not empty? All answers are valid Python.

  • if value:
  • if bool(value):
  • if len(value) != 0:
  • if value != '':

Regarding the poll itself

The wording of the poll is vague on purpose. Depending on the reader, some of the words have different meanings. That was a deliberate choice. It is very hard to express formal things in a natural language. So, even if you have clear instructions, and documentation, and code comments, in the end, what really matters is what was on the mind of the dev when they wrote that line of code.

People who answered the poll may have read differently:

  • Safest: did I mean that the code must be robust (i.e. not crash) or secure (i.e. act consistently and fail during static analysis)?

  • String variable: did I mean str or Optional[str], only Unicode string ('') or also bytestring (b'')? Was there a test to control the type? Or static typing analysis?

  • Not empty: did I mean that we should refuse the empty string only, or should we also refuse None?

  • Self-explanatory: did I mean that the code should be readable by Python developers abiding to PEP8 and the Zen, or should the code be readable by most people, including security auditors or junior developers that may not be knowledgeable of PEP8 best current practices. Is readable equivalent to explicit and unambiguous?

Of course, the poll results are heavily biased. Most of my followers are security specialists, not software engineers, and certainly not full-time Python developers.

At any rate, I find the results quite interesting, because a majority of people did NOT follow the PEP-8 recommendation when asked this question.

At the time of writing, with 54 voters, the poll results is:

  • if value: 38%
  • if bool(value): 4%
  • if len(value) != 0: 34%
  • if value != '': 24%

Answer analysis

Each of the poll answers is valid Python, but they all also have merits.

First Answer

if value: might be readable by most full-time Python developers because it is the Python way, recommended by PEP-8. It works because the empty string '' is evaluated to False in the context of a condition expression.

To some extent, it is comparable to the following C code :

{{< highlight c >}} char * value; ... instructions ... if (value && *value) { {{< /highlight >}}

This code plays on the fact that a NULL pointer values 0, which is evaluated to False in C. It also plays on the fact that if the string is empty, the first character will be the NULL byte (\0), which is also evaluated to False. This type of syntax requires your reader to be knowledgeable about the language. Thus, if they are, it is very readable. If they are not, then the jury is out.

But the real issue of this syntax is that there is absolutely no type checking whatsoever. That's not really a problem if you embrace Python's duck typing and you are well prepared for the consequences of that choice. If you only consider the poll question as it is formulated, and the code is properly guarded so that you are certain that the value variable is a Unicode string, this approach bares no risks.

But what if I lied and the value variable is not a string? What if it is a bytestring? The None value? An integer? An arbitrary object? What if the code is not properly guarded?

I mean, this can happen easily when you refactor your code or when you meddle with some code you don't fully understand. Sure, unit tests and static analysis are valid answers to refactoring problems... but let's be honest here: correctly written unit tests are not frequent in the startup industry.

And mistakes happen. One of my followers told me a "war story":

{{< highlight python >}} if i: {{< /highlight >}}

Yeah, that's it ^_^ A colleague of them wrote this test to guard against the None value... but i was an integer. And some day, that integer equaled 0. If you don't know, 0, in Python, is evaluated as False in a condition expression. Oops.

But let's say that this integer war story is a corner case. What would happen if the value variable was in fact an instance of a type implementing __bool__ or __len__?

In Python, truth evaluation on arbitrary object is done by trying to call __bool__ that should return the truth value for this object. If __bool__ is not implemented, __len__ is tried. If __len__ returns 0, then the evaluation result is False, else True. Finally, if __len__ is not implemented, then it returns True for variable that are not None nor a scalar that evaluates to None.

{{< highlight python >}} class Test: def bool(self): print('>>Bool!') return True

value = Test() if value(): print('>>Hey!') {{< /highlight >}}

>>Bool!
>>Hey!

{{< highlight python >}} class Test: def len(self): print('>>Len!') return 1

value = Test() if value: print('>>Hey!') {{< /highlight >}}

>>Len!
>>Hey!

{{< highlight python >}} class Test: def len(self): print('>>Len!') return 0

value = Test() if value(): print('>>Hey!') {{< /highlight >}}

>>Len!

{{< highlight python >}} class Test: pass

value = Test() if value: print('>>Hey!') {{< /highlight >}}

>>Hey!

With such a high risk of confusion on the actual type of the object, I would personally bet that there is a very high probability of having some unstable code that will assume something about what the value variable is capable of, resulting in the infamous AttributeError: 'Foo' object has no attribute 'bar'. Or worse: no crash and a very stupid result.

You understood it, this all boils down to the discussion about typing, and the fact that the more checks (notably typing checks) are done statically, the more robust the code. And that is a fact that many languages have caught on. PHP, Python, Javascript, Perl, almost all interpreted languages (with the exception of Ruby 😂) have adopted a type hinting syntax to help static analyzers verify the code sanity. In Python, PEP-484 defines the syntax and you can use MyPy to do the checking.

So, if you hinted that the type of value was str before the if value: statement, then your code is probably safe and robust if you use a static analyzer.

{{< highlight python >}} def foo(value: str): {{< /highlight >}}

If you do not have a type hint and you don't use a static analyzer, then your code is not readable because your intention about what is acceptable is not explicit. Thus, and your code is not robust/safe because you will probably end up crashing.

Second Answer

The second answer is nonsensical and I admittedly did a poor job at representing the opinion of my colleague when writing this poll. Let's put it on the fact that I was forced to use their solution.

if bool(value): is stupid because we already are in the context of a boolean evaluation, so the bool casting is redundant. However, their opinion is not without ground in other contexts.

For instance, in the following code, what would you replace XXX with?

{{< highlight python >}} def foo() -> Tuple[str, int]: """some instructions""" # ...

def bar() -> XXX: s, i = foo() return s and i {{< /highlight >}}

If you answered bool, you are wrong. XXX must be replaced by Union[str, int].

error: Incompatible return value type (got "Union[str, int]", expected "bool")

That's because Python logical operators do not return booleans! In the case of the and operator, it returns the first value that cannot be evaluated to True.

Thus, if you meant the bar function to return a boolean, you need to cast the whole expression as bool before returning it.

{{< highlight python >}} def bar() -> bool: s, i = foo() return bool(s and i) {{< /highlight >}}

You can also only cast elements that are not comparisons, because, at least, comparisons do return booleans:

{{< highlight python >}} def bar() -> bool: s, i = foo return bool(s) and i != 0 {{< /highlight >}}

The fact that Python logical operators return arbitrary values is very surprising to people that come from some other languages, such as C or even PHP, where the logical operators return a truth value. For people coming from very strongly typed languages such as Ocaml or Rust, this is difficult to even consider, because logical operators are typed and would only accept booleans as operands in the first place.

{{< highlight bash >}} $ echo '''#include <stdio.h>

int main(){ printf("> %d\n", 2 && 3); return 0; }''' | gcc -o /tmp/condition -x c -

$ /tmp/condition

1

$ php -r 'echo 2 && 3;' 1 {{< /highlight >}}

Once you consider that you need to explicitly cast to bool all of your values in a logical expression to output a boolean, the readability and robustness/safety arguments are identical to those of the first answer.

Third Answer

The third answer is the first that is not relying on the Python language specification to express a condition. It explicitly states which particular characteristic of the content of the value variable is studied and must not be equal to 0 and it returns an unambiguous result: a truth value: True or False.

Regarding code safety, the result is mixed. It is the only answer out of the four that will crash the program during runtime if value is the None value.

TypeError: object of type 'NoneType' has no len()

Depending on the purpose of the program, crashing early may or may not be desirable. A ex-colleague working on machine learning told me that they would prefer to have a program crash early because of a type bug like this, than to have the program run for hours only to crash later with a AttributeError: 'NoneType' object has no attribute 'foo'.

Of course, the best option is to test the program, and check the parameters against the API contract so that this situation never occurs. But what if the test is broken (for instance, because it was written with the syntax from the first answer) and it let through a value that was not intended?

Unfortunately, this answer is also susceptible to cause late crashes because of duck typing. The len() function is only a keyword that calls on the __len__ implementation of the receiver (that is the object provided as len()'s argument. This means that this test won't guard against objects implementing __len__, just like str, but not being a string. The perfect builtin example would be the bytestring.

{{< highlight python >}} value = b'bonjour' if len(value) != 0: l = a.split('o') {{< /highlight >}}

TypeError: a bytes-like object is required, not 'str'

All in all, this answer is relatively explicit in what is tested and what's the intended result. But even if an early crash might be desirable, a crash is not certain because duck typing will let invalid objects go through.

Fourth Answer

The fourth answer is pretty straightforward. The meaning is crystal clear: it evaluates to True as long as value is not the empty Unicode string. Anybody gifted with eyes can read this expression and understand what is the intent of the author.

It won't be abused by the empty bytestring. It won't be abused by the None value. It won't be abused by objects implementing __str__.

{{< highlight python >}} class Test: def str(self) -> str: return ""

value = Test() if value != "": print("Foo") {{< /highlight >}}

Foo

Wait, what? It printed Foo?! Oh, yeah. Right. All objects of a different type than str, and even objects of type str if they are not the empty string will make this condition evaluate to True. They are not equal to the empty string. That is exactly what is expressed by the condition.

If what you really mean is that the value must be a string and not the empty string, then, you need to write an even more explicit version of this:

{{< highlight python >}} if isinstance(value, str) and value != "": {{< /highlight >}}

(note: using isinstance() is prefered to type(value) is str just in case someone got the extra-weird idea to subclass str; don't do that. Unfortunately, type(value) is str would have been so much more readable)

A Word About Efficiency

To my very own surprise, there is a significant difference of performance between the four answers. The slowest take twice as much time as the fastest (2.19 times more, to be precise).

It is in the realm of micro-optimizations but since the difference is so large, I think the result is worth including in this post.

For 10 million iterations, run 100 times to clear out statistical noise, the average results are:

Answer Time Ratio
First 1.28s 1
Second 2.81s 2.19
Third 1.92s 1.5
Fourth 1.52s 1.19

Conclusion

As I stated in the poll itself, all four of the answers are totally valid Python code. However some answers will make full-time Python developers abiding by the PEP-8 recommendations cringe. Some even implemented tools that would whine specifically on one of the answers. That's the case of the third answer that will make pylint, a Python code linter, complain about the use of len() to test if an sequence is empty (because a str is also an character sequence):

Do not use `len(SEQUENCE)` to determine if a sequence is empty (len-as-condition)

We also saw during the discussion about the advantages and drawbacks of the first answer that if one makes a proper use of type hinting and static code analysis (e.g. mypy), there may be little code safety problems with it; just a readability issue because the intent of the developers about which use cases are covered remains obscure.

Overall, my personal belief is that the fourth answer (especially when guarded with the isinstance() call) should be prefered because the developer's intent is explicit and one gets exactly what they asked for, without any of the magic offered by the language.


Please post your comments as answers on Twitter or on the Fediverse!