Tuesday, March 27, 2018

Why Do Data Scientists Prefer Python?

I usually write code in Java and C. So, to start answering this question, I performed a simple test by writing the same code in three programming languages: C, Java, and Python. My goal was to understand the performance differences among the three approaches, specifically regarding CPU and memory usage.

I selected an algorithm that could also be easily ported to Python: retrieving the second-largest value among three random positive integers. In order to get valid timing measurements, this process was repeated 1,000,000 times. (+)

It must be pointed out that this is not a real benchmark test, because that wasn't what I was looking for. Moreover, the code does not perform any input/output or network operations.
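The original source files are not included in this post (see note (+) below), but the core of the Python version might have looked roughly like the following sketch. The function names, the value range passed to `randint`, and the overall structure are my assumptions:

```python
import random
import sys
import time

def second_max(a, b, c):
    """Return the second-largest of three positive integers."""
    # The sum minus the largest and the smallest leaves the middle value.
    return a + b + c - max(a, b, c) - min(a, b, c)

def run(iterations):
    """Time `iterations` repetitions of the test described above."""
    start = time.time()
    for _ in range(iterations):
        second_max(random.randint(1, 100),
                   random.randint(1, 100),
                   random.randint(1, 100))
    return time.time() - start

if __name__ == "__main__":
    # e.g. python secondmax_p.py 1000000
    iterations = int(sys.argv[1]) if len(sys.argv) > 1 else 1000000
    print("Elapsed: %f s" % run(iterations))
```

Note that the loop body is dominated by random-number generation and function-call overhead, which is exactly the kind of tight CPU-bound work where interpreted Python pays the highest price.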

All tests were done in the same environment(!):
$ uname -a
Linux <this> 4.15.10-300.fc27.x86_64 #1 SMP Thu Mar 15 17:13:04 UTC 2018 x86_64 x86_64 x86_64 GNU/Linux 


The executables were created by using the following tools:
C:
$ gcc --version
gcc (GCC) 7.3.1 20180303 (Red Hat 7.3.1-5)

Java:
$ javac -version
javac 1.8.0_161
 
$ java -version
openjdk version "1.8.0_161"
OpenJDK Runtime Environment (build 1.8.0_161-b14)
OpenJDK 64-Bit Server VM (build 25.161-b14, mixed mode)

Python:
$ python --version
Python 2.7.14


The sizes of the programs, excluding the Java Virtual Machine and the Python interpreter, are:
C: 8568 bytes, about 35 lines of code.
Java: 2721 bytes, about 40 lines of code.
Python: 1486 bytes, about 30 lines of code.

The commands to run the programs were:
C:
$ ./secondmax_c 1000000
Java:
$ java -jar SecondMax.jar 1000000
Python:
$ python secondmax_p.py 1000000

The results, in seconds, are:
C:
Best time: 0.038512 s
Worst time: 0.041267 s
Average time of 10 samples: 0.039393 s (*)
Java:
Best time: 0.190164966 s
Worst time: 0.213463167 s
Average time of 10 samples: 0.1982906219 s (*)
Python:
Best time: 14.326186 s
Worst time: 14.948839 s
Average time of 10 samples: 14.6719912 s (*)


So, these tests show that the worst time in C was around 347 times faster than the best time in Python, and the worst time in Java was around 67 times faster than the best time in Python. Why, then, do data scientists prefer Python?
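The speedup ratios can be checked directly from the timings listed above:

```python
# Speedups derived from the timings above: worst C/Java vs. best Python.
python_best = 14.326186   # best Python time, in seconds
c_worst = 0.041267        # worst C time
java_worst = 0.213463167  # worst Java time

print(int(round(python_best / c_worst)))     # -> 347
print(int(round(python_best / java_worst)))  # -> 67
```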

The answer is straightforward, and it might sound pretty obvious, even before testing: Python is the easiest.

C and Java demand professional programming skills that are acquired through a personal learning process that can sometimes be slow.

In contrast, a data scientist can understand and develop acceptable Python code much more quickly.

Furthermore, Python has many libraries and solutions that can be freely included in applications, reducing development time and, therefore, the person-hours needed to get a valuable answer to the different issues.

Anyway, I still have two open questions:
1.- Which option offers the best performance from a system administrator's point of view?
2.- Which one is the right choice if the target environment of the application is an embedded system? For example: a 454 MHz ARM9 CPU with only 128 MB of RAM?

Notes:
(+) If you want to test my code in your environment, just write a comment below and I'll send it.
(!) To avoid comparing apples to oranges, I have omitted from this post the results of testing the three programs in different environments. I also tested the performance of the C version in an embedded system (a 454 MHz ARM9 CPU with only 128 MB of RAM) and the performance of the PySpark version in a Cloudera 5.12.0-0 cluster.
(*) The average was included only because it is a standard measurement. I think the "worst time" value is the most accurate predictor of the real response-time capabilities of an executable program.