Runtime Comparisons
The performance of a database driver depends on many factors: the hardware, the network, the properties of the sample data set, and so on. You are strongly encouraged to collect your own performance measurements before making any important decisions.
In this performance scenario, the following database drivers are compared:

- PyODBC
- TurbODBC
- PyExasol
An Exasol cluster with the following specifications was used:

- 20 nodes
- 800+ CPU cores with Hyper-Threading
- 14 TB of RAM
- 10 Gbit private network connections
- 1 Gbit public network connections
Three different scenarios were evaluated for each of the database drivers:

1. Fetching a data set with low randomness, using a server in the same data center.
2. Fetching a data set with high randomness, using a server in the same data center.
3. Fetching a data set from a local laptop behind a VPN and a Wi-Fi connection (slow network).
For each scenario, the table contained 10 million rows with mixed data types. The bash command `time` was used to measure the total execution duration in seconds.
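The benchmark's actual data set is not shown here, but the difference between the "low random" and "high random" scenarios can be illustrated with a small hypothetical generator (all names and column choices below are illustrative, not the benchmark's own): low randomness means few distinct values per column, which compresses and parses more cheaply than mostly-distinct values.

```python
import random
import string

def make_rows(n, cardinality):
    """Generate n rows of mixed types (int, str, float).

    `cardinality` controls how many distinct string values appear:
    a small pool models the "low random" scenario, a large pool
    models the "high random" scenario.
    """
    pool = ["".join(random.choices(string.ascii_lowercase, k=8))
            for _ in range(cardinality)]
    return [
        (i,                                   # integer id
         random.choice(pool),                 # string column
         round(random.uniform(0, 1000), 2))   # decimal column
        for i in range(n)
    ]

low_random = make_rows(1_000, cardinality=10)      # few distinct strings
high_random = make_rows(1_000, cardinality=1_000)  # mostly distinct strings
```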
Results

Note: All results are recorded in seconds. An empty cell means the scenario was not measured for that driver.
| Scenario | Low random | High random | Slow network |
|---|---|---|---|
| PyODBC - fetchall | 106 | 107 | |
| TurbODBC - fetchall | 56 | 55 | |
| PyExasol - fetchall | 32 | 39 | 126 |
| PyExasol - fetchall+zlib | | | 92 |
| TurbODBC - fetchallnumpy | 15 | 15 | |
| TurbODBC - fetchallarrow | 14 | 14 | |
| PyExasol - export_to_pandas | 11 | 21 | 77 |
| PyExasol - export_to_pandas+zlib | 28 | 53 | 29 |
| PyExasol - export_parallel | 5 | 7 | |
Conclusions

- PyODBC's performance is poor (no surprise).
- PyExasol standard fetching is faster than TurbODBC, but mostly because it performs fewer operations on Python objects and relies on zip() magic.
- TurbODBC optimized fetching as numpy or arrow is very efficient and consistent, well done!
- PyExasol export to pandas performance may vary depending on the randomness of the data set, since it relies heavily on the pandas CSV reader.
- PyExasol fetch and export with zlib compression work very well for slow networks, but hurt performance on fast networks.
- PyExasol parallel export beats everything else because it fully utilizes multiple CPU cores.
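The zlib trade-off noted above can be demonstrated with Python's standard `zlib` module (the data below is synthetic, not the benchmark data set): repetitive, low-randomness data shrinks dramatically, so far fewer bytes cross a slow network, while high-randomness data barely shrinks, so on a fast network compression only adds CPU overhead.

```python
import os
import zlib

# Repetitive data, analogous to a "low random" result set.
low_random = b"2023-01-01;ACTIVE;100.00\n" * 10_000

# Incompressible-looking data, analogous to a "high random" result set.
high_random = os.urandom(len(low_random))

low_ratio = len(zlib.compress(low_random)) / len(low_random)
high_ratio = len(zlib.compress(high_random)) / len(high_random)

# Repetitive data compresses to a tiny fraction of its size;
# random data stays roughly the same size.
print(f"low-random ratio:  {low_ratio:.3f}")
print(f"high-random ratio: {high_ratio:.3f}")
```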
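The idea behind parallel export is that each worker process consumes its own data stream, so parsing work is spread across CPU cores instead of being serialized in one Python process. The sketch below mimics that fan-out pattern with the standard library only; it does not use PyExasol's actual API, and the chunk contents and function names are purely illustrative.

```python
from concurrent.futures import ProcessPoolExecutor

def parse_chunk(lines):
    """Stand-in for per-worker CSV parsing. In a real parallel export,
    each worker would consume one data stream from the cluster."""
    return sum(int(line.split(",")[1]) for line in lines)

def main():
    # Pretend each chunk is the CSV stream one worker would receive.
    chunks = [[f"{i},{i * 2}" for i in range(w * 100, (w + 1) * 100)]
              for w in range(4)]
    # Separate processes let the parsing use multiple CPU cores.
    with ProcessPoolExecutor(max_workers=4) as pool:
        totals = list(pool.map(parse_chunk, chunks))
    return sum(totals)

if __name__ == "__main__":
    print(main())  # -> 159600
```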
How to Run Your Own Performance Measurements

It is strongly encouraged that you collect your own performance measurements. You may use the scripts provided with PyExasol as a starting point. Make sure to use your production Exasol cluster for the measurements. Please do not use Exasol running locally in Docker; that defeats the whole point of evaluating performance.
1. Install PyODBC, TurbODBC, PyExasol, and pandas.
2. Install the Exasol ODBC driver.
3. Download the PyExasol source code.
4. Open the /performance/ directory and edit the file _config.py. Enter your Exasol credentials, set the table name and other settings, and set the path to the ODBC driver.
5. (Optional) Run the script to prepare the data set for later use in collecting performance metrics:

```
python 00_prepare.py
```
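The exact setting names in _config.py are defined by the file shipped with PyExasol; the fragment below is only a hypothetical illustration of the kind of values to fill in. Every name and path here is an assumption, so check the actual file.

```python
# Hypothetical _config.py values. The real file in the PyExasol
# repository defines the authoritative setting names; treat every
# name and path below as an illustrative assumption.
dsn = "myexasol:8563"            # host:port of your Exasol cluster
user = "benchmark_user"
password = "secret"
schema = "benchmark"
table_name = "performance_test"  # table with ~10 million mixed-type rows
odbc_driver = "/opt/exasol/lib/libexaodbc.so"  # path to the Exasol ODBC driver
```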
That's all. Now you can run the examples in any order, like common Python scripts. For example:

```
time python 03_pyexasol_fetch.py
```