With the rise of big data and data science, storage and retrieval have become a critical pipeline component for data use and analysis. Recently, new data storage technologies have emerged. But the question is: Which one should you choose? Which one is best suited for data engineering?


last year

Data engineering is becoming a huger problem as we access/need to more and more data. The analysis is based on three type of non-typical-SQL databases (as there is much said about these already): Search engine (Elastic Search), Document-based (MongoDB) and column-based (Amazon Redshift).

  • Indexing
    • ES: text-based and flexible due to different analyzers
    • Mongo: by default only on _id but has flexibility with indexing methods, optimal querying for space reduction
    • Redshift: it doesn't have indexing but leverages column ordering to optimize retrieval
  • Sharding
    • ES: Based on lucene and creates a shard which is allocated by the clust in a host automatically. Scales horizontally in production by design and that's why it's so adopted. Master sends the query to each shard before aggregating the results back
    • MongoDB: 3 types of servers: shard, router and config. config notifies router who distributes the query and aggregates results.
    • Redshift: One leader node and several compute nodes. Leader doesn't scale horizontally but precompiles and caches common queries.
  • Aggregation
    • ES: bucketed by exact, temporal, geolocation or ranged values. Permits nested aggregation and allows metrics for each layer. Suffers from intra-document field comparisons. Scripts can be given as custom predicated but it's slow.
    • MongoDB: Has an aggregation pipeline which operates in a stage-wise fashion. Some operations have to be passed to the master node such as group so it doesn't leverage distributed computing. Also suffers from intra-document fields comparison
    • Redshift: Easier migration from SQL, and support intra-row comparisons. Optimized for read-operations.