AWS DynamoDB Study Notes

DynamoDB's design ideas come from Amazon's paper, Dynamo: Amazon’s Highly Available Key-value Store (2007), which is regarded as a landmark of NoSQL.

In his blog post Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications, Werner Vogels (Amazon CTO) covers the history behind DynamoDB's design, including the earlier SimpleDB, and lists several design goals:

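Fast (fast)
Managed (good)
Scalable (good)
Durable and Highly Available (good)
Flexible (good)
Low cost (cheap)

In short: fast, easy to manage, hard to break, flexible, and most importantly, cheap. Isn't that exactly the picture below? XDD

DynamoDB pulled it off!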

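Forrester Wave: Big Data NoSQL, Q3 2016: DynamoDB came out well ahead of the other cloud NoSQL services.

Anyway, what follows is a summary of DynamoDB's key concepts and how it works under the hood. All text and figures come from the official documentation, the DynamoDB Developer Guide. (It feels a bit like translation practice XD)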

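Core Components

DynamoDB is often compared with MongoDB, and the concepts are very similar:

Tables:
- Similar to a table in an RDBMS.
- A DynamoDB table is a collection of stored data.
- Equivalent to a MongoDB collection.
Items:
- Each table can hold multiple items; an item is equivalent to a row in an RDBMS.
- Each item can contain multiple attributes.
- Equivalent to a MongoDB document.
Attributes:
- Each item is made up of one or more attributes.
- The available attribute data types are listed under Data Type below.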

Primary Key

DynamoDB supports two kinds of primary keys:

Partition key: also called the hash attribute. A single attribute serves as the primary key (unique key). DynamoDB passes its value through an internal hash function, and the hash output determines which physical storage partition holds the item. In a table that has only a partition key, no two items can have the same partition key value.
Partition key and sort key: a composite key made of two attributes = partition key + sort key = hash + range.
- The sort key is also called the range attribute.
- The most common example is a combination such as unique key + date range.
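
As a rough illustration of the two key types, the sketch below builds the parameter object that the CreateTable API expects. The table name `Orders`, the attribute names `CustomerId` and `OrderDate`, the throughput values, and the helper `buildCreateTableParams` are all made up for this example; only the field names (`KeySchema`, `AttributeDefinitions`, `ProvisionedThroughput`) and the `HASH` / `RANGE` key types come from the API.

```javascript
// Hypothetical helper: builds CreateTable parameters for a table whose
// primary key is a partition key plus an optional sort key.
function buildCreateTableParams(tableName, partitionKey, sortKey) {
  const params = {
    TableName: tableName,
    AttributeDefinitions: [{ AttributeName: partitionKey, AttributeType: 'S' }],
    KeySchema: [{ AttributeName: partitionKey, KeyType: 'HASH' }], // partition key
    ProvisionedThroughput: { ReadCapacityUnits: 5, WriteCapacityUnits: 5 } // example values
  };
  if (sortKey) {
    params.AttributeDefinitions.push({ AttributeName: sortKey, AttributeType: 'S' });
    params.KeySchema.push({ AttributeName: sortKey, KeyType: 'RANGE' }); // sort key
  }
  return params;
}

// Composite key: partition key (hash) + sort key (range).
const ordersTable = buildCreateTableParams('Orders', 'CustomerId', 'OrderDate');
console.log(JSON.stringify(ordersTable.KeySchema));
```

With the AWS SDK loaded, the resulting object would be passed to `dynamodb.createTable(params, callback)`.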

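Secondary Indexes

Besides the primary key, a table can have one or more secondary indexes. When these notes were written the limit was five GSIs and five LSIs per table; the default GSI quota has since been raised to 20:

Global Secondary Indexes (GSI): an index whose partition key and sort key can both differ from those of the table.
Local Secondary Indexes (LSI): an index with the same partition key as the table but a different sort key.

Note that the primary key and the LSIs cannot be changed once the table has been created; GSIs can still be added or removed afterwards.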

Data Type

Scalar Types: number, string, binary, Boolean, and null.
Document Types: list and map.
Set Types: multiple scalar values; includes string set, number set, and binary set.
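
To make the type list concrete, here is a made-up item written in the attribute-value format used by the low-level API, where every value is wrapped in an object whose single key is the type descriptor. The attribute names and values are invented, and `typeOf` is a small helper written just for this sketch.

```javascript
// A made-up item in DynamoDB's attribute-value format; each value is
// wrapped in an object whose single key names the data type.
const item = {
  Id:       { N: '101' },                                        // number (sent as a string)
  Title:    { S: 'DynamoDB notes' },                             // string
  Cover:    { B: Buffer.from('cover') },                         // binary
  InStock:  { BOOL: true },                                      // Boolean
  Subtitle: { NULL: true },                                      // null
  Authors:  { L: [{ S: 'Alice' }, { S: 'Bob' }] },               // list
  Size:     { M: { Width: { N: '15' }, Height: { N: '21' } } },  // map
  Tags:     { SS: ['aws', 'nosql'] },                            // string set
  Editions: { NS: ['1', '2'] },                                  // number set
  Hashes:   { BS: [Buffer.from('a'), Buffer.from('b')] }         // binary set
};

// Hypothetical helper: return the type descriptor of an attribute value.
const typeOf = (attributeValue) => Object.keys(attributeValue)[0];

for (const [name, value] of Object.entries(item)) {
  console.log(name, typeOf(value));
}
```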

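Read Consistency

DynamoDB replicates data across the Availability Zones (AZs) of a region quickly, usually within one second. There are two read modes:

Eventually Consistent Reads (ECR): one read capacity unit allows two reads per second of up to 4 KB each.
Strongly Consistent Reads (SCR): one read capacity unit allows one read per second of up to 4 KB.

The difference between the two: an ECR might not reflect the result of a recently completed write, whereas an SCR always reflects the most recent successful write.

This is because DynamoDB spans multiple AZs within an AWS region, and each table is stored as three replicas.

The mode is chosen through the API, and the default is Eventually Consistent Reads. Here is a Node.js example: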

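var AWS = require('aws-sdk'); // AWS SDK for JavaScript (v2)
var dynamodb = new AWS.DynamoDB();

var params = {
  TableName: 'STRING_VALUE', /* required */
  Key: { /* required: the primary key of the item to read */
    'STRING_VALUE': { S: 'STRING_VALUE' }
  },
  ConsistentRead: true // true = strongly consistent; false (default) = eventually consistent
};
dynamodb.getItem(params, function(err, data) {
  if (err) console.log(err, err.stack); // an error occurred
  else console.log(data); // successful response
});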

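Provisioned Throughput

Every DynamoDB table has capacity unit settings for reads and writes, called Read Capacity Units (RCU) and Write Capacity Units (WCU).

Read Capacity Units (RCU): each unit covers reads of up to 4 KB
- One Strongly Consistent Read per second
- Two Eventually Consistent Reads per second
- An item larger than 4 KB needs additional RCUs
Write Capacity Units (WCU): each unit covers one write per second of up to 1 KB; larger items consume additional WCUs
Secondary indexes consume capacity units of their own

RCU and WCU determine performance, and you are billed for the capacity you provision.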
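
The arithmetic behind these rules can be sketched in a few lines. The function names `readUnits` and `writeUnits` are invented for this example, and the sketch covers a single item only, ignoring secondary indexes.

```javascript
const READ_UNIT_BYTES = 4 * 1024;  // one RCU covers a read of up to 4 KB
const WRITE_UNIT_BYTES = 1024;     // one WCU covers a write of up to 1 KB

// Capacity units consumed by reading one item of the given size.
function readUnits(itemBytes, stronglyConsistent) {
  const units = Math.ceil(itemBytes / READ_UNIT_BYTES);
  // An eventually consistent read costs half as much.
  return stronglyConsistent ? units : units / 2;
}

// Capacity units consumed by writing one item of the given size.
function writeUnits(itemBytes) {
  return Math.ceil(itemBytes / WRITE_UNIT_BYTES);
}

console.log(readUnits(6 * 1024, true));   // 2
console.log(readUnits(6 * 1024, false));  // 1
console.log(writeUnits(1536));            // 2
```

A 6 KB item rounds up to two 4 KB units, which is why keeping items small has a direct effect on the bill.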

DynamoDB read and write APIs:

Read:
- GetItem: retrieves a single item
- BatchGetItem: retrieves up to 100 items in one call
Write:
- PutItem / UpdateItem / DeleteItem: operate on a single item
- BatchWriteItem: puts or deletes up to 25 items in one call
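
Because of the 25-item limit, a larger write has to be split into several BatchWriteItem calls. The following is a minimal sketch of that splitting; `chunk` and `buildBatchWrites` are helpers invented here, and the `Orders` table and its items are placeholders.

```javascript
const BATCH_WRITE_LIMIT = 25; // max Put/Delete requests per BatchWriteItem call

// Split an array into consecutive chunks of at most `size` elements.
function chunk(items, size) {
  const chunks = [];
  for (let i = 0; i < items.length; i += size) {
    chunks.push(items.slice(i, i + size));
  }
  return chunks;
}

// Hypothetical helper: one BatchWriteItem parameter object per chunk.
function buildBatchWrites(tableName, items) {
  return chunk(items, BATCH_WRITE_LIMIT).map((group) => ({
    RequestItems: {
      [tableName]: group.map((item) => ({ PutRequest: { Item: item } }))
    }
  }));
}

const items = Array.from({ length: 60 }, (_, i) => ({ Id: { N: String(i) } }));
const batches = buildBatchWrites('Orders', items);
console.log(batches.map((b) => b.RequestItems.Orders.length)); // [ 25, 25, 10 ]
```

Each object would then go to `dynamodb.batchWriteItem(params, callback)`. A real caller should also check `UnprocessedItems` in the response and retry whatever is listed there.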

In addition, Reserved Capacity can be purchased for Provisioned Throughput.

Guidelines for Working with Tables

Partition Behavior of Table

A single partition provides at most 3,000 RCU / 1,000 WCU. If a table is created with 1,000 RCU / 500 WCU, the number of partitions needed is calculated as follows:

( RCU / 3,000 ) + ( WCU / 1,000 ) = initialPartitions (rounded up)
e.g., ( 1,000 / 3,000 ) + ( 500 / 1,000 ) = 0.8333 --> 1

So a single partition meets the requirement above. If RCU and WCU are both 1,000, the partitions needed are:

( 1,000 / 3,000 ) + ( 1,000 / 1,000 ) = 1.333 --> 2
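
The formula translates directly into code. This is just the arithmetic from the text wrapped in a function; the name `initialPartitions` is made up here, and the real allocation happens inside the service.

```javascript
const MAX_RCU_PER_PARTITION = 3000;
const MAX_WCU_PER_PARTITION = 1000;

// Initial partition count for a new table, rounded up.
function initialPartitions(rcu, wcu) {
  return Math.ceil(rcu / MAX_RCU_PER_PARTITION + wcu / MAX_WCU_PER_PARTITION);
}

console.log(initialPartitions(1000, 500));  // 1
console.log(initialPartitions(1000, 1000)); // 2
```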

Partition Split

A partition can store roughly 10 GB of data.

Either of the following conditions triggers a partition split:

Increasing provisioned throughput
Needing more storage space

Increased Provisioned Throughput Settings

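Suppose a table is created with 5,000 RCU and 2,000 WCU. It starts with 4 partitions, calculated as follows:

( 5,000 / 3,000 ) + ( 2,000 / 1,000 ) = 3.6667 --> 4

Each of the 4 partitions is allocated 1,250 RCU (5,000 / 4) and 500 WCU (2,000 / 4).

If the user raises RCU to 8,000, the formula gives ( 8,000 / 3,000 ) + ( 2,000 / 1,000 ) = 4.6667 --> 5, so the existing four partitions can no longer meet the requirement. DynamoDB automatically doubles the partition count to 8.

The data is then spread evenly across the new partitions, and the RCU / WCU of each partition becomes: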

RCU: 8000 / 8 = 1000
WCU: 2000 / 8 = 250
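
The numbers in this example can be reproduced with a small sketch. It assumes that the partition count keeps doubling until it covers the requirement, mirroring the 4 to 8 step described above; the function names are invented, and the service's actual behaviour may differ.

```javascript
// Partitions needed for a given throughput, per the formula in the text.
function requiredPartitions(rcu, wcu) {
  return Math.ceil(rcu / 3000 + wcu / 1000);
}

// Assumption: the partition count doubles until it covers the requirement.
function partitionsAfterIncrease(currentPartitions, rcu, wcu) {
  let partitions = Math.max(1, currentPartitions);
  while (partitions < requiredPartitions(rcu, wcu)) {
    partitions *= 2;
  }
  return partitions;
}

// Provisioned throughput is divided evenly among the partitions.
function perPartitionThroughput(partitions, rcu, wcu) {
  return { rcu: rcu / partitions, wcu: wcu / partitions };
}

const before = requiredPartitions(5000, 2000);             // 4
const after = partitionsAfterIncrease(before, 8000, 2000); // 8
console.log(perPartitionThroughput(before, 5000, 2000)); // { rcu: 1250, wcu: 500 }
console.log(perPartitionThroughput(after, 8000, 2000));  // { rcu: 1000, wcu: 250 }
```

Note the side effect: after the split each partition has less write capacity than before (250 instead of 500 WCU), even though only RCU was raised.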

Increased Storage Requirements

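When the data in a partition exceeds the 10 GB partition size, DynamoDB automatically adds a new partition.

The previous example ended with 8 partitions; if one of them grows beyond 10 GB, that partition is split in two.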

Use Burst Capacity Sparingly

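Each partition has a fixed share of RCU / WCU, and whatever goes unused is kept by the table as a buffer, so a sudden spike in demand (a burst) can be absorbed for a while.

DynamoDB retains up to five minutes (300 seconds) of unused RCU / WCU as burst capacity. During a burst, reads and writes can be served very quickly, even faster than the provisioned rate.

However, do not make burst RCU / WCU part of your design, because DynamoDB may also use this capacity for background maintenance tasks without notice.

Burst capacity details may change in the future.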
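
One way to picture burst capacity is as a bucket that fills with unused capacity and drains during spikes. The simulation below is a simplified model written for these notes, not a description of DynamoDB's internals: it works in whole seconds, treats the table as a single bucket, and `simulateBurst` is an invented name.

```javascript
const BURST_WINDOW_SECONDS = 300; // up to five minutes of unused capacity is retained

// Simplified model: each second, unused provisioned capacity is banked
// (up to the window limit) and demand above the provisioned rate draws on it.
function simulateBurst(provisioned, demandPerSecond) {
  const maxBanked = provisioned * BURST_WINDOW_SECONDS;
  let banked = 0;
  const throttled = [];
  for (const demand of demandPerSecond) {
    const available = provisioned + banked;
    const served = Math.min(demand, available);
    throttled.push(demand - served); // requests that could not be served this second
    banked = Math.min(maxBanked, available - served);
  }
  return { banked, throttled };
}

// 10 idle seconds at 100 units/s, then a one-second spike of 1,500 units.
const demand = new Array(10).fill(0).concat([1500]);
const result = simulateBurst(100, demand);
console.log(result.throttled[10]); // 400
console.log(result.banked);        // 0
```

In this model the ten idle seconds bank 1,000 units, so the spike gets 1,100 units served and the remaining 400 are throttled.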

Cache Popular Items

AWS officially recommends that frequently accessed data be served from an in-memory cache such as ElastiCache.
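
The idea can be sketched as a read-through cache. To keep the example self-contained it uses an in-process `Map` rather than a real cache service, and `createCache`, `fakeLoad`, and the 60-second TTL are all assumptions made for illustration; `loadItem` stands in for whatever function performs the actual DynamoDB read.

```javascript
// Minimal read-through cache with a time-to-live. `loadItem` stands in for
// a DynamoDB read (for example a GetItem call) and is supplied by the caller.
function createCache(loadItem, ttlMs, now = Date.now) {
  const entries = new Map(); // key -> { value, expiresAt }
  return async function get(key) {
    const hit = entries.get(key);
    if (hit && hit.expiresAt > now()) {
      return hit.value; // served from memory, no table read
    }
    const value = await loadItem(key);
    entries.set(key, { value, expiresAt: now() + ttlMs });
    return value;
  };
}

// Usage with a fake loader that counts how often the table would be read.
let tableReads = 0;
const fakeLoad = async (key) => { tableReads += 1; return { id: key }; };
const getItemCached = createCache(fakeLoad, 60 * 1000);

const done = (async () => {
  await getItemCached('popular');
  await getItemCached('popular'); // second call is a cache hit
  console.log(tableReads); // 1
})();
```

The trade-off is staleness: a cached item can lag behind the table for up to the TTL, much like an eventually consistent read.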

Development with DynamoDB

DynamoDB is accessed entirely through web service calls, so there is no notion of an RDBMS connection and therefore no connection pool to manage.
AWS provides a local version, DynamoDB Local, which requires JRE 6 or later. Usage:

wget http://dynamodb-local.s3-website-us-west-2.amazonaws.com/dynamodb_local_latest.tar.gz
tar xzf dynamodb_local_latest.tar.gz
java -Djava.library.path=./DynamoDBLocal_lib -jar DynamoDBLocal.jar -sharedDb
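
Once DynamoDB Local is running, the SDK client only needs to be pointed at it. The sketch below builds the client options as a plain object so it runs without the SDK installed. The helper `dynamoClientOptions`, the region, and the dummy credentials are assumptions for illustration; port 8000 is DynamoDB Local's default.

```javascript
// Hypothetical helper: client options for DynamoDB Local versus the real service.
function dynamoClientOptions(useLocal) {
  if (!useLocal) {
    return { region: 'us-west-2' }; // example region
  }
  return {
    region: 'us-west-2',               // example value; a region must still be set
    endpoint: 'http://localhost:8000', // DynamoDB Local's default port
    accessKeyId: 'local',              // dummy credentials (assumed not to be validated locally)
    secretAccessKey: 'local'
  };
}

console.log(dynamoClientOptions(true).endpoint);
// With the AWS SDK for JavaScript (v2) installed, the options would be used as:
//   var AWS = require('aws-sdk');
//   var dynamodb = new AWS.DynamoDB(dynamoClientOptions(true));
```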


When to Use It

AWS offers many ways to store data: S3, RDS, DynamoDB, Glacier, ElastiCache, HDFS, and so on. The AWS whitepaper Storage Options in the AWS Cloud explains them in detail.

For a quick overview, though, the figure below (from AWS Big Data) is a good reference:

References

DynamoDB Overview
Amazon DynamoDB Notes
DynamoDB In-Depth Experience (InfoQ, Simplified Chinese)
Interpreting Dynamo, a Landmark of NoSQL Technology (in Chinese)
Performance boost and cost savings for DynamoDB
Deep Dive on Amazon DynamoDB (YouTube, AWS Summit)
Dynamo: Amazon’s Highly Available Key-value Store: the theoretical basis of DynamoDB's design; its authors include Werner Vogels, the current Amazon CTO
Eventually Consistent – Revisited by Werner Vogels (Chinese translation available)
Amazon DynamoDB – a Fast and Scalable NoSQL Database Service Designed for Internet Scale Applications, by Werner Vogels
Brewer’s CAP Theorem, 1999 by Eric Brewer
Forrester Wave: Big Data NoSQL, Q3 2016

AWS Official Documentation
