HBase is a distributed, column-oriented, open source database modeled on the Google paper "Bigtable: A Distributed Storage System for Structured Data" by Fay Chang et al. Just as Bigtable builds on the distributed data storage provided by the Google File System, HBase provides Bigtable-like capabilities on top of Hadoop. HBase started as a subproject of Apache's Hadoop project. Unlike a general relational database, HBase is suited to storing unstructured data. Another difference is that HBase uses a column-based rather than a row-based model.
When do you need HBase?
Semi-structured or unstructured data
HBase suits data whose fields are not fixed in advance, or are too disorderly to fit a single predefined schema. Suppose that, as the business grows, the author's email, phone, and address also need to be stored: an RDBMS needs downtime for maintenance to change the schema, whereas HBase supports adding columns dynamically.
Very sparse records
The number of columns in an RDBMS row is fixed, and null columns waste storage space. As mentioned above, HBase does not store columns that are null, which saves space and improves read performance.
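The sparse-row idea can be sketched in plain Java. This is only an illustrative model, not HBase's storage code: the class name SparseRow and the sample values are assumptions made up for the example.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a sparse row: only columns that were written are stored.
class SparseRow {
    // Column key ("family:qualifier") -> value, kept sorted by column key.
    private final Map<String, String> cells = new TreeMap<>();

    // Writing a column that no other row has needs no schema change.
    void put(String column, String value) {
        cells.put(column, value);
    }

    // Returns null for a column that was never written; nothing is stored for it.
    String get(String column) {
        return cells.get(column);
    }

    int storedCells() {
        return cells.size();
    }

    public static void main(String[] args) {
        SparseRow row1 = new SparseRow();
        row1.put("author:name", "alice");              // made-up sample value
        row1.put("author:email", "alice@example.com"); // column added on the fly

        SparseRow row2 = new SparseRow();
        row2.put("author:name", "bob");                // made-up sample value

        System.out.println(row1.storedCells());        // 2
        System.out.println(row2.storedCells());        // 1
        System.out.println(row2.get("author:email"));  // null
    }
}
```

The second row pays nothing for the email column it never set, whereas a fixed-width RDBMS row would carry a null there.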
Multi-version data
As mentioned above, a value located by row key and column key can have any number of versions, so HBase is very convenient for data whose change history must be kept. For example, the author's address in the example above will change. The business generally needs only the latest value, but sometimes it has to query historical values.
Large data volume
As the data volume keeps growing, a single RDBMS instance can no longer cope, so a read-write separation strategy is introduced: one master handles writes and multiple slaves handle reads, which multiplies server costs. As the load increases further, the master cannot cope either. At that point the database is split, and loosely related data is deployed on separate databases; some join queries then stop working, and a middle layer is needed. As the data grows further still, individual tables hold more and more records and queries become very slow, so the tables have to be split too, for example by taking the ID modulo the number of tables to reduce the records in each table. Anyone who has been through this knows how painful the process is. With HBase it is simple: just add machines, and HBase splits the data horizontally on its own. Its seamless integration with Hadoop provides data reliability (HDFS) and high-performance analysis of massive data (MapReduce).
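For contrast, the manual "split the table by ID modulo" step described above might look like the following sketch. ShardRouter and the table naming scheme blog_0, blog_1, and so on are hypothetical, chosen only to illustrate the routing logic an application has to carry itself.

```java
// Hypothetical helper showing the manual modulo routing an RDBMS table split needs.
class ShardRouter {
    private final String baseTable;
    private final int shardCount;

    ShardRouter(String baseTable, int shardCount) {
        if (shardCount <= 0) {
            throw new IllegalArgumentException("shardCount must be positive");
        }
        this.baseTable = baseTable;
        this.shardCount = shardCount;
    }

    // Picks the physical table for a record ID; floorMod keeps the index non-negative.
    String tableFor(long id) {
        return baseTable + "_" + Math.floorMod(id, shardCount);
    }

    public static void main(String[] args) {
        ShardRouter router = new ShardRouter("blog", 4);
        System.out.println(router.tableFor(10)); // blog_2
        System.out.println(router.tableFor(7));  // blog_3
    }
}
```

Changing the shard count sends most IDs to a different table, which is part of why re-splitting is so laborious when done by hand.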
Some basic concepts of HTable
Row key
The row key is the primary key of a row. HBase does not support conditional queries or ORDER BY queries; records can be read only by row key, by a row key range, or by a full table scan. The row key therefore needs to be designed around the business so as to exploit the sorted storage (a table is sorted by row key in lexicographic order, for example 1, 10, 100, 11, 2) and improve performance.
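The lexicographic ordering, and one common way to design around it, can be sketched with a sorted map from the JDK. This is an analogy rather than HBase code: the TreeMap stands in for the sorted table, and the zero-padding helper padded is an assumption of the example (it expects non-negative IDs).

```java
import java.util.SortedMap;
import java.util.TreeMap;

class RowKeyOrder {
    // Left-pads a non-negative ID with zeros so lexicographic order matches numeric order.
    static String padded(long id) {
        return String.format("%010d", id);
    }

    public static void main(String[] args) {
        long[] ids = {1, 2, 10, 11, 100};

        // String keys sort lexicographically, like the example in the text.
        TreeMap<String, String> table = new TreeMap<>();
        for (long id : ids) {
            table.put(Long.toString(id), "row " + id);
        }
        System.out.println(table.keySet()); // [1, 10, 100, 11, 2]

        // With fixed-width keys, the sort order follows the numeric order.
        TreeMap<String, String> paddedTable = new TreeMap<>();
        for (long id : ids) {
            paddedTable.put(padded(id), "row " + id);
        }

        // Range "scan": start key inclusive, stop key exclusive.
        SortedMap<String, String> range = paddedTable.subMap(padded(2), padded(100));
        System.out.println(range.values()); // [row 2, row 10, row 11]
    }
}
```

Because neighbouring keys are stored next to each other, a range read only touches the rows between the start and stop keys, which is why the key layout matters so much.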
Column Family
Column families are declared when the table is created, and each column family is a storage unit. In the example above, an HBase table named blog was designed with two column families: article and author.
Column
Each HBase column belongs to a column family and is prefixed with the family name: the columns article:title and article:content belong to the article column family, and author:name and author:nickname belong to the author column family.
Columns do not need to be defined with the table; they can be added dynamically. The columns of one column family are stored together in one storage unit, sorted by column key, so columns with the same I/O characteristics should be placed in the same column family to improve performance.
Timestamp
HBase identifies a piece of data by row and column. Its value may have multiple versions, which are sorted in reverse chronological order, so the newest comes first and the latest version is returned by default. In the example above, the author:nickname value of row key=1 has two versions: 1317180070811, corresponding to "Yiye Dujiang", and 1317180718830, corresponding to "yedu" (in business terms, the nickname was changed to yedu at some point, but the old value still exists). The timestamp defaults to the current system time (accurate to milliseconds) and can also be specified when writing data.
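The versioning behaviour described here can be modeled with a map sorted newest-first. The sketch below is an assumption-laden illustration, not the HBase client API: VersionedCell and its methods latest and asOf are hypothetical names.

```java
import java.util.Comparator;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of one cell (row + column) that keeps several timestamped versions.
class VersionedCell {
    // Timestamp -> value, sorted in reverse chronological order (newest first).
    private final TreeMap<Long, String> versions =
            new TreeMap<Long, String>(Comparator.reverseOrder());

    // Write with an explicit timestamp.
    void put(long timestamp, String value) {
        versions.put(timestamp, value);
    }

    // Write with the default timestamp: current system time in milliseconds.
    void put(String value) {
        put(System.currentTimeMillis(), value);
    }

    // Default read: the latest version, or null if nothing was written.
    String latest() {
        return versions.isEmpty() ? null : versions.firstEntry().getValue();
    }

    // Historical read: the newest version written at or before the given timestamp.
    String asOf(long timestamp) {
        Map.Entry<Long, String> entry = versions.ceilingEntry(timestamp);
        return entry == null ? null : entry.getValue();
    }

    public static void main(String[] args) {
        VersionedCell nickname = new VersionedCell();
        nickname.put(1317180070811L, "Yiye Dujiang");
        nickname.put(1317180718830L, "yedu");

        System.out.println(nickname.latest());             // yedu
        System.out.println(nickname.asOf(1317180070811L)); // Yiye Dujiang
    }
}
```

Because the map is ordered newest-first, ceilingEntry returns the entry with the largest timestamp that does not exceed the one requested.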
Value
Each value is uniquely indexed by four keys: tableName+RowKey+ColumnKey+Timestamp => value. For example, the only value indexed by {tableName='blog', RowKey='1', ColumnName='author:nickname', Timestamp='1317180718830'} is "yedu".
Storage type
TableName is a string
RowKey and ColumnName are binary values (Java type byte[])
Timestamp is a 64-bit integer (Java type long)
Value is a byte array (Java type byte[]).
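Since keys and values are raw byte arrays, the application has to convert its own types to and from bytes. The following sketch does this with the JDK only; CellBytes and its helper names are assumptions of the example, and the comparison uses Arrays.compareUnsigned, which requires Java 9 or later.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class CellBytes {
    // Encodes a string (for example a row key or column name) as UTF-8 bytes.
    static byte[] toBytes(String s) {
        return s.getBytes(StandardCharsets.UTF_8);
    }

    // Encodes a 64-bit value as 8 big-endian bytes (ByteBuffer's default byte order).
    static byte[] toBytes(long v) {
        return ByteBuffer.allocate(Long.BYTES).putLong(v).array();
    }

    // Decodes 8 big-endian bytes back into a long.
    static long toLong(byte[] b) {
        return ByteBuffer.wrap(b).getLong();
    }

    // Decodes UTF-8 bytes back into a string.
    static String asString(byte[] b) {
        return new String(b, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] value = toBytes("yedu");
        long timestamp = 1317180718830L;

        System.out.println(asString(value));              // yedu
        System.out.println(toLong(toBytes(timestamp)));   // 1317180718830

        // Byte-wise comparison of the row keys "10" and "2": "10" sorts first.
        System.out.println(Arrays.compareUnsigned(toBytes("10"), toBytes("2")) < 0); // true
    }
}
```

The last line shows why row keys compare as bytes rather than as numbers: the first byte of "10" is smaller than the first byte of "2".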
The storage structure of an HTable can be interpreted as a nested sorted map. That is, an HTable is automatically sorted by row key; each row contains any number of columns, which are automatically sorted by column key; and each column contains any number of values. Understanding the storage structure helps when iterating over query results.
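One way to picture this structure is as nested sorted maps. The sketch below is an in-memory model built on that assumption, not how HBase is implemented: HTableModel and its methods are hypothetical, one instance stands for a single table, and the article:title value is a made-up sample.

```java
import java.util.Comparator;
import java.util.NavigableMap;
import java.util.TreeMap;

// Hypothetical in-memory model of one table:
// row key -> (column key -> (timestamp, newest first -> value)).
class HTableModel {
    private final NavigableMap<String, NavigableMap<String, NavigableMap<Long, String>>> rows =
            new TreeMap<>();

    void put(String rowKey, String column, long timestamp, String value) {
        rows.computeIfAbsent(rowKey, k -> new TreeMap<>())
            .computeIfAbsent(column, k -> new TreeMap<Long, String>(Comparator.reverseOrder()))
            .put(timestamp, value);
    }

    // Looks up a value by row key, column key and exact timestamp; null if absent.
    String get(String rowKey, String column, long timestamp) {
        NavigableMap<String, NavigableMap<Long, String>> columns = rows.get(rowKey);
        if (columns == null) {
            return null;
        }
        NavigableMap<Long, String> versions = columns.get(column);
        return versions == null ? null : versions.get(timestamp);
    }

    // Walks rows, then columns, then versions, each level in its sorted order.
    void dump() {
        rows.forEach((rowKey, columns) ->
            columns.forEach((column, versions) ->
                versions.forEach((timestamp, value) ->
                    System.out.println(rowKey + " " + column + " " + timestamp + " " + value))));
    }

    public static void main(String[] args) {
        HTableModel blog = new HTableModel();
        blog.put("1", "author:nickname", 1317180070811L, "Yiye Dujiang");
        blog.put("1", "author:nickname", 1317180718830L, "yedu");
        blog.put("1", "article:title", 1317180070811L, "sample title"); // made-up value

        System.out.println(blog.get("1", "author:nickname", 1317180718830L)); // yedu
        blog.dump();
    }
}
```

The dump visits article:title before author:nickname, and within author:nickname it prints the newer version "yedu" before "Yiye Dujiang", which mirrors the three levels of ordering described above.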