Bucketing
when required -
At times, even after partitioning on a particular field or fields, the partitioned file size doesn’t match with the actual expectation and remains huge
which allows user to divide table data sets into more manageable parts.
The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket.
CLUSTERED BY clause is used to divide the table into buckets.
Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. Since the data files are equal sized parts, map-side joins will be faster on the bucketed tables.
The above hive.enforce.bucketing = true property sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (Which is ‘4’ in our case) and automatically selects the clustered by column from table definition.
1. Creating a staging table to store your data
CREATE EXTERNAL TABLE IF NOT EXISTS randomizer_stats1 (
bintime date,
eventtime string,
epochtime bigint,
a string,
b string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED as TEXTFILE
LOCATION '/user/root/test1';
MSCK REPAIR TABLE randomizer_stats;
2. Create bucketed table
CREATE TABLE IF NOT EXISTS randomizer_stats_bucket (
bintime date,
eventtime string,
epochtime bigint,
a string,
b string
)
CLUSTERED BY (a) into 4 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073';
3. Load data from stagingtbl to bucketed table
from randomizer_stats1 insert into table randomizer_stats_bucket select *;
Table root.randomizer_stats_bucket stats: [numFiles=16, numRows=2051091, totalSize=4093092732, rawDataSize=4091041641]
Bucketing can only be in internal table
4. Check how many data file have created in Hive metastore.
266811192 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000000_0
-rwxrwxrwt 3 root hive 266534042 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000001_0
-rwxrwxrwt 3 root hive 266568650 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000002_0
-rwxrwxrwt 3 root hive 266532503 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000003_0
-rwxrwxrwt 3 root hive 267073918 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000004_0
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not
when required -
At times, even after partitioning on a particular field or fields, the partitioned file size doesn’t match with the actual expectation and remains huge
which allows user to divide table data sets into more manageable parts.
The Bucketing concept is based on Hash function, which depends on the type of the bucketing column. Records which are bucketed by the same column will always be saved in the same bucket.
CLUSTERED BY clause is used to divide the table into buckets.
Bucketing concept also provides the flexibility to keep the records in each bucket to be sorted by one or more columns. Since the data files are equal sized parts, map-side joins will be faster on the bucketed tables.
The above hive.enforce.bucketing = true property sets the number of reduce tasks to be equal to the number of buckets mentioned in the table definition (Which is ‘4’ in our case) and automatically selects the clustered by column from table definition.
1. Creating a staging table to store your data
CREATE EXTERNAL TABLE IF NOT EXISTS randomizer_stats1 (
bintime date,
eventtime string,
epochtime bigint,
a string,
b string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073'
STORED as TEXTFILE
LOCATION '/user/root/test1';
MSCK REPAIR TABLE randomizer_stats;
2. Create bucketed table
CREATE TABLE IF NOT EXISTS randomizer_stats_bucket (
bintime date,
eventtime string,
epochtime bigint,
a string,
b string
)
CLUSTERED BY (a) into 4 buckets ROW FORMAT DELIMITED FIELDS TERMINATED BY '\073';
3. Load data from stagingtbl to bucketed table
from randomizer_stats1 insert into table randomizer_stats_bucket select *;
Table root.randomizer_stats_bucket stats: [numFiles=16, numRows=2051091, totalSize=4093092732, rawDataSize=4091041641]
Bucketing can only be in internal table
4. Check how many data file have created in Hive metastore.
266811192 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000000_0
-rwxrwxrwt 3 root hive 266534042 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000001_0
-rwxrwxrwt 3 root hive 266568650 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000002_0
-rwxrwxrwt 3 root hive 266532503 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000003_0
-rwxrwxrwt 3 root hive 267073918 2017-08-30 11:10 /user/hive/warehouse/root.db/randomizer_stats_bucket/000004_0
A Bloom filter is a space-efficient probabilistic data structure, conceived by Burton Howard Bloom in 1970, that is used to test whether an element is a member of a set. False positive matches are possible, but false negatives are not
Comments
Post a Comment