- Column pruning: hive.optimize.cp=true (default: true)
- Partition pruning: hive.optimize.pruner=true (default: true)
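To make the effect of the two pruning settings concrete, here is a minimal plain-Python model of a scan. It is not Hive code: the `scan` function, the date-partitioned layout, and the `referrer` column are assumptions made up for illustration.

```python
def scan(table, columns, partition_filter):
    """Read only partitions that pass partition_filter, and only the listed columns."""
    rows_read = 0
    result = []
    for partition, rows in table.items():
        if not partition_filter(partition):
            continue  # partition pruning: this partition is never read
        for row in rows:
            rows_read += 1
            # column pruning: keep only the columns the query needs
            result.append({c: row[c] for c in columns})
    return result, rows_read


# Hypothetical table, partitioned by a date string.
page_view = {
    "2014-03-04": [{"userid": 1, "pageid": 10, "referrer": "a"}],
    "2014-03-05": [
        {"userid": 2, "pageid": 20, "referrer": "b"},
        {"userid": 3, "pageid": 30, "referrer": "c"},
    ],
}

rows, rows_read = scan(page_view, ["pageid"], lambda dt: dt == "2014-03-05")
print(rows, rows_read)
```

Only one partition is touched and only one column is carried forward, which is the saving both settings aim for.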
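- JOIN: put the small table on the left, because Hive buffers the earlier tables in the reducer and streams the last one.
If every join uses the same key, the joins are merged into a single Map-Reduce job, no matter how many tables are involved:
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid)
JOIN newuser x ON (u.userid = x.userid);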
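Why a shared key allows a single job can be sketched in plain Python: all tables are grouped by the same key in one pass (the "shuffle"), and each group is joined in one place. This is a rough model rather than Hive's implementation; `single_shuffle_join` and the `signup` column are hypothetical.

```python
from collections import defaultdict
from itertools import product


def single_shuffle_join(tables, key):
    """Inner-join every table on the same key using one grouping pass."""
    # "Map" phase: tag each row with its table index and group by the join key.
    groups = defaultdict(lambda: [[] for _ in tables])
    for idx, rows in enumerate(tables):
        for row in rows:
            groups[row[key]][idx].append(row)
    # "Reduce" phase: per key, combine one row from every table.
    joined = []
    for k in sorted(groups):
        # product() yields nothing if any table has no row for this key.
        for combo in product(*groups[k]):
            merged = {}
            for row in combo:
                merged.update(row)
            joined.append(merged)
    return joined


page_view = [{"userid": 1, "pageid": 10}, {"userid": 2, "pageid": 20}]
user = [{"userid": 1, "age": 30}, {"userid": 2, "age": 25}]
newuser = [{"userid": 1, "signup": "2014-03-01"}]

result = single_shuffle_join([page_view, user, newuser], "userid")
print(result)
```

Because every row is routed by `userid`, adding a fourth table joined on `userid` would still need only this one pass.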
If the join conditions differ, the joins are not merged and run as separate Map-Reduce jobs:
INSERT OVERWRITE TABLE pv_users
SELECT pv.pageid, u.age
FROM page_view pv
JOIN user u ON (pv.userid = u.userid)
JOIN newuser x ON (u.age = x.age);
The statement above has the same effect as the two statements below:
INSERT OVERWRITE TABLE tmptable
SELECT * FROM page_view pv JOIN user u ON (pv.userid = u.userid);
INSERT OVERWRITE TABLE pv_users
SELECT x.pageid, x.age FROM tmptable x JOIN newuser y ON (x.age = y.age);
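The two-step equivalence can be mimicked in plain Python, where each call to a hypothetical `shuffle_join` helper stands for one Map-Reduce job. The rows and the `newuser_id` column are invented for illustration; this is a sketch, not how Hive plans the query internally.

```python
from collections import defaultdict


def shuffle_join(left, right, left_key, right_key):
    """One simulated Map-Reduce job: group both inputs by key, then inner-join."""
    buckets = defaultdict(lambda: ([], []))
    for row in left:
        buckets[row[left_key]][0].append(row)
    for row in right:
        buckets[row[right_key]][1].append(row)
    out = []
    for left_rows, right_rows in buckets.values():
        for l in left_rows:
            for r in right_rows:
                out.append({**r, **l})  # left columns win on name clashes
    return out


page_view = [{"userid": 1, "pageid": 10}, {"userid": 2, "pageid": 20}]
user = [{"userid": 1, "age": 30}, {"userid": 2, "age": 25}]
newuser = [{"newuser_id": 7, "age": 30}]

# Job 1: join on userid (plays the role of tmptable).
tmptable = shuffle_join(page_view, user, "userid", "userid")
# Job 2: rows must be regrouped by age, so a second pass is required.
stage2 = shuffle_join(tmptable, newuser, "age", "age")
pv_users = [{"pageid": r["pageid"], "age": r["age"]} for r in stage2]
print(pv_users)
```

The second call cannot reuse the first grouping, because rows that share a `userid` do not necessarily share an `age`.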
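- GROUP BY: aggregate on the map side
hive.map.aggr=true (controls whether aggregation is done on the map side; default: true)
hive.groupby.mapaggr.checkinterval=100000 (number of rows after which the map-side aggregation check is performed)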
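The benefit of map-side aggregation can be shown with a small plain-Python model of a COUNT grouped by key. The function names are hypothetical and the check-interval logic is left out; the point is only how many records cross the shuffle.

```python
from collections import Counter


def map_without_aggregation(split):
    """Emit one (key, 1) record per input row."""
    return [(key, 1) for key in split]


def map_with_aggregation(split):
    """Pre-aggregate in the mapper: one (key, partial_count) per distinct key."""
    return list(Counter(split).items())


def reduce_counts(map_outputs):
    """Sum the partial counts coming from every mapper."""
    total = Counter()
    for output in map_outputs:
        for key, count in output:
            total[key] += count
    return dict(total)


splits = [["a", "a", "b"], ["a", "b", "b", "b"]]
plain = [map_without_aggregation(s) for s in splits]
combined = [map_with_aggregation(s) for s in splits]

shuffled_plain = sum(len(o) for o in plain)
shuffled_combined = sum(len(o) for o in combined)
print(shuffled_plain, shuffled_combined, reduce_counts(combined))
```

Both variants give the same final counts, but the pre-aggregating mappers send fewer records to the reducers.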
To balance the load when the data is skewed, set hive.groupby.skewindata=true (default: false)
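With this setting Hive runs the aggregation as two jobs: the first spreads rows randomly across reducers for partial aggregation, the second groups the partial results by key. A minimal plain-Python sketch of that idea follows; `two_stage_count`, the seed, and the sample keys are assumptions for illustration only.

```python
import random
from collections import Counter


def two_stage_count(keys, num_reducers, seed=0):
    """Count keys in two stages so that one hot key does not overload a reducer."""
    rng = random.Random(seed)
    # Stage 1: send each row to a random reducer and aggregate partially.
    stage1 = [Counter() for _ in range(num_reducers)]
    for key in keys:
        stage1[rng.randrange(num_reducers)][key] += 1
    stage1_loads = [sum(c.values()) for c in stage1]
    # Stage 2: route the (much smaller) partial counts by key and finish.
    stage2 = [Counter() for _ in range(num_reducers)]
    for partial in stage1:
        for key, count in partial.items():
            stage2[hash(key) % num_reducers][key] += count
    final = Counter()
    for c in stage2:
        final.update(c)
    return dict(final), stage1_loads


# One hot key dominates the input.
keys = ["hot"] * 1000 + ["k%d" % i for i in range(20)]
counts, loads = two_stage_count(keys, num_reducers=4)
print(counts["hot"], loads)
```

With plain hash partitioning a single reducer would receive all 1000 "hot" rows; here stage 1 spreads them out and stage 2 only sees a few partial counts per key.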
MapReduce optimization
http://www.idryman.org/blog/2014/03/05/hadoop-performance-tuning-best-practices/
1. Memory tuning
2. Minimize the map disk spill
- Compress mapper output
- Use 70% of heap memory for spill buffer in mapper
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzoCodec</value>
</property>
<property>
  <name>io.sort.mb</name>
  <value>800</value>
</property>
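The 70% rule is simple arithmetic, sketched below in Python. The helper names are hypothetical, and the 2047 MB ceiling is an assumption based on Hadoop rejecting io.sort.mb values of 2048 or more; verify it against your Hadoop version.

```python
# Assumed upper bound for io.sort.mb (check your Hadoop version).
MAX_IO_SORT_MB = 2047


def spill_buffer_mb(heap_mb, percent=70):
    """Return the spill buffer size in MB as a percentage of the mapper heap."""
    return min(heap_mb * percent // 100, MAX_IO_SORT_MB)


def property_xml(name, value):
    """Format one Hadoop configuration property."""
    return "<property> <name>%s</name> <value>%s</value> </property>" % (name, value)


heap_mb = 1024  # hypothetical mapper heap size
print(property_xml("io.sort.mb", spill_buffer_mb(heap_mb)))
```

The mapper heap must actually be large enough to hold the buffer, so the heap setting and io.sort.mb need to be raised together.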
3. Tuning mapper tasks
4. Minimize your mapper output
- Filter out records on mapper side, not on reducer side.
- Use minimal data to form your map output key and map output value.
- Extend the BinaryComparable class, or use Text, for your map output key
- Set mapper output to be compressed
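The first, second, and last points above can be illustrated with a small plain-Python model of a mapper. It is only a sketch: the log records and field names are invented, and zlib stands in for a codec such as LZO, which is not in the Python standard library.

```python
import zlib


def mapper(records):
    """Filter early and emit only the fields the reducer needs."""
    for rec in records:
        if rec["status"] != 200:  # filter on the mapper side
            continue
        yield rec["userid"], rec["bytes"]  # minimal key and value


def serialize(pairs):
    """Encode key/value pairs as tab-separated lines."""
    return "\n".join("%s\t%s" % pair for pair in pairs).encode("utf-8")


# Hypothetical access-log records.
records = [
    {"userid": 1, "status": 200, "bytes": 512, "url": "/index.html"},
    {"userid": 2, "status": 404, "bytes": 0, "url": "/missing.html"},
    {"userid": 3, "status": 200, "bytes": 128, "url": "/about.html"},
] * 100

map_output = list(mapper(records))
payload = serialize(map_output)
compressed = zlib.compress(payload)  # stand-in for compressed map output
print(len(records), "records in,", len(map_output), "pairs out")
print(len(payload), "bytes raw,", len(compressed), "bytes compressed")
```

Dropping rows and fields before the shuffle reduces what has to be sorted, spilled, and transferred; compression then shrinks what remains.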