Hive Optimization Notes

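  • Column pruning: hive.optimize.cp=true (default: true)
  • Partition pruning: hive.optimize.pruner=true (default: true)
  • JOIN: put the small table on the left. Hive buffers the earlier tables of a join in reducer memory and streams the last one, so the largest table should come last.

    If every JOIN uses the same key, the joins are merged into a single Map-Reduce job no matter how many tables are involved: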

    INSERT OVERWRITE TABLE pv_users
     SELECT pv.pageid, u.age FROM page_view pv
     JOIN user u ON (pv.userid = u.userid)
     JOIN newuser x ON (u.userid = x.userid);
    

    If the join keys differ, the joins cannot be merged and run as separate Map-Reduce jobs:

     INSERT OVERWRITE TABLE pv_users
        SELECT pv.pageid, u.age FROM page_view pv
        JOIN user u ON (pv.userid = u.userid)
        JOIN newuser x on (u.age = x.age);
    
    The query above has the same effect as the two statements below:
    
     INSERT OVERWRITE TABLE tmptable
        SELECT * FROM page_view pv JOIN user u
        ON (pv.userid = u.userid);
     INSERT OVERWRITE TABLE pv_users
        SELECT x.pageid, x.age FROM tmptable x
        JOIN newuser y ON (x.age = y.age);
    
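  • GROUP BY: aggregate on the map side

    hive.map.aggr=true enables map-side aggregation (default: true).
    hive.groupby.mapaggr.checkinterval=100000 sets the number of rows after which the size of the map-side aggregation state is checked (default: 100000).

    For skewed data, set hive.groupby.skewindata=true to balance the load (default: false). Hive then runs two Map-Reduce jobs: the first spreads map output randomly across reducers for partial aggregation, and the second performs the final aggregation by GROUP BY key.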
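
    As a rough sketch, the GROUP BY settings might be applied in a Hive session as shown here. The page_view table comes from the join examples above; the aggregation query is only an illustration and is not part of the original notes. hive.groupby.skewindata is switched on under the assumption that pageid is heavily skewed.

     -- Aggregate partially in the mapper before the shuffle.
     SET hive.map.aggr=true;
     -- Check the map-side aggregation state every 100000 rows.
     SET hive.groupby.mapaggr.checkinterval=100000;
     -- Assumption: pageid is skewed, so let Hive use its two-job plan.
     SET hive.groupby.skewindata=true;

     SELECT pageid, COUNT(*) AS views
     FROM page_view
     GROUP BY pageid;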

MapReduce Optimization

http://www.idryman.org/blog/2014/03/05/hadoop-performance-tuning-best-practices/

1. Memory tuning


mapred.child.java.opts
-Xms1024M -Xmx2048M

2. Minimize the map disk spill

  • Compress the mapper output
  • Use 70% of heap memory for the mapper's spill buffer (io.sort.mb)

    <property>
       <name>mapred.compress.map.output</name>
       <value>true</value>
    </property>
    <property>
        <name>mapred.map.output.compression.codec</name>
        <value>com.hadoop.compression.lzo.LzoCodec</value>
    </property>
    <property>
        <name>io.sort.mb</name>
        <value>800</value>
    </property>
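
    If the job is submitted from Hive rather than configured through the XML files, the same three properties could be set per session, roughly as follows. This is a sketch that assumes the LZO codec is installed on the cluster. The comments give the property names used by Hadoop 2 and later, where the names above are deprecated.

     -- Hadoop 2+: mapreduce.map.output.compress
     SET mapred.compress.map.output=true;
     -- Hadoop 2+: mapreduce.map.output.compress.codec
     SET mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
     -- Hadoop 2+: mapreduce.task.io.sort.mb
     SET io.sort.mb=800;

    The sort buffer is allocated from the task heap, so io.sort.mb has to stay below the -Xmx value given in mapred.child.java.opts.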
    

3. Tuning mapper tasks

4. Minimize your mapper output

  • Filter out records on mapper side, not on reducer side.
  • Use minimal data to form your map output key and map output value.
  • Extend the BinaryComparable class or use Text for your map output key
  • Set mapper output to be compressed
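
    When the job is written in HiveQL rather than as a hand-coded mapper, the first two points above roughly correspond to filtering in the WHERE clause and selecting only the columns the result needs. The query below is an illustrative sketch on the page_view table from the Hive notes; the IS NOT NULL filter is a made-up condition.

     -- The WHERE filter runs in the map phase, so dropped rows never reach the shuffle.
     -- Only pageid is carried as the map output key; the unused columns are pruned.
     SELECT pageid, COUNT(*) AS views
     FROM page_view
     WHERE userid IS NOT NULL
     GROUP BY pageid;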