Hadoop Job Interview Questions

2024-06-28 16:01:52

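Contributed by: a site user

The material below comes from the internet. Most of it consists of questions that candidates actually met in interviews. I have lightly edited some of the questions and filled in some of the blank answers. A few of the questions are honestly not well posed, but there are plenty of good ones, and all of them come from real interviews. I hope they help anyone who is about to interview or who wants to keep learning Hadoop and big data analysis.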

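1.0 Briefly describe how to install and configure an open-source Apache Hadoop. A description is enough; listing the concrete steps is even better.

Answer:

1 Log in as root

2 Set the IP address

3 Set the hostname and the hosts mapping

4 Configure passwordless SSH login

5 Disable the firewall

6 Install the JDK

7 Unpack the Hadoop distribution

8 Configure the core Hadoop files: hadoop-env.sh, core-site.xml, mapred-site.xml, hdfs-site.xml

9 Configure the Hadoop environment variables

10 Format the NameNode: hadoop namenode -format

11 Start the nodes: start-all.sh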

2.0 List the processes that must be running in a healthy Hadoop cluster and describe what each one does, in as much detail as you can.

Answer:

NameNode: manages the metadata of the file blocks in HDFS, responds to client requests, balances file blocks across the DataNodes, and maintains the replica count.

SecondaryNameNode: mainly performs checkpoints; it can also serve as a cold backup by taking snapshot-style backups of the metadata over a given period.

DataNode: stores data blocks and serves client I/O requests for them.

JobTracker: manages jobs and assigns tasks to the TaskTrackers.

TaskTracker: runs the tasks assigned by the JobTracker.

ResourceManager

NodeManager

JournalNode

ZooKeeper

ZKFC

3.0 Write the shell commands for the following

(1) Kill a job

(2) Delete the /tmp/aaa directory on HDFS

(3) Add a new storage node and remove a node

Answer:

(1) Run hadoop job -list to get the job ID, then run hadoop job -kill jobId to kill the job with that ID.

(2) hadoop fs -rmr /tmp/aaa

(3) To add a node, run the following on the new node:

hadoop-daemon.sh start datanode

hadoop-daemon.sh start tasktracker (on YARN: yarn-daemon.sh start nodemanager)

To take a DataNode offline, list its hostname in the excludes file under the conf directory, then run on the master node:

hadoop dfsadmin -refreshNodes

To take a TaskTracker/NodeManager offline, it is enough to run on the master node:

hadoop mradmin -refreshNodes (on YARN: yarn rmadmin -refreshNodes)

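4.0 List the Hadoop schedulers you know and briefly explain how each works

Answer:

FIFO scheduler: the default; first in, first out.

Capacity scheduler: picks the queue with the lowest resource usage and the job with the highest priority first, and so on down the list.

Fair scheduler: fair scheduling; all jobs get an equal share of the resources.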

5.0 List the languages you have used at work to develop MapReduce jobs

Answer: Java, Hive, and (Python, C++) through Hadoop Streaming

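6.0 The current log sample has this format

a , b , c , d

b , b , f , e

a , a , c , f

Write a MapReduce job in the language you know best that counts how often each element of the fourth column occurs

Answer:

import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount1 {

    public static final String INPUT_PATH = "hdfs://hadoop0:9000/in";
    public static final String OUT_PATH = "hdfs://hadoop0:9000/out";

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fileSystem = FileSystem.get(URI.create(OUT_PATH), conf);
        // Remove a stale output directory, otherwise the job refuses to start.
        if (fileSystem.exists(new Path(OUT_PATH))) {
            fileSystem.delete(new Path(OUT_PATH), true);
        }
        Job job = Job.getInstance(conf, WordCount1.class.getSimpleName());
        job.setJarByClass(WordCount1.class);
        // 1.0 Read the input and parse it into key/value pairs
        FileInputFormat.setInputPaths(job, new Path(INPUT_PATH));
        // 2.0 Apply our own logic to each input key/value pair and emit new pairs
        job.setMapperClass(MyMapper.class);
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);
        // 3.0 Partition the map output
        // 4.0 Sort and group each partition; values of the same key go into one collection
        // 5.0 Optionally combine the grouped data
        // 6.0 Copy the map output over the network to the reduce nodes
        // 7.0 Apply our own reduce logic to the map output
        job.setReducerClass(MyReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);
        FileOutputFormat.setOutputPath(job, new Path(OUT_PATH));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }

    public static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private static final LongWritable ONE = new LongWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable k1, Text v1, Context context)
                throws IOException, InterruptedException {
            // Fields are comma-separated, e.g. "a , b , c , d".
            String[] fields = v1.toString().split(",");
            if (fields.length < 4) {
                return; // skip malformed lines
            }
            // Emit the fourth column once per line.
            outKey.set(fields[3].trim());
            context.write(outKey, ONE);
        }
    }

    public static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text k2, Iterable<LongWritable> v2, Context context)
                throws IOException, InterruptedException {
            long count = 0L;
            for (LongWritable time : v2) {
                count += time.get();
            }
            // The key is the column value, the value is its total count.
            context.write(k2, new LongWritable(count));
        }
    }
}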

7.0 What do you see as the advantages of developing map/reduce in Java, with Streaming, and with Pipes

I have only used Java and HiveQL.

Writing MapReduce in Java can express complex logic, but it is tedious when the requirement is simple.

HiveQL is mostly written against table data in Hive. It is easy to write, but complex logic is hard to express in it.

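8.0 In what ways can Hive store its metadata, and what are the advantages of each

There are three: the built-in embedded Derby database, which is small, rarely used, and only suitable for a single node

MySQL, which is the common choice

I looked up the proper terms online: single-user mode, multi-user mode, remote server mode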

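9.0 Briefly describe how Hadoop implements secondary sort (sorting by both key and value)

The first approach is for the Reducer to buffer all the values of a given key and then sort them inside the Reducer. Because the Reducer has to hold every value of the key, this can end in an out-of-memory error.

The second approach is to add part or all of the value to the original key to form a composite key. Each approach has its merits. The first is simple to write, but it offers little parallelism and is slow on large data (with the risk of running out of memory).

The second hands the sorting over to the shuffle of the MapReduce framework, which fits the design of Hadoop MapReduce better. It needs a Partitioner that sends all records with the same original key (without the added part) to the same Reducer, and a grouping Comparator so that the records arriving at the Reducer are grouped by the original key.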
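The composite-key approach can be sketched without a cluster. The following is a minimal in-memory illustration in plain Java, not Hadoop code: the class and method names (SecondarySortSketch, CompositeKey, sortAndGroup) are made up for this example, and the three pieces stand in for the Partitioner, the sort comparator and the grouping comparator of a real job.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class SecondarySortSketch {

    /** Composite key: the original key plus the part of the value that drives the ordering. */
    static final class CompositeKey {
        final String naturalKey;
        final int value;

        CompositeKey(String naturalKey, int value) {
            this.naturalKey = naturalKey;
            this.value = value;
        }
    }

    /** Partitioner stand-in: only the original key decides the reducer. */
    static int partition(CompositeKey key, int numReducers) {
        return (key.naturalKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    /** Sort comparator stand-in: original key first, then the value. */
    static final Comparator<CompositeKey> SORT =
            Comparator.comparing((CompositeKey k) -> k.naturalKey)
                      .thenComparingInt(k -> k.value);

    /** Sorts all records, then groups them by original key (the grouping comparator's job). */
    static Map<String, List<Integer>> sortAndGroup(List<CompositeKey> records) {
        List<CompositeKey> sorted = new ArrayList<>(records);
        sorted.sort(SORT);
        Map<String, List<Integer>> groups = new LinkedHashMap<>();
        for (CompositeKey k : sorted) {
            groups.computeIfAbsent(k.naturalKey, unused -> new ArrayList<>()).add(k.value);
        }
        return groups;
    }

    public static void main(String[] args) {
        List<CompositeKey> records = Arrays.asList(
                new CompositeKey("b", 2), new CompositeKey("a", 3), new CompositeKey("a", 1));
        System.out.println(sortAndGroup(records)); // {a=[1, 3], b=[2]}
    }
}
```

Each group reaches the "reducer" with its values already ordered, so nothing has to be buffered and sorted in reducer memory.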

10. Briefly describe the ways to implement a join in Hadoop

Map-side join: for joining a large table with a small one; the small table can be shipped with the distributed cache

Reduce-side join

11.0 Implement a non-recursive binary search in Java

public class BinarySearchClass {

    // Returns the index of value in the sorted array, or -1 if it is not found.
    public static int binary_search(int[] array, int value) {
        int beginIndex = 0;               // low index
        int endIndex = array.length - 1;  // high index
        while (beginIndex <= endIndex) {
            // Written this way to avoid int overflow in (begin + end) / 2.
            int midIndex = beginIndex + (endIndex - beginIndex) / 2;
            if (value == array[midIndex]) {
                return midIndex;
            } else if (value < array[midIndex]) {
                endIndex = midIndex - 1;
            } else {
                beginIndex = midIndex + 1;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        System.out.println("Start...");
        int[] myArray = new int[] { 1, 2, 3, 5, 6, 7, 8, 9 };
        System.out.println("Index of the number 8:");
        System.out.println(binary_search(myArray, 8));
    }
}

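12.0 Briefly describe what combine and partition do in MapReduce

Answer: the combiner runs in the last stage of the map side. In principle it is a small reducer. Its main purpose is to reduce the amount of data sent to the reducers, which eases the network bottleneck and makes the reducers more efficient.

Partition assigns all the key/value pairs produced in the map phase to the different reducer tasks, which spreads the processing load of the reduce phase.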

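13.0 The difference between internal and external tables in Hive

When data is loaded into an internal table, Hive moves it to the path the data warehouse points to. For an external table, the user specifies where the data lives when creating the table.

When a table is dropped, an internal table loses both its metadata and its data,

while an external table loses only the metadata and the data stays.

This makes external tables somewhat safer and more flexible in how data is organized, and it makes sharing the source data easier.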

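14. What is a good way to design an HBase rowKey? What is a good way to design column families?

Answer:

A rowKey should follow a rule, ideally so that rows are ordered.

Data that is often read in bulk should have contiguous rowKeys;

keywords that are often used as query conditions should be built into the rowKey.

Column families:

Classify the data by business characteristics and put different categories in different column families.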

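15. How to deal with data skew in MapReduce

The essence: make the data evenly distributed across the partitions

Choose a suitable partition strategy based on the characteristics of the business

If nothing is known in advance about the data distribution, sample it with a random sampler, derive a partition strategy from the sample, and then process the data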
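The sampling idea can be sketched as follows. This is an illustrative plain-Java version with made-up names (SampledRangePartitioner, partitionFor), assuming integer keys and a sample that has at least as many elements as there are partitions. Hadoop ships its own tooling for this (InputSampler together with TotalOrderPartitioner), which this sketch does not reproduce.

```java
import java.util.Arrays;

final class SampledRangePartitioner {
    // Upper boundaries between partitions, taken from the sample's quantiles.
    private final int[] splitPoints;

    SampledRangePartitioner(int[] sample, int numPartitions) {
        int[] sorted = sample.clone();
        Arrays.sort(sorted);
        splitPoints = new int[numPartitions - 1];
        for (int i = 1; i < numPartitions; i++) {
            // The i-th quantile of the sample becomes the i-th boundary.
            splitPoints[i - 1] = sorted[i * sorted.length / numPartitions];
        }
    }

    /** Keys below the first boundary go to partition 0, and so on upward. */
    int partitionFor(int key) {
        int pos = Arrays.binarySearch(splitPoints, key);
        // Found: the key sits on a boundary and opens the next partition.
        // Not found: binarySearch returns -(insertionPoint) - 1.
        return pos >= 0 ? pos + 1 : -(pos + 1);
    }

    public static void main(String[] args) {
        int[] sample = { 1, 1, 1, 2, 50, 100, 3, 2 };
        SampledRangePartitioner p = new SampledRangePartitioner(sample, 4);
        for (int key : new int[] { 0, 1, 3, 100 }) {
            System.out.println(key + " -> partition " + p.partitionFor(key));
        }
    }
}
```

Range boundaries even out a skewed key range, but a single very hot key still lands in one partition; that case needs a different trick, such as appending a random suffix to the key and merging in a second pass.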

16. How to optimize the Hadoop framework

Answer:

There are many angles: how to optimize HDFS, how to optimize MapReduce programs, how to optimize YARN job scheduling, HBase tuning, Hive tuning, and so on.

17. What are the internals of HBase

Answer:

HBase is a database system that can serve online workloads

Physical storage: HBase persists its data on HDFS

Storage management: a table is divided into many regions, which are distributed over many RegionServers

A region is further divided into stores, and a store contains a MemStore and StoreFiles

Version management: an update in HBase is really the appending of a new version; compaction merges the files across versions

Region split

Cluster management: ZooKeeper + HMaster (responsibilities) + HRegionServer (responsibilities)

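18. When developing a distributed computing job, can the reduce phase be left out

Answer: yes. For example, if the cluster is designed purely for storing files and no computation is involved, MapReduce can be dropped altogether.

A concrete map-only example is the behavior-trace enrichment part of the traffic operations project.

How to drop the reduce phase: set the number of reduce tasks to 0, for example with job.setNumReduceTasks(0).

Once it is dropped there is no sorting and no shuffle.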

19. Common data compression algorithms in Hadoop

Answer:

LZO

Gzip

Default (Deflate)

Snappy

If the data is to be compressed, it is best to first convert the raw data to SequenceFile or Parquet (Spark)

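20. The scheduling model of MapReduce (the question is vague; it can be read as the YARN scheduling model or as the internal workflow of MR)

Answer: the AppMaster is the scheduling supervisor and manages the map tasks and reduce tasks

The AppMaster starts and monitors the map tasks and reduce tasks

When a map task finishes, the AppMaster notices and tells the reduce tasks where its output is; the reduce tasks then pull the files from the map side and process them;

when the whole reduce phase is done, the AppMaster unregisters itself from the ResourceManager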

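21. How Hive interacts with the database underneath

Answer:

Hive's query capability is implemented by HDFS combined with MapReduce

The relationship between Hive and MySQL: MySQL is only borrowed to store the metadata of the Hive tables, known as the metastore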

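22. Principles of HBase filters

Answer: you can talk about the parent classes of the filters (comparison filters, dedicated filters)

What filters are for:

They extend what HBase queries can do

They reduce the amount of data the server returns to the client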

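23. How large is the output after reduce (answer with a concrete scenario, e.g. pi)

Enhanced logs of the SCA stage (1.5 TB to 2 TB)

Filtering MR programs output less than their input

Parsing MR programs output more than their input (e.g. finding common friends)

24. Questions posed on the spot to test your command of MapReduce and of Hive's QL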

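25. Under what circumstances does a DataNode not replicate data

Answer: when the client uploads a file with the replica count set to 1

26. In which phase does combine occur

Answer: during the shuffle

More precisely, it runs when the map task output spills from memory to disk, and it may be called several times

Use a combiner with great care: it must not change the final result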

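27. The architecture of HDFS

Answer:

Cluster architecture:

NameNode, DataNode, SecondaryNameNode

(active NameNode, standby NameNode), JournalNode, ZKFC

Internal mechanisms:

Data is stored in a distributed fashion

A single unified directory structure is presented to the outside

A single concrete responder (the NameNode) is presented to the outside

The block mechanism and the replica mechanism

The responsibilities and mechanisms of the NameNode and the DataNodes

The read and write data flows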

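28. The flush process

Answer: flush works on top of memory. When a file is written, the data first goes into memory. When the memory buffer is full, its whole content is written to disk in one go and the buffer is cleared.

29. What is a queue

Answer: a scheduling strategy whose mechanism is first in, first out

30. The difference between List and Set

Answer: List and Set are both interfaces, each with its own implementation classes, some ordered and some unordered. The biggest difference is that a List may contain duplicates while a Set may not. A List keeps elements in insertion order and allows access by index; a Set is suited to fast membership checks, insertion and deletion. The actual performance depends on the implementation class chosen.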

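31. The three normal forms of data

Answer: first normal form (1NF): no repeating columns

Second normal form (2NF): attributes depend fully on the primary key [eliminates partial functional dependencies]

Third normal form (3NF): attributes do not depend on other non-key attributes [eliminates transitive dependencies]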

32. What happens when one of three DataNodes fails?

Answer:

The NameNode notices through the heartbeat mechanism that the DataNode is offline

It re-replicates the blocks of that DataNode elsewhere in the cluster to restore the replica count of the files

The operations team is alerted to respond quickly, sends someone to check and repair the offline DataNode, and brings it back online

33. When Sqoop exports data to MySQL, how can duplicate imports be avoided, and how does Sqoop handle data problems?

Answer: FAILED java.util.NoSuchElementException

This error occurs when the fields Sqoop parses from the file do not match the columns of the MySQL table. The fix is to pass Sqoop a parameter that tells it the field separator of the file so that it can parse the fields correctly.

Hive's default field separator is '\001'

34. Describe where Hadoop uses caching and what each cache is for

Answer:

In the shuffle

HBase: on the client and in the RegionServer

35. MapReduce tuning experience

Answer:

(1) Set a sensible number of maps and reduces, and a sensible block size

(2) Avoid data skew

(3) Use a combine function

(4) Compress the data

(5) Small-file handling: merge them into large files beforehand, use CombineTextInputFormat, or use MapReduce on HDFS to merge small files into a large SequenceFile (key: file name, value: file content)

(6) Parameter tuning

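36. List the files under /etc/ that you have modified and say what problem each change solved

Answer: /etc/profile is mainly used to configure environment variables, so that the hadoop command can be run from any directory.

/etc/sudoers

/etc/hosts

/etc/sysconfig/network

/etc/inittab

37. Describe how you would profile the program above during development, and the process of optimizing it based on the profile.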

38. Given 100 million uniformly distributed integers, find the best algorithm for getting the 1K largest numbers.

See "Massive Data Algorithm Interview Collection" (《海量數(shù)據(jù)算法面試大全》)
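One common way to answer this kind of top-K question (not necessarily the one given in the book cited above) is a min-heap of size K: it holds the K largest values seen so far, and each new number only has to beat the smallest of them. A minimal sketch, with the class and method names chosen for this example:

```java
import java.util.Arrays;
import java.util.PriorityQueue;

final class TopK {

    /** Returns the k largest values in descending order using a min-heap of at most k elements. */
    static int[] largest(int[] data, int k) {
        if (k <= 0) {
            return new int[0];
        }
        PriorityQueue<Integer> heap = new PriorityQueue<>(k); // smallest kept value on top
        for (int x : data) {
            if (heap.size() < k) {
                heap.add(x);
            } else if (x > heap.peek()) {
                heap.poll();   // drop the smallest of the current top k
                heap.add(x);
            }
        }
        int[] out = new int[heap.size()];
        for (int i = out.length - 1; i >= 0; i--) {
            out[i] = heap.poll(); // poll yields ascending order, so fill from the back
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(largest(new int[] { 5, 1, 9, 3, 7, 8 }, 3))); // [9, 8, 7]
    }
}
```

The scan takes O(n log k) time and O(k) memory, so the 100 million inputs never have to be sorted or held in memory at once. In a MapReduce setting each map task could keep its own heap and a single reducer could merge the partial results.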

39. The rough flow of MapReduce

Answer: there are eight main steps

1/ Plan the input splits for the files

2/ Start the corresponding number of map task processes

3/ Call the RecordReader of the FileInputFormat to read one line at a time and wrap it as k1, v1

4/ Call the user-defined map function and pass k1, v1 to it

5/ Collect the map output, then partition and sort it

6/ Start the reduce tasks, which pull the data from the map side

7/ The reduce tasks call the user-defined reduce function

8/ Call the RecordWriter of the OutputFormat to write the results
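The data flow of these steps (map, partition and sort, reduce) can be imitated in memory with a word count. This is only a teaching sketch in plain Java under simplifying assumptions: there are no splits, processes or network copies, and the names MiniMapReduce and wordCount are invented for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class MiniMapReduce {

    static Map<String, Integer> wordCount(List<String> lines, int numReducers) {
        // One sorted buffer per reducer plays the role of the partitioned, sorted map output.
        List<TreeMap<String, List<Integer>>> partitions = new ArrayList<>();
        for (int i = 0; i < numReducers; i++) {
            partitions.add(new TreeMap<>());
        }
        // "map": emit (word, 1) for every word and route it to a partition by hash.
        for (String line : lines) {
            for (String word : line.trim().split("\\s+")) {
                if (word.isEmpty()) {
                    continue;
                }
                int p = (word.hashCode() & Integer.MAX_VALUE) % numReducers;
                partitions.get(p).computeIfAbsent(word, w -> new ArrayList<>()).add(1);
            }
        }
        // "reduce": each partition sees its keys in sorted order with all values grouped.
        Map<String, Integer> result = new TreeMap<>();
        for (TreeMap<String, List<Integer>> partition : partitions) {
            for (Map.Entry<String, List<Integer>> entry : partition.entrySet()) {
                int sum = 0;
                for (int one : entry.getValue()) {
                    sum += one;
                }
                result.put(entry.getKey(), sum);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(wordCount(Arrays.asList("a b a", "b c"), 2)); // {a=2, b=2, c=1}
    }
}
```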

41. Implement the SQL statement select count(x) from a group by b; with MapReduce

44. When building a Hadoop cluster, which services run on the master and which on the slaves

Answer: the master mainly runs the master daemons, and the slaves mainly run the worker daemons.

45. Hadoop parameter tuning

46. How do the syntaxes of Pig Latin and Hive differ

Answer:

47. Describe how to set up HBase and ZooKeeper

48. How Hadoop works

Answer: the core of Hadoop consists of two parts, HDFS and MapReduce. HDFS is a distributed file storage system: it splits a large file into many smaller pieces and stores them on multiple servers.

MapReduce uses the JobTracker and the TaskTrackers to execute jobs. Map fans the work out, and reduce aggregates the processed results.

49. How MapReduce works

Answer: a MapReduce framework consists of a single master JobTracker and one slave TaskTracker per cluster node. The master schedules all the tasks that make up a job; these tasks are distributed over the slaves. The master monitors their execution and re-runs failed tasks. The slaves only execute the tasks the master assigns.

50. The HDFS storage mechanism

Answer: HDFS is a distributed file storage system. The NameNode receives the user's requests, and then, based on the file size and the configured block size, the large file is cut into multiple blocks for storage.

51. Give an example of how MapReduce runs.

WordCount

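52. How to check the health of a Hadoop cluster

Answer: with a thorough cluster monitoring setup (Ganglia, Nagios)

hdfs dfsadmin -report

hdfs haadmin -getServiceState nn1

53. In a MapReduce job with no reduce output, what replaces the function of reduce.

54. How to tune Hive

Answer: Hive queries are ultimately converted into MapReduce jobs, so tuning Hive is really tuning MapReduce. The main areas are: fixing data skew, reducing the number of jobs, setting a sensible number of maps and reduces, merging small files, keeping the whole picture in mind (an optimal single task matters less than an optimal whole), and partitioning by a sensible rule.

55. How to control permissions in Hive

Our company did not do this; it was not needed

56. How does HBase write data?

Answer:

57. Can Hive have multiple databases like a relational database?

Answer: of course it can.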

58. How to handle an HBase outage

Answer: an outage is either an HMaster outage or an HRegionServer outage. If an HRegionServer goes down, the HMaster redistributes the regions it managed to other live RegionServers. Because both the data and the logs are persisted in HDFS, this does not lose data, so consistency and safety are preserved.

If the HMaster goes down, there is no single point of failure either: HBase can run several HMasters, and ZooKeeper's master election ensures that one Master is always running, that is, ZooKeeper makes sure there is always an HMaster serving requests.

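59. Suppose the company wants to build a data center. How would you go about it?

Start with requirements research and analysis

Design the functional breakdown

Architecture design

Throughput estimation

Choice of technologies

Hardware and software selection

Cost-benefit analysis

Project management

Scalability

Security and stability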

60. Single-choice questions

1. Which program is responsible for HDFS data storage? Answer: C

a) NameNode b) JobTracker c) DataNode d) SecondaryNameNode e) TaskTracker

2. How many copies of a block does HDFS keep by default? Answer: A

a) 3 b) 2 c) 1 d) Not fixed

3. Which program usually starts on the same node as the NameNode?

a) SecondaryNameNode b) DataNode c) TaskTracker d) JobTracker e) ZKFC

4. The author of Hadoop. Answer: C

a) Martin Fowler b) Kent Beck c) Doug Cutting

5. The default HDFS block size. Answer: B (64 MB in Hadoop 1.x; 128 MB from Hadoop 2.x on)

a) 32MB b) 64MB c) 128MB

6. Which of the following is usually the main bottleneck of a cluster? Answer: D

a) CPU b) Network c) Disk d) Memory

7. Which statement about the SecondaryNameNode is correct? Answer: C

a) It is a hot standby of the NameNode

b) It has no memory requirements

c) Its purpose is to help the NameNode merge the edit log and shorten NameNode startup time

d) The SecondaryNameNode should be deployed on the same node as the NameNode

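Multiple-choice questions:

8. Which of the following can be used as cluster management tools? Answer: ABCD (the question itself is flawed)

a) Puppet b) Pdsh c) Cloudera Manager d) ZooKeeper

9. Which statements about configuring rack awareness are correct? Answer: ABC

a) If one rack fails, reads and writes are not affected

b) Data is written to DataNodes in different racks

c) MapReduce fetches data from the network location closest to itself based on rack information

10. Which statements are correct when a client uploads a file? Answer: B (the source gives BC, but replicas are forwarded through the DataNode pipeline rather than copied by the NameNode)

a) Data passes through the NameNode on its way to the DataNode

b) The client splits the file into blocks and uploads them in order

c) The client uploads the data to only one DataNode, and the NameNode then takes care of block replication

11. Which of the following are Hadoop run modes? Answer: ABC

a) Standalone b) Pseudo-distributed c) Fully distributed

12. In which ways does Cloudera offer to install CDH? Answer: ABCD

a) Cloudera Manager b) Tarball c) Yum d) Rpm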

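True or false (according to the contributor, every statement below is false):

13. Ganglia can not only monitor but also raise alerts. ( )

14. The block size cannot be changed. ( )

15. Nagios cannot monitor a Hadoop cluster because it offers no Hadoop support. ( )

16. If the NameNode terminates unexpectedly, the SecondaryNameNode takes over so that the cluster keeps working. ( )

17. Cloudera CDH must be paid for. ( )

18. Hadoop is written in Java, so MapReduce programs can only be written in Java. ( )

19. Hadoop supports random reads and writes of data. ( )

20. The NameNode manages the metadata; on every client read or write request it reads the metadata from disk, or writes it to disk, and reports back to the client. ( )

21. The NameNode's local disk stores the location information of the blocks. ( )

22. A DataNode keeps in contact with the NameNode over a long-lived connection. ( )

23. Hadoop itself has strict permission management and security measures to keep the cluster running properly. ( )

24. Slave nodes store data, so the larger their disks the better. ( )

25. The hadoop dfsadmin -report command is used to detect corrupt HDFS blocks. ( )

26. Hadoop's default scheduling policy is FIFO. ( )

27. Every node in the cluster should be configured with RAID so that a single disk failure does not affect the whole node. ( )

28. Because HDFS keeps multiple replicas, the NameNode is not a single point of failure. ( )

29. Each map slot (process) is a thread. ( )

30. A MapReduce input split is exactly one block. ( )

31. The NameNode's default Web UI port is 50030, and the web service is started by Jetty. ( )

32. The HADOOP_HEAPSIZE environment variable sets the memory of all Hadoop daemons. Its default is 200 GB. ( )

33. When a DataNode first joins the cluster and its log reports incompatible file versions, the NameNode must run "hadoop namenode -format" to format the disk. ( )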

63. Talk about the differences between hadoop1 and hadoop2

Answer:

hadoop1 consists mainly of HDFS and MapReduce. HDFS stores the data and MapReduce does the computation. In HDFS the NameNode stores the metadata and the DataNodes store the data. The JobTracker receives the user's requests and then allocates resources to run the tasks.

hadoop2 first of all removes the NameNode single point of failure by running two NameNodes in an HA (high availability) setup. The two NameNodes share the same namespace; one is active and the other is standby, and clients talk to the active one. JournalNodes store the edit log of the metadata: the active NameNode writes to the JournalNodes and the standby NameNode reads from them. Together these make up Hadoop HA.

hadoop2 also no longer has the JobTracker. Resources are managed and scheduled uniformly by YARN, which consists of the ResourceManager and the NodeManagers. After the ResourceManager receives a user request, it starts an ApplicationMaster on a NodeManager. The ApplicationMaster requests resources from the ResourceManager and, once they are granted, has the tasks run on the corresponding machines.

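64. What is the difference between value objects and reference objects?

65. What is your understanding of reflection, and what is it used for?

Answer: in Java, a class we have written becomes a .class file after compilation. At run time we can obtain the Class object of that class and then operate on the class through it. That is Java's reflection mechanism.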
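As a small illustration of operating on a class through its Class object, the sketch below looks up a public no-argument method by name at run time and invokes it. ReflectionDemo and callNoArg are names invented for this example; Class.getMethod and Method.invoke are the standard reflection calls.

```java
import java.lang.reflect.Method;

final class ReflectionDemo {

    /** Finds a public no-argument method by name on the target's class and invokes it. */
    static Object callNoArg(Object target, String methodName) {
        try {
            Class<?> type = target.getClass();           // the run-time Class object
            Method method = type.getMethod(methodName);  // look the method up by name
            return method.invoke(target);                // call it without a compile-time reference
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException("reflective call failed: " + methodName, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(callNoArg("hadoop", "length"));      // calls String.length()
        System.out.println(callNoArg("hadoop", "toUpperCase")); // calls String.toUpperCase()
    }
}
```

Frameworks use the same mechanism to instantiate classes that are only named in configuration, which is how a MapReduce job can be told its Mapper and Reducer through setMapperClass and setReducerClass.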

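66. The differences, strengths and weaknesses of ArrayList, Vector and LinkedList. The differences, strengths and weaknesses of HashMap and Hashtable.

Answer: ArrayList and Vector store their data in arrays. Vector uses synchronized methods (thread-safe), so it performs worse than ArrayList. LinkedList is a doubly linked list: access by index has to walk forward or backward through the list, but an insertion only needs to update the neighboring links, so inserting is fast.

HashMap and Hashtable: Hashtable's methods are synchronized and HashMap's are not. Hashtable is based on the old Dictionary class, while HashMap is an implementation of the Map interface introduced in Java 1.2. HashMap is not synchronized, which makes it faster; Hashtable is synchronized, which makes it slower. A HashMap can still be made synchronized when needed, so in practice HashMap is usually the class to prefer.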

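67. The block size defaults to 64M. What is the effect of changing it to 128M?

Answer: the block size should be chosen according to the actual production workload. If blocks are too small, a large file is cut into a great many blocks, which slows uploads and increases the metadata the NameNode has to keep in memory. If blocks are larger, there are fewer blocks and less NameNode memory pressure, but also fewer input splits, so a job over the same data gets fewer map tasks and less parallelism. A file smaller than a block does not occupy a full block on disk.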

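69. How does RPC work?

Answer:

1. The caller invokes the client stub and passes the arguments

2. The client stub calls the local system kernel to send a network message

3. The message travels to the remote host

4. The server stub receives the message and extracts the arguments

5. The remote procedure is executed

6. The procedure returns the result to the server stub

7. The server stub returns the result by calling the remote system kernel

8. The message travels back to the local host

9. The client stub receives the message from the kernel

10. The caller receives the data returned by the stub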

70. Do you have any Hadoop tuning experience or usage tips? (start from parameter tuning)

dfs.block.size

MapReduce:

io.sort.mb

io.sort.spill.percent

mapred.local.dir

mapred.map.tasks & mapred.tasktracker.map.tasks.maximum

mapred.reduce.tasks & mapred.tasktracker.reduce.tasks.maximum

mapred.reduce.max.attempts

mapred.reduce.parallel.copies

mapreduce.reduce.shuffle.maxfetchfailures

mapred.child.java.opts

mapred.reduce.tasks.speculative.execution

mapred.compress.map.output & mapred.map.output.compression.codec

mapred.reduce.slowstart.completed.maps

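72. From your own experience, how do you prevent full table scans

Answer:

1. Avoid null checks on a column in the where clause as far as possible, otherwise the engine gives up the index and does a full table scan

2. Avoid the != or <> operators in the where clause as far as possible, otherwise the engine gives up the index and does a full table scan

3. Avoid joining conditions with or in the where clause as far as possible, otherwise the engine gives up the index and does a full table scan

4. For in and not in, use a concrete column list instead, and do not return columns that are not needed. Use in with care as well, otherwise it leads to a full table scan

5. Avoid fuzzy (like) queries

6. Never use select * from t anywhere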

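73. The strengths of ZooKeeper and where it is used

Answer: it greatly simplifies the development of distributed applications (lightweight, low cost, good performance, high stability and reliability)

75. What is the command for appending a public key to the authorized keys file? Is it run as root?

Answer: ssh-copy-id

Run it as whichever user needs passwordless login

76. In what order are the services of a Hadoop HA cluster started and stopped?

Answer:

77. Which algorithms have you used in Hadoop development, and in what scenarios?

Answer: sorting, grouping, top-K, join, group

78. Which cluster operations tools have you used at work? Explain what each one does.

Answer: nmon, Ganglia, Nagios

79. How does a single machine cope with so many requests, how is high concurrency actually achieved, how does a request come about, how is it handled on the server, how is it finally returned to the user, and how does the operating system control the whole chain?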

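80. Does Java pass by value or by reference?

Answer: Java always passes by value. For an object argument, the value that is passed is a copy of the reference, so a method can modify the object the caller sees but cannot make the caller's variable point to a different object.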
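How Java passes arguments is easy to check with a small experiment. In the sketch below (class and method names are only for illustration), one method reassigns its parameter and another mutates the object it refers to; only the mutation is visible to the caller, because the method receives a copy of the reference.

```java
final class PassByValueDemo {

    /** Reassigning the parameter only changes the method's local copy of the reference. */
    static void reassign(StringBuilder sb) {
        sb = new StringBuilder("replaced");
    }

    /** Mutating through the copied reference changes the object the caller also sees. */
    static void mutate(StringBuilder sb) {
        sb.append(" world");
    }

    static String demo() {
        StringBuilder text = new StringBuilder("hello");
        reassign(text); // no visible effect for the caller
        mutate(text);   // visible: same object
        return text.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo()); // hello world
    }
}
```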

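81. Q: How many servers do you have?

More than 100

82. Q: How much memory do your servers have?

128G or 64G

83. How do you pre-split regions in HBase?

Pre-splitting can be done with a shell command when creating the table, or in code when the table is created

(for the exact commands, see the notes summary)

84. How does HBase offer an interface to a web front end (HTable provides access to HBase, but how do you query multiple versions of the same record)?

Answer: use HTable to access HBase; timestamps can be used to keep multiple versions of a record.

85. Does the HTable API have thread-safety issues? Should it be used as a singleton or as multiple instances?

Multiple instances: there are thread-safety issues when several threads access the same table.

86. What do you use to import your data into the database? Into which database?

Exporting after processing: once Hive has processed the data, it is exported with Sqoop to a MySQL database for the reporting layer.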

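87. How large is your business data? How many rows? (asked at all three companies I interviewed with)

Development uses a subset of the data, not the full set. It is close to 100 million rows (80 to 90 million; I do not know exactly, and nobody usually pays much attention to this during development)

88. Do you process data directly from the database or from text files?

The log data is imported into HDFS and then processed

89. Roughly how many HQL statements have you written for Hive?

I am not sure; I never kept count when writing them

90. Roughly how many jobs do you submit? How long do they take to finish? (asked at all three companies I interviewed with)

I never counted; including test runs there are a lot

SCA stage: one job per hour, each taking about 12 minutes

ETL stage: more than 2,000 jobs, run one after another starting at midnight and all finished by about 5 a.m.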

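91. What is the difference between Hive and HBase?

Answer: Hive and HBase are two different technologies built on Hadoop. Hive is a SQL-like engine that runs MapReduce jobs; HBase is a NoSQL key/value database on top of Hadoop. The two can of course be used together. Just as you use Google to search and Facebook to socialize, Hive can be used for statistical queries and HBase for real-time lookups. Data can also be written from Hive to HBase, and even written back from HBase to Hive.

92. What were your main tasks in the project?

Leader

Preprocessing system, real-time mobile phone location query system, detailed billing record system, SCA behavior-trace enrichment subsystem, template-matching extraction system for content recognition

Design, architecture, technology selection, quality control, keeping milestones on schedule, and so on

93. What difficult problems did you meet in the project, and how did you solve them?

The requirement that Storm obtain real-time location information over dynamic ports

101. How does a job run (the flow of submitting a job)?

102. What are the use cases of the various frameworks in the Hadoop ecosystem?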

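103. What are the differences between the Hive storage formats RCFile, TextFile and SequenceFile?

For files of the same size in these 3 formats, which one takes up more space, and so on

1. The amount of data read with RCFile (373.94MB) is far smaller than with SequenceFile (2.59GB)

2. The former (68 seconds) also runs much faster than the latter (194 seconds)

Judging from the run progress above, Snappy is far faster than bz.

Compression in Hive should be used flexibly. For source data, use RCFile+bz or RCFile+gz, which saves a great deal of disk space. During computation, where speed matters, a little disk space can be sacrificed, and RCFile+snappy is recommended because it speeds up Hive overall.

LZO can also be used during computation, but weighing speed against compression ratio, Snappy is still the better fit.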

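104. Suppose Flume has collected the data as many small files, and these files have to be merged when MR processes them

(the optimization is to be done in MR, so that not every small file gets its own map task)

That company mainly does traffic billing for China Telecom and writes MR full time.

105. Explain the two terms "Hadoop" and "Hadoop ecosystem"

109. Are "MapReduce 2.0" and "YARN" the same thing? Try to explain

MapReduce 2.0 --> MapReduce + YARN

110. In MapReduce 2.0, what is the main role of the MRAppMaster, and how does it implement task fault tolerance?

111. Why was YARN created, what problem does it solve, and what are its advantages?

114. How many replicas do you keep, and what do you do when the data exceeds the storage capacity?

3 replicas; add a few more nodes

115. How do you relieve the pressure caused by many jobs running at once, and how would you optimize? Describe your approach.

Add computing capacity

116. What data do you store in HBase?

Detailed traffic records

117. What performance figures can your Hive processing reach?

118. What is the role of the RecordReader in Hadoop?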

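1. Among the main common InputFormats defined in Hadoop, which is the default? TextInputFormat

2. What is the difference between the two classes TextInputFormat and KeyValueInputFormat?

Answer: TextInputFormat reads text files line by line, with the byte offset of the line as the key and the line itself as the value. KeyValueInputFormat also reads text lines, but splits each line into a key and a value at the first tab character.

3. In a running Hadoop job, what is an InputSplit?

An InputSplit is a logical slice of the input produced by the InputFormat. The input files are divided into splits,

and each split is handled by one map task

4. How is file splitting invoked in the Hadoop framework?

InputFormat --> TextInputFormat --> RecordReader --> LineRecordReader --> LineReader

5. Consider this M/R scenario: the HDFS block size is 64MB, the input class is FileInputFormat, and there are 3 files of 64KB, 65MB and 127MB

How many map tasks are created? 4. The 65M file produces only one split, because the last split of a file may be up to 1.1 times the split size (for details, see the TextInputFormat source analysis in the notes summary)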
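The count of 4 can be reproduced with a few lines that imitate the loop FileInputFormat uses when cutting a file into splits. The sketch below is a simplification in plain Java: it assumes the split size equals the block size and that files are not empty, and the class and method names are invented; the 1.1 factor mirrors the SPLIT_SLOP constant in FileInputFormat.

```java
final class SplitCount {
    // The tail of a file may exceed the split size by up to 10% before it gets its own split.
    static final double SPLIT_SLOP = 1.1;

    /** Number of splits for one non-empty file, assuming split size == block size. */
    static int splitsFor(long fileBytes, long splitBytes) {
        int splits = 0;
        long remaining = fileBytes;
        while ((double) remaining / splitBytes > SPLIT_SLOP) {
            splits++;
            remaining -= splitBytes;
        }
        if (remaining > 0) {
            splits++; // whatever is left becomes the last split
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        long block = 64 * mb;
        long[] files = { 64 * 1024L, 65 * mb, 127 * mb };
        int total = 0;
        for (long size : files) {
            total += splitsFor(size, block);
        }
        System.out.println(total); // 1 + 1 + 2 = 4
    }
}
```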

8. If no custom partitioner is defined, how is the data partitioned before it reaches the reducers?

Hadoop has a default partitioner class, HashPartitioner. It takes the hash of the map output key k2 to decide which reducer the pair k2, v2 is sent to.
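The hashing rule fits in one line. The sketch below shows the usual formula in plain Java (the class name is made up, and a real HashPartitioner works on Writable keys inside a job): the bit mask clears the sign bit so that a negative hash code cannot produce a negative partition number.

```java
final class HashPartitionSketch {

    /** Maps a key to a partition in [0, numReduceTasks) based on its hash code. */
    static int partitionFor(Object key, int numReduceTasks) {
        // & Integer.MAX_VALUE drops the sign bit, so the remainder is never negative.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("a", 3));  // "a".hashCode() is 97, and 97 % 3 = 1
        System.out.println(partitionFor(-7, 4));   // still within 0..3 despite the negative hash
    }
}
```

Because the result depends only on the key, every map task sends a given key to the same reducer.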

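10. Give one example of when to use a combiner and one of when not to

Computing an average is a case where a plain combiner must not be used: the mean of partial means is generally not the overall mean, so the result would be wrong. In other cases a combiner can be used, depending on the situation, to shrink the map output and the files copied to the reducers, which lightens the load on the reducers, saves network traffic and improves efficiency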
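A tiny calculation shows what goes wrong when the averaging reducer is reused as a combiner, and one way around it. The sketch is plain Java with invented names; each inner array stands for the values one map task emits for the same key.

```java
final class CombinerMeanSketch {

    /** Wrong: averages each map task's values first, then averages those averages. */
    static double meanOfMeans(double[][] mapOutputs) {
        double sumOfMeans = 0;
        for (double[] values : mapOutputs) {
            double sum = 0;
            for (double v : values) {
                sum += v;
            }
            sumOfMeans += sum / values.length;
        }
        return sumOfMeans / mapOutputs.length;
    }

    /** Right: the "combiner" emits (sum, count) pairs and only the reducer divides. */
    static double meanFromSumAndCount(double[][] mapOutputs) {
        double totalSum = 0;
        long totalCount = 0;
        for (double[] values : mapOutputs) {
            for (double v : values) {
                totalSum += v;
            }
            totalCount += values.length;
        }
        return totalSum / totalCount;
    }

    public static void main(String[] args) {
        double[][] mapOutputs = { { 1, 2, 3 }, { 10 } };
        System.out.println(meanOfMeans(mapOutputs));         // 6.0, not the true mean
        System.out.println(meanFromSumAndCount(mapOutputs)); // 4.0
    }
}
```

Sums and counts can be combined in any grouping, which is the property a combiner needs; a ratio such as the mean does not have it.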

11. What is the difference between a job and tasks in Hadoop?

A job is our abstraction of one complete MapReduce program

A task is a concrete instance of one processing stage while the job runs, such as a map task or a reduce task; both map tasks and reduce tasks have many instances running concurrently

12. Hadoop achieves parallelism by splitting work across many nodes, but a few slow nodes can drag down the whole job. What mechanism does Hadoop use to deal with this?

Speculative execution

14. Is it possible for a Hadoop job to write to multiple directories? If so, how?

With a custom OutputFormat or with the MultipleOutputs utility

15. How do you set the number of mappers for a Hadoop job?

Through the split mechanism

16. How do you set the number of reducers to create for a Hadoop job?

It can be set in code

How many to set should depend on the hardware configuration and the type of processing

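The following are the HBase points I really do not understand:

1. How do you pre-split regions in HBase?

2. How does HBase offer an interface to a web front end (HTable provides access to HBase, but how do you query multiple versions of the same record)?

3. Does the HTable API have thread-safety issues? Should it be a singleton or multiple instances in a program?

4. Our HBase holds roughly 4 tables for the company's business (mainly an online shop). How many column families are there, and roughly what data do they store?

The following are Storm questions:

1. Are a MetaQ message queue, a ZooKeeper cluster and a Storm cluster (including zeromq, jzmq and Storm itself) enough to build the shop's recommendation system? Is any other middleware needed?

Mahout

2. How does Storm count words?

The text below is quoted from 神之子, "Questions you may meet in a Hadoop interview" (《hadoop面試可能遇到的問題》)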

Q1. Name the most common InputFormats defined in Hadoop. Which one is the default?

The following 3 are the most common InputFormats defined in Hadoop:

- TextInputFormat (the default)

- KeyValueInputFormat

- SequenceFileInputFormat

Q2. What is the difference between the TextInputFormat and KeyValueInputFormat classes?

TextInputFormat: reads lines of text files and provides the offset of the line as the key and the actual line as the value to the Mapper.

KeyValueInputFormat: reads a text file and parses lines into key, value pairs. Everything up to the first tab character is sent as the key to the Mapper, and the remainder of the line is sent as the value.

Q3. What is an InputSplit in Hadoop?

When a Hadoop job is run, it splits the input files into chunks and assigns each split to a mapper to process. Such a chunk is called an InputSplit.

Q4. How is the splitting of a file invoked in the Hadoop framework?

It is invoked by the Hadoop framework by running the getSplits() method of the InputFormat class (such as FileInputFormat) defined by the user.

Q5. Consider this scenario: in an M/R system,

- the HDFS block size is 64MB

- the input format is FileInputFormat

- we have 3 files of size 64KB, 65MB and 127MB

How many input splits will the Hadoop framework make?

Hadoop will make 4 splits as follows:

- 1 split for the 64KB file

- 1 split for the 65MB file (the last split of a file may be up to 1.1 times the split size, so the extra 1MB does not get a split of its own)

- 2 splits for the 127MB file

Q6. What is the purpose of the RecordReader in Hadoop?

The InputSplit defines a slice of work but does not describe how to access it. The RecordReader class actually loads the data from its source and converts it into (key, value) pairs suitable for reading by the Mapper. The RecordReader instance is defined by the InputFormat.

Q7. After the Map phase finishes, the Hadoop framework does "Partitioning, Shuffle and Sort". Explain what happens in this phase.

- Partitioning

Partitioning is the process of determining which reducer instance will receive which intermediate keys and values. Each mapper must determine for all of its output (key, value) pairs which reducer will receive them. It is necessary that for any key, regardless of which mapper instance generated it, the destination partition is the same.

- Shuffle

After the first map tasks have completed, the nodes may still be performing several more map tasks each. But they also begin exchanging the intermediate outputs from the map tasks to where they are required by the reducers. This process of moving map outputs to the reducers is known as shuffling.

- Sort

Each reduce task is responsible for reducing the values associated with several intermediate keys. The set of intermediate keys on a single node is automatically sorted by Hadoop before they are presented to the Reducer.

Q9. If no custom partitioner is defined in Hadoop, how is data partitioned before it is sent to the reducer?

The default partitioner computes a hash value for the key and assigns the partition based on this result.

Q10. What is a Combiner?

The Combiner is a "mini-reduce" process which operates only on data generated by a mapper. The Combiner will receive as input all data emitted by the Mapper instances on a given node. The output from the Combiner is then sent to the Reducers, instead of the output from the Mappers.

Q11. Give an example scenario where a combiner can be used and one where it cannot be used.

There can be several examples; the following are the most common ones.

- Scenario where you can use a combiner

Getting the list of distinct words in a file

- Scenario where you cannot use a combiner

Calculating the mean of a list of numbers

Q12. What is the JobTracker?

The JobTracker is the service within Hadoop that runs MapReduce jobs on the cluster.

Q13. What are some typical functions of the JobTracker?

The following are some typical tasks of the JobTracker:

- It accepts jobs from clients

- It talks to the NameNode to determine the location of the data

- It locates TaskTracker nodes with available slots at or near the data

- It submits the work to the chosen TaskTracker nodes and monitors the progress of each task by receiving heartbeat signals from the TaskTrackers

Q14. What is a TaskTracker?

A TaskTracker is a node in the cluster that accepts tasks (Map, Reduce and Shuffle operations) from a JobTracker.

Q15. What is the relationship between jobs and tasks in Hadoop?

One job is broken down into one or many tasks in Hadoop.

Q16. Suppose Hadoop spawned 100 tasks for a job and one of the tasks failed. What will Hadoop do?

It will restart the task on some other TaskTracker, and only if the task fails more than 4 times (the default setting, which can be changed) will it kill the job.

Q17. Hadoop achieves parallelism by dividing the tasks across many nodes. It is possible for a few slow nodes to rate-limit the rest of the program and slow it down. What mechanism does Hadoop provide to combat this?

Speculative execution

Q18. How does speculative execution work in Hadoop?

The JobTracker makes different TaskTrackers process the same input. When tasks complete, they announce this fact to the JobTracker. Whichever copy of a task finishes first becomes the definitive copy. If other copies were executing speculatively, Hadoop tells the TaskTrackers to abandon the tasks and discard their outputs. The Reducers then receive their inputs from whichever Mapper completed successfully first.

Q19. Using the command line in Linux, how will you

- see all jobs running in the Hadoop cluster

- kill a job

Answer:

- hadoop job -list

- hadoop job -kill jobid

Q20. What is Hadoop Streaming?

Streaming is a generic API that allows programs written in virtually any language to be used as Hadoop Mapper and Reducer implementations.

Q21. What characteristic of the Streaming API makes it flexible enough to run MapReduce jobs in languages like Perl, Ruby, Awk etc.?

Hadoop Streaming allows arbitrary programs to be used for the Mapper and Reducer phases of a MapReduce job by having both Mappers and Reducers receive their input on stdin and emit output (key, value) pairs on stdout.

Q22. What is the Distributed Cache in Hadoop?

The Distributed Cache is a facility provided by the Map/Reduce framework to cache files (text, archives, jars and so on) needed by applications during execution of the job. The framework will copy the necessary files to the slave node before any tasks for the job are executed on that node.

Q23. What is the benefit of the Distributed Cache? Why can't we just have the file in HDFS and have the application read it?

Because the Distributed Cache is much faster. It copies the file to all TaskTrackers at the start of the job. If a TaskTracker then runs 10 or 100 mappers or reducers, they all use the same copy from the Distributed Cache. If, on the other hand, the MR job contains code that reads the file from HDFS, every mapper will access it from HDFS, so a TaskTracker that runs 100 map tasks will read this file 100 times from HDFS. HDFS is also not very efficient when used like this.

Q24. What mechanism does the Hadoop framework provide to synchronize changes made in the Distributed Cache during the runtime of the application?

This is a trick question. There is no such mechanism. The Distributed Cache is by design read-only during job execution.

Q25. Have you ever used Counters in Hadoop? Give us an example scenario.

Anybody who claims to have worked on a Hadoop project is expected to have used counters.

Q26. Is it possible to provide multiple inputs to Hadoop? If yes, how can you give multiple directories as input to a Hadoop job?

Yes. The input format class provides methods to add multiple directories as input to a Hadoop job.

Q27. Is it possible to have Hadoop job output in multiple directories? If yes, how?

Yes, by using the MultipleOutputs class.

Q28. What will a Hadoop job do if you try to run it with an output directory that is already present? Will it

- overwrite it

- warn you and continue

- throw an exception and exit

The Hadoop job will throw an exception and exit.

Q29. How can you set an arbitrary number of mappers to be created for a job in Hadoop?

This is a trick question. You cannot set it.

Q30. How can you set an arbitrary number of reducers to be created for a job in Hadoop?

You can either do it programmatically by using the setNumReduceTasks method of the JobConf class, or set it as a configuration setting.