Hadoop

Distributed storage and computing

Terminology

  • Cluster, forms the datalake

  • Node, single host inside the cluster

  • NameNode, node that keeps the directory tree of the Hadoop file system

  • DataNode, slave node that stores files and is instructed by the NameNode

  • Primary NameNode, current active node responsible for keeping the directory structure

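  • Secondary NameNode, checkpointing helper that periodically merges the edit log into the file system image. Despite the name it is not a hot standby; in HA setups that role belongs to the Standby NameNode, of which there may be multiple inside the cluster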

  • Master Node, runs Hadoop management services such as the HDFS NameNode or the YARN ResourceManager

  • Slave Node, runs Hadoop worker services such as HDFS DataNodes or MapReduce tasks. A node can be master and slave at the same time

  • Edge Node, hosts Hadoop user apps like Zeppelin or Hue

  • Kerberised, cluster with security enabled through Kerberos

  • HDFS, Hadoop Distributed File System, storage layer for unstructured data

  • Hive, primary DB for structured data

  • YARN, scheduling jobs and resource management

  • MapReduce, distributed filtering, sorting and reducing

  • Hue, GUI for HDFS and Hive

  • ZooKeeper, cluster coordination and management

  • Kafka, message broker

  • Ranger, centralized access control (ACL policies) for the cluster

  • Zeppelin, data analytics inside a web UI

Zeppelin

Keytabs

  • Find keytab files (for example generated with ktpass) to authenticate against the Kerberos KDC
  • List the principals stored in a keytab and use one of them to obtain a ticket
klist -k <keytabfile>
kinit -V -k -t <keytabfile> <principal name>
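
The search for keytab files and the principal listing can be combined into one loop. The following is a minimal sketch, assuming keytabs carry the common `.keytab` suffix (files named differently are missed); `find_keytabs` and `dump_principals` are hypothetical helper names, and `/etc` is only an assumed default search root.

```shell
#!/usr/bin/env bash
# Sketch only: locate keytab files below a directory and list their principals.
# find_keytabs and dump_principals are hypothetical helpers, not standard tools.

find_keytabs() {
    # $1: directory to search (assumed default: /etc); matches *.keytab only
    find "${1:-/etc}" -type f -name '*.keytab' 2>/dev/null
}

dump_principals() {
    # $1: directory to search; prints each keytab path, then its principals
    find_keytabs "$1" | while IFS= read -r kt; do
        echo "== $kt"
        klist -k "$kt"
    done
}

# Example:
#   dump_principals /etc/security
```

Any principal printed this way can then be passed to kinit together with the keytab it came from.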

HDFS

  • Use the hdfs utility to enumerate the distributed network storage
hdfs dfs -ls /
  • The local user and the user on the storage do not have to correspond
  • Touched files on the storage may be owned by root
hdfs dfs -touchz /tmp/testfile
hdfs dfs -ls /tmp
  • Impersonate a user by authenticating with that user's keytab file. NodeManager is the highest user with regard to permissions
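
Impersonation through a keytab can be wrapped so that the borrowed ticket never overwrites the current one. This is a rough sketch, assuming MIT Kerberos tooling that honours the `KRB5CCNAME` environment variable; `run_as_principal` is a hypothetical helper name, and the keytab and principal remain placeholders to fill in from `klist -k`.

```shell
#!/usr/bin/env bash
# Sketch only: run one command under another principal's Kerberos identity.
# run_as_principal is a hypothetical helper, not part of Hadoop or Kerberos.

run_as_principal() (
    # $1: keytab file, $2: principal taken from `klist -k`, rest: command to run
    keytab=$1
    principal=$2
    shift 2

    # Private ticket cache, so the caller's own tickets are left untouched
    ccdir=$(mktemp -d) || exit 1
    trap 'rm -rf "$ccdir"' EXIT
    export KRB5CCNAME="FILE:$ccdir/krb5cc"

    # Obtain a ticket from the keytab, then run the command with it
    kinit -k -t "$keytab" "$principal" || exit 1
    "$@"
)

# Example (placeholders as in the notes above):
#   run_as_principal <keytabfile> <principal name> hdfs dfs -ls /
#   run_as_principal <keytabfile> <principal name> hdfs dfs -stat '%u:%g %n' /tmp/testfile
```

The function body runs in a subshell, so the exported cache path and the cleanup trap do not leak into the calling shell. The `-stat` call in the example prints owner, group and name of a file, which is one way to check who a touched file ends up belonging to.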