

This is experimental

The software is written in Rust (2018 edition, safe code only). At the moment I am having fun writing Rust and testing language features. The code should be modular enough that you can replace any function you deem awful. Error handling is subpar at the moment. There has been no real unit testing to speak of since the switch to asynchronous functionality. Testing will come back.

This version is the successor to the _POSIX_C_SOURCE 200809L implementation, in which all data parsed from pcap/pcapng files is written as one single, simple query. Ingestion is rather fast (tested writes: 100*10^3 TCP packets in ~1.8 sec) but may be insecure. See the other repository for more information. The idea of this iteration is to use prepared statements and to chunk the data according to the maximum input size. Postgres imposes a maximum on each insert query issued through a prepared statement. The chunk size is read from the config/interface file parser.json as insert_max. Data can be read from PCAP/PCAPNG files as well as from network devices.
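As a minimal sketch of what "a prepared statement sized by insert_max" can mean, the function below builds a multi-row INSERT with one numbered placeholder per row. The function name and the value 3 are assumptions made for illustration; the table json_dump and column packet are borrowed from the example query further down. It is not the statement text this project actually generates.

```rust
/// Builds a multi-row INSERT with one `$n` placeholder per row.
/// Hypothetical helper: `rows` is expected to be greater than zero,
/// otherwise the resulting SQL has an empty VALUES list.
fn insert_statement(table: &str, column: &str, rows: usize) -> String {
    let placeholders: Vec<String> = (1..=rows).map(|n| format!("(${})", n)).collect();
    format!(
        "INSERT INTO {} ({}) VALUES {}",
        table,
        column,
        placeholders.join(", ")
    )
}

fn main() {
    // Assumed value for the example; the real one is read from parser.json.
    let insert_max = 3;
    // prints: INSERT INTO json_dump (packet) VALUES ($1), ($2), ($3)
    println!("{}", insert_statement("json_dump", "packet", insert_max));
}
```

Because the placeholder count is fixed when the statement is prepared, the statement can be reused for every chunk that holds exactly insert_max rows.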

The process is as follows:

  • Choose between a network device (which must be specified) and file input.
  • Choosing a device is straightforward: data gets parsed and chunked, and queries are prepared according to the insert_max size.
  • The encapsulation type / link type is chosen beforehand. Currently Ethernet and RawIp are supported.
  • Choosing file input means selecting a directory where your PCAP/PCAPNG files reside.
  • A hash map of key (path) : value (metadata) is built from the pcap files found in the specified directory.
  • The parser gets invoked, which itself calls the appropriate protocol handler on the byte data of the packets yielded by pcap. A vector of type QryData is returned after EOF has been hit.
  • The QryData vector is serialized.
  • The serialized data gets chunked.
  • Statements are prepared according to the chunk size.
  • The data is inserted in chunks afterwards.
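The chunking step from the list above can be sketched with the standard library's slice chunks. The function name and the sample rows are made up for the example; only insert_max comes from this README.

```rust
/// Splits serialized rows into chunks of at most `insert_max` rows.
/// Note: `chunks` panics if `insert_max` is 0.
fn chunk_rows(rows: &[String], insert_max: usize) -> Vec<&[String]> {
    rows.chunks(insert_max).collect()
}

fn main() {
    // Seven placeholder rows standing in for serialized packets.
    let rows: Vec<String> = (0..7).map(|i| format!("{{\"packet\":{}}}", i)).collect();
    for (i, chunk) in chunk_rows(&rows, 3).iter().enumerate() {
        println!("chunk {}: {} rows", i, chunk.len());
    }
}
```

The last chunk may be shorter than insert_max (here 3, 3 and 1 rows), so a statement prepared for a full chunk cannot be reused unchanged for the remainder.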

Currently, Ethernet, IPv4, IPv6, TCP, UDP and ARP/RARP network protocols are handled. For testing purposes, the table layout is serialized JSON. The table layout is somewhat "dynamic": any protocols not recognized in a parsed packet will be marked as NULL inside the resulting table row. A query may look like this: select packet from json_dump where packet->>'ipv4_header' is not null;
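To illustrate how a handler can end up with "nothing" for an unrecognized protocol, here is a small sketch that reads the EtherType field of an Ethernet frame and returns None for anything it does not know. The enum and function names are hypothetical, and VLAN-tagged frames are ignored; the project's real handlers are not shown in this README.

```rust
#[derive(Debug, PartialEq)]
enum EtherType {
    Ipv4,
    Ipv6,
    Arp,
    Rarp,
}

/// Reads the two EtherType bytes at offset 12 of an untagged Ethernet frame.
/// Returns None for short frames and for protocols not handled here.
fn ether_type(frame: &[u8]) -> Option<EtherType> {
    let bytes = frame.get(12..14)?;
    match u16::from_be_bytes([bytes[0], bytes[1]]) {
        0x0800 => Some(EtherType::Ipv4),
        0x86DD => Some(EtherType::Ipv6),
        0x0806 => Some(EtherType::Arp),
        0x8035 => Some(EtherType::Rarp),
        _ => None,
    }
}

fn main() {
    // A zeroed 14-byte header with the EtherType set to IPv4.
    let mut frame = vec![0u8; 14];
    frame[12] = 0x08;
    frame[13] = 0x00;
    println!("{:?}", ether_type(&frame));
}
```

An Option maps naturally onto a nullable value in the table: Some becomes data, None becomes NULL.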

UPDATE 0.2.0: Chunking can be omitted completely when using PostgreSQL's COPY to transfer binary data instead of using INSERT. This is not only somewhat faster (not as much as I expected, unfortunately), but also takes quite a few lines of code less in the end. Only parsing from a network device still needs chunks. The other recent change is that only the non-NULL protocol data of a packet is serialized to JSON. Table insertions should be smaller this way.
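A rough sketch of "serialize only what is present": the struct below holds optional, already-serialized header fragments, and the function skips every None. The struct, the function and the field names other than ipv4_header are assumptions for illustration; the JSON is assembled by hand only to keep the example dependency-free, and this is not the project's serializer.

```rust
/// Hypothetical packet record; each header is an already-serialized
/// JSON fragment, or None if the protocol was not found in the packet.
struct Packet {
    ipv4_header: Option<String>,
    tcp_header: Option<String>,
    udp_header: Option<String>,
}

/// Emits a JSON object that contains only the headers that are present.
fn to_json(p: &Packet) -> String {
    let fields = [
        ("ipv4_header", p.ipv4_header.as_ref()),
        ("tcp_header", p.tcp_header.as_ref()),
        ("udp_header", p.udp_header.as_ref()),
    ];
    let mut parts = Vec::new();
    for &(name, value) in fields.iter() {
        if let Some(v) = value {
            parts.push(format!("\"{}\":{}", name, v));
        }
    }
    format!("{{{}}}", parts.join(","))
}

fn main() {
    let packet = Packet {
        ipv4_header: Some(r#"{"ttl":64}"#.to_string()),
        tcp_header: Some(r#"{"dst_port":443}"#.to_string()),
        udp_header: None,
    };
    // prints: {"ipv4_header":{"ttl":64},"tcp_header":{"dst_port":443}}
    println!("{}", to_json(&packet));
}
```

A query such as packet->>'ipv4_header' is not null keeps working with this layout, since PostgreSQL's ->> operator yields NULL for a missing key.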

Speaking of serialization: after profiling, it turns out that ~20% of CPU time is used for serialization to JSON. This, of course, could be saved completely.

Another subgoal was the ability to compile a static binary, which (last time I tested) works without dependencies except for libpcap itself. It even executes on Oracle Linux, after linking directly against the elf64 interpreter. If you have ever had the pleasure of using this derivative, that may come as a surprise to you. The key is to compile for the x86_64-unknown-linux-musl target. See: https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/musl-support-for-fully-static-binaries.html

Caveats: Regex syntax is limited at the moment, because the pattern is not compiled from a raw string but from an ordinary one. Escaping does not work properly; character classes do. I still have to figure out the syntactically correct way to get the pattern out of the JSON file and into a raw string. For the regular expression syntax already supported, see: https://docs.rs/regex/1.3.9/regex/#syntax , and also the example in parser.json. Transmitting all the data of the testing table layout described above results in a rather big table. HDD space has been no issue so far. Ingesting 30808676 TCP/IP packets taken from iCTF 2020 PCAPs results in 99.4GB of JSON data.

Gotchas: My test setup consists of a PostgreSQL database inside a Docker container. Main memory usage of said container is low (~300MB), but I had to set --oom-score-adj=999 to keep the container from being quit automatically. Setting --oom-kill-disable=true would turn the OOM killer off completely, I guess. I have not done any fine-tuning of this value yet. See: https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources for more details.
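On the regex caveat above, one observation that may help: in Rust a raw string is only a way of spelling a literal in source code, and the resulting value is an ordinary string. The sketch below shows that both spellings give the same characters. Whether this resolves the escaping problem depends on how parser.json is decoded, which is not shown here; the assumption is that the JSON decoder turns a doubled backslash in the file into a single one. The function names and the pattern are made up for the example.

```rust
/// The pattern written as a raw literal: backslashes are taken as-is.
fn raw_literal() -> &'static str {
    r"\d+\.\d+"
}

/// The same pattern written as an ordinary literal: each backslash is doubled.
fn escaped_literal() -> &'static str {
    "\\d+\\.\\d+"
}

fn main() {
    // Both literals produce the same eight characters at runtime.
    assert_eq!(raw_literal(), escaped_literal());
    println!("{} ({} chars)", raw_literal(), raw_literal().len());
}
```

Under that assumption, a pattern stored in JSON with doubled backslashes would arrive at the regex compiler in the same form as a raw literal, without any conversion step.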

If this whole thing turns out to be viable, some future features may be:

  • A database containing the file hash map, to compare file status/sizes after the parser may have crashed, or to join a complete overview of all existing PCAP files.
  • Concurrency. There are some interesting ways of parallelization I am working on, to find a model that really benefits the use case. MPSC looks promising at the moment. Implementing an MPSC pipe has the nice side effect of lower memory usage: parsed packets are piped directly to the JSON serialization function without being stored in a separate vector. That is: pcap from config -> parser (without vec usage) -> serializer -> insertion.
  • Updating the file hash map through the inotify crate during runtime.
  • Reassembly of fragmented IPv4 packets.
  • SIMD (via autovectorization), which is easy enough to do in Rust.
  • Support for more protocols.
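The MPSC idea from the concurrency item can be sketched with std::sync::mpsc: one thread stands in for the parser, a second one serializes, and the caller collects the rows for insertion. All names are hypothetical and the "packet" is just an id; this is a sketch of the pipeline shape, not the planned implementation.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a parsed packet.
struct Packet {
    id: u32,
}

/// parser -> serializer -> collected rows, connected by two channels.
fn run_pipeline(count: u32) -> Vec<String> {
    let (packet_tx, packet_rx) = mpsc::channel::<Packet>();
    let (json_tx, json_rx) = mpsc::channel::<String>();

    // Parser stage: sends packets one by one instead of filling a vector.
    let parser = thread::spawn(move || {
        for id in 0..count {
            if packet_tx.send(Packet { id }).is_err() {
                break; // receiver is gone
            }
        }
        // packet_tx is dropped here, which ends the serializer's loop.
    });

    // Serializer stage: turns each packet into a JSON string.
    let serializer = thread::spawn(move || {
        for packet in packet_rx {
            let json = format!("{{\"id\":{}}}", packet.id);
            if json_tx.send(json).is_err() {
                break;
            }
        }
    });

    // Insertion stage (here: just collect until the channel closes).
    let rows: Vec<String> = json_rx.iter().collect();
    parser.join().expect("parser thread panicked");
    serializer.join().expect("serializer thread panicked");
    rows
}

fn main() {
    for row in run_pipeline(3) {
        println!("{}", row);
    }
}
```

mpsc::channel is unbounded, so a fast parser can still queue up many packets; mpsc::sync_channel with a fixed bound would cap how many are in flight at once.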

There are many other things left to be desired.

Benchmarking was done with the same file that was used for the previous C implementation. Inserting non-chunked data resulted in ~20 minutes of querying the database. Now, with chunked data, it takes less than 12 seconds after compiler optimization.

Speaking of optimization: do yourself a favor and run release code, not debug code: cargo run --release. The compiler does rather hefty optimization, and you will save some time waiting for your precious data to be inserted. I did no further optimization besides trying to enable the compiler to do a better job. Just black-boxing, no assembly tweaking yet.