# This is experimental
The software is written in Rust (2018 edition, safe code only). At the moment I am having fun writing Rust and testing language features. The code should be modular enough that you can replace any function you deem awful.

Error handling is subpar at the moment. There has been no real unit testing to speak of since the switch to asynchronous functionality. Testing will come back.

This version is the successor of the `_POSIX_C_SOURCE 200809L` implementation, in which all data parsed from pcap/pcapng files is written as one single, simple query. Ingestion is rather fast (tested writes: 100*10^3 TCP packets in ~1.8 s, roughly 55,000 packets per second) but may be insecure. See the other repository for more information.

The idea of this iteration is to use a prepared statement and to chunk the data according to the maximum input size. PostgreSQL imposes an upper limit on each insert query issued through a prepared statement. The chunk size is set as `insert_max` in the config/interface file `parser.json`. Data can be read from PCAP/PCAPNG files as well as from network devices.
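
The chunking idea can be sketched with the standard library alone. This is a minimal illustration, not the code from this repository: `Row`, `chunk_rows` and `placeholders` are hypothetical names, and a row is assumed to be one serialized string bound to one statement parameter.

```rust
/// Hypothetical stand-in for one serialized packet row.
pub type Row = String;

/// Splits serialized rows into batches of at most `insert_max` entries,
/// so that each batch fits into one prepared insert statement.
pub fn chunk_rows(rows: &[Row], insert_max: usize) -> Vec<&[Row]> {
    // `chunks` panics on a chunk size of 0, so reject a bad config value early.
    assert!(insert_max > 0, "insert_max must be greater than zero");
    rows.chunks(insert_max).collect()
}

/// Builds the placeholder list for one batch, e.g. "($1), ($2), ($3)".
pub fn placeholders(batch_len: usize) -> String {
    (1..=batch_len)
        .map(|i| format!("(${})", i))
        .collect::<Vec<_>>()
        .join(", ")
}
```

With five rows and an `insert_max` of 2, `chunk_rows` yields batches of 2, 2 and 1 rows. Only the last batch can be shorter than `insert_max`, so at most two distinct statements would have to be prepared.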

**UPDATE 0.2.0**: Chunking can be omitted completely by using PostgreSQL's `COPY` to transfer binary data instead of `INSERT`. This is not only somewhat faster, it also takes quite a few lines of code less. Only parsing from a network device still needs chunks.
The other recent change is that only non-NULL protocol data of a packet is serialized to JSON. The inserted rows should be smaller this way.
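
To illustrate what skipping absent protocol data means for the JSON output, here is a hand-rolled sketch. It is an assumption about the general shape, not the actual serializer: the `Packet` struct and the field names `tcp_header` and `udp_header` are hypothetical, and values are assumed to need no JSON escaping.

```rust
/// Hypothetical, heavily simplified packet record; the real `QryData`
/// layout is not shown in this README.
pub struct Packet {
    pub ipv4_header: Option<String>,
    pub tcp_header: Option<String>,
    pub udp_header: Option<String>,
}

/// Serializes only the layers that are present, so absent protocols
/// take up no space in the resulting JSON object.
/// Values are assumed to contain no quotes or backslashes;
/// no JSON escaping is performed.
pub fn to_json(p: &Packet) -> String {
    let fields = [
        ("ipv4_header", p.ipv4_header.as_ref()),
        ("tcp_header", p.tcp_header.as_ref()),
        ("udp_header", p.udp_header.as_ref()),
    ];
    let mut body = Vec::new();
    for (name, value) in fields.iter() {
        // `None` layers are skipped instead of being written as `null`.
        if let Some(v) = value {
            body.push(format!("\"{}\":\"{}\"", name, v));
        }
    }
    format!("{{{}}}", body.join(","))
}
```

A packet without a TCP layer then produces an object that has no `tcp_header` key at all, rather than one with `"tcp_header":null`.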

The process is as follows:
- Choose between a network device (which must be specified) and file input.
- Choosing a device is straightforward: data gets parsed and chunked, and queries are prepared according to the `insert_max` size.
- The encapsulation type (link type) is chosen beforehand. Currently Ethernet and RawIp are supported.
- Choosing file input means selecting a directory where your PCAP/PCAPNG files reside.
- A hash map of key (path) : value (metadata) is created from the pcap files found in the specified directory.
- The parser gets invoked, which in turn calls the appropriate protocol handler on the byte data of the packets yielded by pcap. A vector of type `QryData` is returned after EOF has been hit.
- The `QryData` vector is serialized.
- The serialized data gets chunked.
- Statements are prepared according to the chunk size.
- The chunks are then sent to the database.
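
The file-input step that builds the path-to-metadata map could look roughly like this. It is a sketch under assumptions: `map_pcap_dir` is a hypothetical name, files are selected by extension only, and subdirectories are not searched.

```rust
use std::collections::HashMap;
use std::fs::{self, Metadata};
use std::io;
use std::path::{Path, PathBuf};

/// Collects every `.pcap` / `.pcapng` file directly inside `dir`
/// into a map of path -> file metadata (size, modification time, ...).
pub fn map_pcap_dir(dir: &Path) -> io::Result<HashMap<PathBuf, Metadata>> {
    let mut files = HashMap::new();
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let path = entry.path();
        // Assumption: match on the file extension only. Checking the
        // capture file's magic number would be more robust.
        let is_pcap = match path.extension().and_then(|e| e.to_str()) {
            Some("pcap") | Some("pcapng") => true,
            _ => false,
        };
        if is_pcap && path.is_file() {
            files.insert(path, entry.metadata()?);
        }
    }
    Ok(files)
}
```

Keeping the metadata as the map value makes it cheap to compare file sizes or modification times on a later run.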

Currently, Ethernet, IPv4, IPv6, TCP, UDP and ARP/RARP are handled.
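
As an illustration of how a protocol handler can be selected, the sketch below reads the EtherType field of an untagged Ethernet II frame and maps it to the network-layer protocols listed above. The names `NetworkLayer` and `classify_ethernet` are hypothetical; the EtherType values are the standard assigned ones.

```rust
/// Network-layer protocols the README lists as handled.
#[derive(Debug, PartialEq)]
pub enum NetworkLayer {
    Ipv4,
    Ipv6,
    Arp,
    Rarp,
    /// Anything else, carrying the raw EtherType value.
    Unknown(u16),
}

/// Reads the EtherType from an untagged Ethernet II frame.
/// Returns `None` if the frame is shorter than the 14-byte header.
pub fn classify_ethernet(frame: &[u8]) -> Option<NetworkLayer> {
    // Bytes 0..6: destination MAC, 6..12: source MAC, 12..14: EtherType.
    let ethertype = u16::from_be_bytes([*frame.get(12)?, *frame.get(13)?]);
    Some(match ethertype {
        0x0800 => NetworkLayer::Ipv4,
        0x86DD => NetworkLayer::Ipv6,
        0x0806 => NetworkLayer::Arp,
        0x8035 => NetworkLayer::Rarp,
        other => NetworkLayer::Unknown(other),
    })
}
```

VLAN-tagged frames (EtherType `0x8100`) end up in `Unknown` here; handling them would mean skipping the 4-byte tag and reading the EtherType that follows it.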

For testing purposes, the table consists of serialized JSON. The table layout is somewhat "dynamic": any protocol not recognized in a parsed packet is marked as NULL in the resulting table row.

A query may look like this: `select packet from json_dump where packet->>'ipv4_header' is not null;`

Speaking of serialization: profiling shows that ~20% of CPU time is spent on serialization to JSON. That time could, of course, be saved completely.

Another subgoal was the ability to compile a static binary, which (last time I tested) works without any dependencies except libpcap itself. It even executes on Oracle Linux after linking directly against the ELF64 interpreter. If you have ever had the pleasure of using this derivative, that may come as a surprise to you. The key is to compile for the `x86_64-unknown-linux-musl` target. See: https://doc.rust-lang.org/edition-guide/rust-2018/platform-and-target-support/musl-support-for-fully-static-binaries.html

Caveats: Regex syntax is limited at the moment, because the pattern is not compiled from a raw string but from an ordinary one. Escaping does not work properly; character classes do. I still have to figure out the syntactically correct way to get the pattern out of the JSON file and into a raw string. For the regular expression syntax that is already supported see: https://docs.rs/regex/1.3.9/regex/#syntax and the example in `parser.json`.
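
One detail that may help with the escaping problem: raw strings exist only in Rust source code. At run time a raw literal and an escaped ordinary literal are the same `&str`, as the sketch below shows (`literals_match` is a hypothetical name). A pattern read from a file at run time never passes through Rust's literal escaping. JSON, however, uses the backslash for its own escapes, so a pattern such as `\d+` has to be written as `"\\d+"` in the JSON file. Whether this accounts for the behaviour described above is an assumption, since the parsing code is not shown here.

```rust
/// Returns true if a raw literal and an escaped ordinary literal
/// denote the same pattern text.
pub fn literals_match() -> bool {
    // Raw literal: backslashes are taken as-is.
    let raw = r"\d+\.\d+";
    // Ordinary literal: every backslash has to be doubled.
    let escaped = "\\d+\\.\\d+";
    raw == escaped
}
```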

Transmitting all the data in the testing table layout described above results in a rather big table. HDD space has not been an issue so far. Ingesting 30808676 TCP/IP packets taken from iCTF 2020 PCAPs results in 99.4 GB of JSON data.

Gotchas: My test setup consists of a PostgreSQL database inside a Docker container. Main memory usage of said container is low (~300 MB), but I had to set `--oom-score-adj=999` to keep the container from being killed automatically (on Linux a higher `oom_score_adj` normally makes a process more likely to be picked by the OOM killer, so this value deserves a second look). `--oom-kill-disable` would turn the OOM killer off completely, I guess. I have not fine-tuned this value yet. See: https://docs.docker.com/engine/reference/run/#runtime-constraints-on-resources for more details.

If this whole thing turns out to be viable, some future features may be:
- A database containing the file hash map, to compare file status/sizes after the parser may have crashed, or to join a complete overview of all existing PCAP files.
- Concurrency. There are some interesting approaches to parallelization, and I am working on finding a model that really benefits the use case. MPSC looks promising at the moment. Implementing an MPSC pipe has the nice side effect of lower memory usage: parsed packets are piped directly to the JSON serialization function without being stored in a separate vector. That is: pcap from config -> parser (without vec usage) -> serializer -> insertion.
- Updating the file hash map at runtime through the inotify crate.
- Reassembly of fragmented IPv4 packets.
- SIMD (via autovectorization), which is easy enough to do in Rust.
- Support for more protocols.
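
The MPSC idea from the concurrency item can be sketched with `std::sync::mpsc`. This is only an illustration of the pattern, not the planned implementation: `ParsedPacket`, `serialize` and `run_pipeline` are hypothetical names, the parser is simulated, and the channel capacity of 64 is an arbitrary choice.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for a parsed packet.
pub struct ParsedPacket {
    pub id: u32,
    pub payload_len: usize,
}

/// Stand-in for the JSON serializer.
fn serialize(p: &ParsedPacket) -> String {
    format!("{{\"id\":{},\"payload_len\":{}}}", p.id, p.payload_len)
}

/// Runs a simulated parser and the serializer as two pipeline stages
/// connected by a channel. Returns the rows in the order they were sent.
pub fn run_pipeline(packet_count: u32) -> Vec<String> {
    // Bounded channel: the parser blocks once 64 packets are in flight,
    // so the whole capture is never buffered in memory at once.
    let (tx, rx) = mpsc::sync_channel::<ParsedPacket>(64);

    let parser = thread::spawn(move || {
        for id in 0..packet_count {
            // A real parser would read the next packet from pcap here.
            let packet = ParsedPacket { id, payload_len: 64 };
            if tx.send(packet).is_err() {
                break; // the receiving side hung up
            }
        }
        // `tx` is dropped when the closure ends, which closes the channel.
    });

    // Serialize each packet as soon as it arrives. The rows are collected
    // here only to make the result observable; a real pipeline would hand
    // each row to the insertion stage instead.
    let rows: Vec<String> = rx.iter().map(|p| serialize(&p)).collect();
    parser.join().expect("parser thread panicked");
    rows
}
```

The sender can be cloned to attach several parser threads, for example one per PCAP file, to the same serializer. That is the "multiple producer" part of MPSC.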

There are many other things left to be desired.

Benchmarking was done with the same file that was used for the previous C implementation. Inserting non-chunked data resulted in ~20 minutes of querying the database. Now, inserting chunked data takes less than 12 seconds after compiler optimization.

Speaking of optimization: Do yourself a favor and run release code, not debug code: `cargo run --release`. The compiler optimizes rather heavily and you will save some time waiting for your precious data to be inserted. I did no further optimization besides trying to enable the compiler to do a better job. Just black-boxing, no assembly tweaking yet.