I would love to know how to reach a development community that supports long-term discussion, where I can talk through ideas I'm working on for various open source projects. I feel like I can work on something in a vacuum for months, make a post somewhere about it, and get people to look at it for a few minutes, and that's about it. It would be much nicer to casually discuss architecture ideas before I commit to something.
So my current project is an enhanced tar utility. It grew out of "tarcrypt", which takes an inbound tar file (created by something like GNU tar, though other formats should hopefully work too) and adds compressed RSA/AES encryption to individual files while maintaining the overall tar structure (https://www.snebu.com/tarcrypt). The purpose was to add encryption capabilities to Snebu backup (which I posted here previously), which uses tar as a serialization format to collect files (that way no client agent needs to be deployed).
I'm turning Tarcrypt into a standalone tar utility so I can add a few additional features that one of my tar extensions enables. A tar file consists of a 512-byte header containing all the metadata of the file, including its length, followed by the file contents in successive 512-byte blocks. Since tar is a streaming format, this means you need to know how long the file is at the time you write the header. So if you encrypt, you can't compress first unless you write out to a temp file, then write the header followed by the compressed/encrypted file contents.
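For reference, here's roughly what that 512-byte header looks like (a sketch following the POSIX ustar layout; note the size field is ASCII octal, which is why the length has to be known before the header goes out):

```c
/* Sketch of the POSIX ustar header (512 bytes on disk, all char
 * fields, so no alignment padding). Field names follow the ustar spec. */
#include <assert.h>   /* for the quick checks below */
#include <stddef.h>

struct ustar_header {
    char name[100];
    char mode[8];
    char uid[8];
    char gid[8];
    char size[12];      /* file length, ASCII octal, NUL/space terminated */
    char mtime[12];
    char chksum[8];
    char typeflag;
    char linkname[100];
    char magic[6];      /* "ustar\0" */
    char version[2];
    char uname[32];
    char gname[32];
    char devmajor[8];
    char devminor[8];
    char prefix[155];
    char pad[12];       /* rounds the header out to 512 bytes */
};

/* Parse an ASCII octal field such as size (stops at first non-octal char). */
static unsigned long long parse_octal(const char *p, size_t n)
{
    unsigned long long v = 0;
    for (size_t i = 0; i < n && p[i] >= '0' && p[i] <= '7'; i++)
        v = v * 8 + (unsigned long long)(p[i] - '0');
    return v;
}

/* Number of 512-byte data blocks that follow the header. */
static unsigned long long data_blocks(unsigned long long size)
{
    return (size + 511) / 512;
}
```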
The way I solved this is to turn the file name into a directory name, with successive segments sequentially numbered within that directory. That way compression/encryption can be done streaming into a RAM buffer (say, a 10 MB buffer); when that buffer fills up, write out a header followed by that segment. The last segment carries a marker indicating it is the final segment of the file. The additional metadata this requires is stored in PAX headers (a POSIX tar extension that allows unlimited key-value pairs to be associated with a logical tar file entry).
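To make the segmenting concrete, here's a minimal sketch of the bookkeeping involved. The nine-digit numbering, helper names, and the zero-length-file choice are illustrative assumptions, not the actual tarcrypt scheme:

```c
/* Hypothetical helpers for the file-becomes-directory segment scheme:
 * each flushed buffer is written as "<name>/<sequence number>". */
#include <assert.h>   /* for the quick checks below */
#include <stdio.h>
#include <string.h>

#define SEG_BUF_SIZE 10485760ULL   /* the 10 MB in-RAM segment buffer */

/* Build the member name for segment `seq` of file `name`.
 * Nine-digit zero padding is an arbitrary illustrative choice. */
char *segment_name(char *out, size_t outlen, const char *name, unsigned seq)
{
    snprintf(out, outlen, "%s/%09u", name, seq);
    return out;
}

/* How many segments a compressed stream of n bytes produces. */
unsigned long long segment_count(unsigned long long n)
{
    if (n == 0)
        return 1;   /* design choice: a zero-length file still gets one
                       (empty, last-marker) segment */
    return (n + SEG_BUF_SIZE - 1) / SEG_BUF_SIZE;
}
```

In the real format, the "this is the last segment" marker and the original logical file name would live in the PAX key-value records attached to each segment entry.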
In addition, using the multi-segment extension I've developed, I can now have one-pass sparse file support (currently, sparse file processing requires two passes to detect the "holes" in a file, although the first pass can be sped up on filesystems that support SEEK_HOLE and SEEK_DATA).
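The hole-detection part looks something like this (a sketch using the Linux/Solaris lseek(2) SEEK_DATA/SEEK_HOLE extensions; with the multi-segment format each data extent can be emitted as its own segment as it's discovered, rather than mapped up front):

```c
#define _GNU_SOURCE   /* for SEEK_DATA / SEEK_HOLE on glibc */
#include <assert.h>   /* for the quick checks below */
#include <fcntl.h>
#include <unistd.h>

/* Walk the data extents of fd using SEEK_DATA/SEEK_HOLE, invoking cb
 * with each (offset, length) pair. Returns the extent count, or -1 on
 * error. On filesystems without hole support, the kernel reports the
 * whole file as a single data extent, so this still works (one pass,
 * just with no holes found). lseek returning -1 at/after EOF (ENXIO)
 * is treated as normal termination in this sketch. */
long walk_data_extents(int fd,
                       void (*cb)(off_t off, off_t len, void *arg),
                       void *arg)
{
    long count = 0;
    off_t end = lseek(fd, 0, SEEK_END);
    off_t data = lseek(fd, 0, SEEK_DATA);

    while (data >= 0 && data < end) {
        off_t hole = lseek(fd, data, SEEK_HOLE);  /* end of this extent */
        if (hole < 0)
            return -1;
        if (cb)
            cb(data, hole - data, arg);
        count++;
        data = lseek(fd, hole, SEEK_DATA);        /* next extent, or ENXIO */
    }
    return count;
}
```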
My final improvement would be to append an index at the end of the tar file. The format calls for two 512-byte null blocks to signal the end of a tar, and most tar utilities stop processing there. So you can append additional information after that point, such as an index of the byte position of each file, with the last 8 bytes of the last block being a pointer back to the starting byte of the index. And if the overall file is compressed (instead of just the individual file entries), then with a block-based compression method the index could start on a compression block boundary and contain a mapping to the beginning of the compression block that precedes each logical file header.
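Reading that trailing pointer would be trivial: seek to end-of-file minus 8, decode the offset, then seek back to the index. The fixed-width big-endian encoding below is an assumption for illustration, since the actual on-disk byte order is one of the decisions still open:

```c
/* Hypothetical encoding of the trailing index pointer: the last 8
 * bytes of the final block hold the byte offset of the index start.
 * Big-endian (network order) is an assumed design choice here. */
#include <assert.h>   /* for the quick checks below */
#include <stdint.h>

void encode_index_ptr(unsigned char p[8], uint64_t off)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(off & 0xff);
        off >>= 8;
    }
}

uint64_t decode_index_ptr(const unsigned char p[8])
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}
```

A reader would only trust this pointer after checking that the bytes at the decoded offset actually look like a valid index record, since an unaware tar implementation may have truncated or padded past the null blocks.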
Now as you can see, there are a number of decisions I have had to make (and still need to make), which is where it would be nice if there were still something like a comp.unix.programming group I could drop into (Reddit threads are too ephemeral). Maybe I could drop in on the GNU tar list? I've seen similar discussions there in the past. (I'd really like to see my improvements make it into GNU tar as well, though I'll still be coding my own implementation for other purposes.)