Hello,
while working on the Lumberjack message processing "pipeline", a design question has come up that is probably a known issue for logging in general, so I'd like to ask before we start implementing.
What should we do about "trusted" properties that come from untrusted input sources?
In particular, we are aiming at an rsyslog configuration where anything passed to /dev/log is automatically annotated with "pid", "uid" and other fields; at that point the information is 100% reliable.
Similarly, syslog messages can come from remote hosts over TCP, and if those hosts are similarly configured by the same administrators, the data is reliable as well.
OTOH, if syslog messages come from hosts that are not under the same control (e.g. data coming from user-administered machines), this data is not "reliable". What should we do about it?
We have come up with three possibilities:
a) Just leave the fields in, and let the user filter the data out based on other properties (e.g. host name) at the time of log searching (or just let the user notice when reading the log message ad hoc) - some queries and statistics may have misleading results.
b) Delete the untrusted fields - drops data.
c) Somehow mark the fields as "untrusted" - preserves all data, and allows queries that ignore it.
c1) Is it really any better than just filtering on host names as in a)?
c2) How can we mark the fields as "untrusted", in particular in the context of Lumberjack?
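To make c) a bit more concrete, here is a minimal sketch in Python of one way the marking could work: fields from hosts outside a trusted set are renamed with a prefix. Everything in it is an assumption made for illustration (the TRUSTED_HOSTS set, the SENSITIVE_FIELDS list, the "untrusted_" prefix, the host names, and the flat dictionary event layout); none of it is taken from Lumberjack or rsyslog.

```python
import json

# Hypothetical names: TRUSTED_HOSTS, SENSITIVE_FIELDS and the "untrusted_"
# prefix are illustrative only, not part of Lumberjack or rsyslog.
TRUSTED_HOSTS = {"localhost", "log1.example.com"}
SENSITIVE_FIELDS = ("pid", "uid", "gid")


def mark_untrusted(event, trusted_hosts=TRUSTED_HOSTS):
    """Return a copy of event; rename sensitive fields if the host is not trusted."""
    if event.get("host") in trusted_hosts:
        return dict(event)
    marked = {}
    for key, value in event.items():
        if key in SENSITIVE_FIELDS:
            # Keep the value, but under a name that ordinary queries will not match.
            marked["untrusted_" + key] = value
        else:
            marked[key] = value
    return marked


# Example events (made up): one from a user-administered host, one local.
remote = {"host": "laptop.example.org", "msg": "login ok", "pid": 1, "uid": 0}
local = {"host": "localhost", "msg": "login ok", "pid": 4242, "uid": 1000}

print(json.dumps(mark_untrusted(remote), sort_keys=True))
print(json.dumps(mark_untrusted(local), sort_keys=True))
```

With a scheme like this, a query on "uid" would only ever match values from trusted sources, without every query having to repeat the host-name filter from a). Whether that is worth the extra field names is exactly the c1) question.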
I'm leaning towards a) as the UNIX-tradition "worse is better" solution, but I really don't have that much experience in logging, so any feedback would be appreciated.
Mirek