So I learned at Flock that fedmsg is going to be replaced?
Anyway, it seems there is an idea to create schemas for the messages and distribute them in packages? And those Python packages need to be present on the producer as well as the consumer?
JSON schemas
Message bodies are JSON objects that adhere to a schema. Message schemas live in their own Python package, so they can be installed on the producer and on the consumer.
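For illustration, checking a body against such a schema might look roughly like this minimal, hand-rolled sketch (a real deployment would use a proper JSON Schema validator such as the `jsonschema` package; the field names and values here are invented):

```python
import json

# Hypothetical schema: each required field mapped to its expected type.
BUILD_COMPLETE_SCHEMA = {
    "package": str,   # name of the package that was built
    "status": str,    # e.g. "success" or "failure"
}

def check_body(body, schema):
    """Return True if every schema field is present with the right type."""
    return all(
        key in body and isinstance(body[key], expected)
        for key, expected in schema.items()
    )

body = json.loads('{"package": "python-requests", "status": "success"}')
assert check_body(body, BUILD_COMPLETE_SCHEMA)
# A body missing a required field fails the check.
assert not check_body({"package": "python-requests"}, BUILD_COMPLETE_SCHEMA)
```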
Could we instead just send the message schemas together with the message content always?
I would like to be able to parse any message I receive without additional packages installed. If I am about to start listening to a new message type, I don't want to spend time looking up what I should install to make it work. It should just work. Requiring packages with schemas to be installed on the consumer, and having the producer maintain them, does not seem like a great idea, mainly because one of the requirements raised for fedmsg was that it be a generic messaging framework easily usable outside of Fedora Infrastructure. We should make it easy for anyone outside to listen to and understand our messages so that they can react to them. Needing to have Python packages installed (how are they going to be distributed, PyPI + Fedora?) seems like an unnecessary hassle. So can we send a schema with each message, as documentation and validation of the message itself?
a) it will make our life easier
b) it will allow people outside of Fedora (who, e.g., don't tend to use Python) to consume our messages easily
c) what if I am writing a Ruby app, not a Python app? Do I then need to provide a Ruby schema as well as a Python schema? We should only need to write the consumer and producer parts in different languages. The message schemas should not be bound to a particular language; otherwise we are just creating more work for ourselves when somebody wants to use the messaging system in a language other than Python.
clime
On 08/13/2018 10:20 PM, Michal Novotny wrote:
So I learned at Flock that fedmsg is going to be replaced?
Anyway, it seems there is an idea to create schemas for the messages and distribute them in packages? And those Python packages need to be present on the producer as well as the consumer?
JSON schemas
Message bodies are JSON objects that adhere to a schema. Message schemas live in their own Python package, so they can be installed on the producer and on the consumer.
Could we instead just send the message schemas together with the message content always?
I considered this early on, but it seemed to me it didn't solve all the problems I wanted solved. Those problems are:
1. Make catching accidental schema changes as a publisher easy.
2. Make catching mis-behaving publishers on the consuming side easy.
3. Make changing the schema a painless process for publishers and consumers.
Doing this would solve #1, but #2 and #3 are still a problem. As a consumer, I can validate the JSON in a message matches the JSON schema in the same message, but what does that get me? It doesn't seem any different (on the consumer side) than just parsing the JSON outright and trying to access whatever deserialized object I get.
In the current proposal, consumers don't interact with the JSON at all, but with a higher-level Python API that gives publishers flexibility when altering their on-the-wire format.
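As a sketch of that idea (not the proposal's actual API; the class and field names here are invented), a thin wrapper can keep a consumer working across a wire-format change:

```python
# A Message wrapper that shields consumers from the on-the-wire format:
# the property keeps working even if the raw JSON key is renamed.

class BuildMessage:
    def __init__(self, body):
        self._body = body  # the deserialized JSON dict

    @property
    def package(self):
        # Support both a hypothetical new key and the old one transparently.
        if "package_name" in self._body:
            return self._body["package_name"]
        return self._body["package"]

old = BuildMessage({"package": "kernel"})        # old wire format
new = BuildMessage({"package_name": "kernel"})   # new wire format
assert old.package == new.package == "kernel"
```

Consumers using `msg.package` never notice the rename; only the schema package is updated.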
I would like to be able to parse any message I receive without additional packages installed. If I am about to start listening to a new message type, I don't want to spend time looking up what I should install to make it work. It should just work. Requiring packages with schemas to be installed on the consumer, and having the producer maintain them, does not seem like a great idea, mainly because one of the requirements raised for fedmsg was that it be a generic messaging framework easily usable outside of Fedora Infrastructure. We should make it easy for anyone outside to listen to and understand our messages so that they can react to them. Needing to have Python packages installed (how are they going to be distributed, PyPI + Fedora?) seems like an unnecessary hassle. So can we send a schema with each message, as documentation and validation of the message itself?
You can parse any message you receive without anything beyond a JSON parsing library. You can do that now and you'll be able to do that after the move. The problem with that is the JSON format might change. The schema alone doesn't solve the problem of changing formats, it just clearly documents what the message used to be and what it is now.
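A small illustration of why raw parsing is fragile when the format changes (the field names and the rename are invented):

```python
import json

old_wire = '{"package": "kernel", "status": "success"}'
# Hypothetical rename of the "package" key by the publisher:
new_wire = '{"package_name": "kernel", "status": "success"}'

def consumer(raw):
    # Naive consumer that hard-codes the current key name.
    return json.loads(raw)["package"]

assert consumer(old_wire) == "kernel"
try:
    consumer(new_wire)
except KeyError:
    pass  # the renamed key breaks this consumer; the schema alone only documents the change
```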
I'd love for this to just work and I'm up for any suggestions to make it easier, but I do think we need to make sure any solution covers the three problems stated above.
Finally, I do not want to create a generic messaging framework. I want something small that makes a generic messaging framework very easy to use for Fedora infrastructure specifically. I'm happy to help develop a generic framework (like Pika) when necessary, but I don't want to be in the business of authoring and maintaining a generic framework.
a) it will make our life easier
b) it will allow people outside of Fedora (who, e.g., don't tend to use Python) to consume our messages easily
c) what if I am writing a Ruby app, not a Python app? Do I then need to provide a Ruby schema as well as a Python schema? We should only need to write the consumer and producer parts in different languages. The message schemas should not be bound to a particular language; otherwise we are just creating more work for ourselves when somebody wants to use the messaging system in a language other than Python.
I agree, and that's why I chose json-schema. A different language just needs to wrap the schema in accessor functions. An alternative (and something I wanted to propose longer term, after the ZMQ->AMQP transition) is to use something like protocol buffers rather than JSON. The advantages there are a simplified schema format, a general push toward backwards compatibility (thus reducing the need for a higher-level API), and auto-generated object wrappers in many languages. You still potentially need to implement wrappers for access if you change the schema in a way that isn't additive, though.
You may notice (and it's not an accident) that the recommended implementation of a Message produces an API that is very similar to the one produced by a Python object generated by protocol buffers. This makes it possible to quietly change to protocol buffers without breaking consumers, assuming they're not digging into the JSON. I'm not saying we'll definitely do that, but it is still on the table and a transition _should_ be easy.
The big problem is that right now the majority of messages are not formatted in a way that makes sense and really need to be changed to be simple, flat structures that contain the information services need and nothing they don't. I'd like to get those fixed in a way that doesn't require massive coordinated changes in apps.
Anyway, to summarize, I really really want this to be super easy to use and just work. I hope we can improve it further and I'd love to hear your thoughts. Do you think my problem statements and design goals are reasonable? Given those, do you still feel like sending the schema along is worthwhile?
Anyway, to summarize, I really really want this to be super easy to use
and just work. I hope we can improve it further and I'd love to hear your thoughts. Do you think my problem statements and design goals are reasonable? Given those, do you still feel like sending the schema along is worthwhile?
I actually no longer think it is worthwhile.
As a consumer, I can validate the JSON in a message matches the JSON
schema in the same message, but what does that get me? It doesn't seem any different (on the consumer side) than just parsing the JSON outright and trying to access whatever deserialized object I get.
I completely agree with this.
Let's go through the problems you mentioned:
1. Make catching accidental schema changes as a publisher easy.
So we can solve this by registering the schema with the publisher before any content gets published; based on the schema, the publisher instance can check whether the content intended to be sent conforms, which could catch some bugs before the content is actually sent. If we require this on the publisher side, then there is actually no reason to send the schema alongside the content, because the check has already been done, so the consumer already knows the message is alright when it is received. What should be sent, however, is a schema ID, e.g. just a natural number. The schema ID can then be used to version the schema, which would be available somewhere publicly, e.g. in the service docs, the same way GitHub/GitLab/etc. publish the structure of their webhook messages. It would basically be part of the public API of a service.
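The registration idea above could be sketched like this (the function names, topic, and fields are all hypothetical, and the actual wire send is omitted):

```python
# Publisher-side sketch: register a schema under an integer ID, validate
# every body before sending, and stamp the ID onto the message so
# consumers know which published schema it follows.

REGISTERED_SCHEMAS = {}

def register_schema(schema_id, required_fields):
    REGISTERED_SCHEMAS[schema_id] = required_fields

def publish(topic, body, schema_id):
    required = REGISTERED_SCHEMAS[schema_id]
    missing = [f for f in required if f not in body]
    if missing:
        # Catch the bug here, before anything reaches the broker.
        raise ValueError(f"body is missing fields: {missing}")
    body["schema_id"] = schema_id
    return (topic, body)  # a real implementation would send this on the wire

register_schema(1, ["package", "status"])
topic, sent = publish("copr.build.end", {"package": "foo", "status": "ok"}, 1)
assert sent["schema_id"] == 1
```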
2. Make catching mis-behaving publishers on the consuming side easy.
By checking against the schema on the publisher side, this shouldn't be necessary. If someone somehow bypasses the publisher check, at worst the message won't be parsable, depending on how the message is being parsed. If someone really wants to make sure the message is what it is supposed to be, they can integrate the schema published on the service site into their parsing logic, but I don't think that's necessary (I personally wouldn't do it in my code).
3. Make changing the schema a painless process for publishers and consumers.
I think the only way to do this is to send both content types simultaneously for some time, each message being marked with its schema ID. It would be good if the consumer always specified what schema ID it wants to consume. If a higher schema ID is available in the message, a warning could be printed, maybe even to syslog, so that consumers get the information. At the same time, it should be communicated on the service site or by other means. I don't think it is possible to make it any better than this.
I fail to see the point of packaging the schemas. If the message content is JSON, then after receiving the message I would like to be able to just call json.loads(msg) and work with the resulting structure as I am used to.
Actually, what I would do in Python is make it a munch and then work with it. Needing to install an additional package and instantiate high-level objects seems clumsy to me in comparison.
In other programming languages, this procedure would be pretty much the same, I believe, as they all probably provide some JSON implementation.
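The consuming style described above, sketched in Python (`munch` is a third-party package; the standard library's `types.SimpleNamespace` gives similar attribute access and keeps the sketch self-contained; the message content is invented):

```python
import json
from types import SimpleNamespace

raw = '{"package": "python-copr", "status": "success"}'

# Equivalent in spirit to: msg = munch.munchify(json.loads(raw))
msg = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))

assert msg.package == "python-copr"
assert msg.status == "success"
```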
You mentioned:
In the current proposal, consumers don't interact with the JSON at all,
but with a higher-level Python API that gives publishers flexibility when altering their on-the-wire format.
Yes, but with the current proposal, if I change the on-the-wire API, I need to make a new version of the schema, package it, somehow get it to consumers, and make them use the correct version that parses the new on-the-wire format and translates it correctly into what the consumers are used to consuming? That seems like something very difficult to get done.
And also, I don't quite see the point. I wouldn't alter the on-the-wire format if it is not actually what users work with and if I had to go through all the steps described above.
If I need to alter the on-the-wire format because the application logic has somehow changed, then I would want to make the changes in the high-level API as well, so again there is no gain, just more work packaging new schemas.
My main point here is that packaging the schemas to provide high-level objects seems redundant. I think lots of people would welcome working with something really simple that is already provided in the language's standard library.
For Python, if I had to install and import just a single messaging library, say what hub, topic, and schema ID I want to listen to, and then consume the incoming messages immediately as munches, I would be super happy.
Actually, the schema ID might be redundant as well; it could just be made part of the topic somehow, in which case the producer would produce the content twice on a schema change, at least for some time. A "Deprecated by <topic>" flag on an incoming message would be nice then. Of course, the producer would need to register the two schemas and mark one of them as deprecated. The framework would then send the two messages simultaneously on their behalf. This might be an even easier solution to the problem. The exact publisher (producer) interface would need to be thought through.
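The dual-publishing idea might look roughly like this (topic names, field names, and the publish mechanism are all hypothetical):

```python
# During a migration, send each event on both the old and the new topic,
# marking the old-format copy with a "deprecated_by" field.

def publish_event(event, old_body_fn, new_body_fn):
    sent = []
    # New-format message on the new topic.
    sent.append(("copr.build.end.v2", new_body_fn(event)))
    # Old-format message, flagged so consumers can see it is going away.
    old_body = old_body_fn(event)
    old_body["deprecated_by"] = "copr.build.end.v2"
    sent.append(("copr.build.end", old_body))
    return sent  # a real framework would put these on the wire

event = {"pkg": "foo", "ok": True}
messages = publish_event(
    event,
    old_body_fn=lambda e: {"package": e["pkg"], "success": e["ok"]},
    new_body_fn=lambda e: {"package": e["pkg"],
                           "status": "success" if e["ok"] else "failure"},
)
assert messages[1][1]["deprecated_by"] == "copr.build.end.v2"
```

Consumers on the old topic keep working and see the deprecation flag; new consumers subscribe to the versioned topic directly.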
The big problem is that right now the majority of messages are not
formatted in a way that makes sense and really need to be changed to be simple, flat structures that contain the information services need and nothing they don't. I'd like to get those fixed in a way that doesn't require massive coordinated changes in apps.
In Copr, for example, we are taking this as an opportunity to change our format. If the messaging framework supports format deprecation, we might go that way as well to avoid a sudden change. But we don't currently have many (or maybe any) consumers, so I am not sure it is necessary for us.
I am not familiar with protocol buffers, but to me they seem useful mainly if you want to send the content in a compact binary form to save as much space as possible. If we send content that can already be interpreted as JSON, then building higher-level classes and objects on top of it seems unnecessary.
I think we could really just take that already existing generic framework you were talking about (RabbitMQ?) and make sure we can check the content against message schemas on the producer side (which is great for catching little bugs), and that we know how a message format can be deprecated (e.g. by the messaging framework adding a "deprecated_by: <topic>" field into each message, which should somehow log warnings on the consumer side). The framework could also automatically transform the messages into language-native structures: in Python, munches would probably be the nicest.
The whole "let's package schemas" thing seems like something we would typically do (because we are packagers), but not something that solves the actual problems you mentioned. Rather, it makes them more difficult to deal with, if I am correct.
I think what you are doing is good, but I think most people will welcome fewer dependencies and simpler language-native structures. So if we could make the framework go more in that direction, that would be great.
clime
On Tue, Aug 14, 2018 at 10:55 AM Jeremy Cline jeremy@jcline.org wrote:
-- Jeremy Cline XMPP: jeremy@jcline.org IRC: jcline
Oh yeah, protocol buffers are probably useful in strongly typed languages with lots of types, to maintain type information and to be able to parse the serialized content based on it. But if we are transferring JSON, we send the type information in the content itself ({}, [], "", <int>), so we don't need to know anything beyond the content; we just need to decide how to transform it into language-native structures. This is just an additional thing I've realized, and I am guessing a little here; it might not be relevant. Forgive me if it is not.
On Wed, Aug 15, 2018 at 2:53 PM Michal Novotny clime@redhat.com wrote:
Anyway, to summarize, I really really want this to be super easy to use
and just work. I hope we can improve it further and I'd love to hear your thoughts. Do you think my problem statements and design goals are reasonable? Given those, do you still feel like sending the schema along is worthwhile?
I actually no longer think it is worthwhile.
As a consumer, I can validate the JSON in a message matches the JSON
schema in the same message, but what does that get me? It doesn't seem any different (on the consumer side) than just parsing the JSON outright and trying to access whatever deserialized object I get.
I completely agree with this.
Let's go through the problems you mentioned:
- Make catching accidental schema changes as a publisher easy.
So we can just solve this by registering the scheme with the publisher first before any content gets published and based on the scheme, the publisher instance may check if the content intended to be sent conforms to the scheme, which could catch some bugs before the content is actually sent. If we require this to be done on publisher side, then there is actually no reason to send the schema alongside the content because the check has already been done so consumer already knows the message is alright when it is received. What should be sent, however, is a scheme ID, e.g. just a natural number. The scheme ID may be then used to version the scheme, which would be available somewhere publicly e.g. in the service docs the same way Github/Gitlab/etc publishes structures of their webhook messages. It would be basically part of public API of a service.
- Make catching mis-behaving publishers on the consuming side easy.
By checking against the scheme on the publisher side, this shouldn't be necessary. If someone somehow bypasses the publisher check, at worst the message won't be parsable, depending on how the message is being parsed. If someone wants to really make sure the message is what it is supposed to be, he/she can integrate the schema published on the service site into the parsing logic but I don't think that's necessary thing to do (I personally wouldn't do it in my code).
- Make changing the schema a painless process for publishers and consumers.
I think, the only way to do this is to send both content types simultaneously for some time, each message being marked with its scheme ID. It would be good if consumer always specified what scheme ID it wants to consume. If there is a higher scheme ID available in the message, a warning could be printed maybe even to syslog even so that consumers get the information. At the same time it should be communicated on the service site or by other means available. I don't think it is possible to make it any better than this.
I fail to see what's the point of packaging the schemas. If the message content is in json, then after receiving the message, I would like to be able to just call json.loads(msg) and work with the resulting structure as I am used to.
Actually, what I would do in python is that I would make it a munch and then work with it. Needing to install some additional package and instantiate some high-level objects just seems clumsy to me in comparison.
In other programming languages, this procedure would be pretty much the same, I believe as they all probably provide some json implementation.
You mentioned:
In the current proposal, consumers don't interact with the JSON at all,
but with a higher-level Python API that gives publishers flexibility when altering their on-the-wire format.
Yes, but with the current proposal if I change the on-the-wire API, I need to make a new version of the schema, package it and somehow get it to consumers and make them use the correct version that correctly parses the new on-the-wire format and translates it correctly to what the consumers are used to consume? That's seems like something very difficult to get done.
And also I don't quite see the point. I wouldn't alter the on-the-wire format if it is not actually what users work with and if I needed to go through all those steps described above.
If I need to alter the on-the-wire format because application logic has been somehow changed, then I would like to make the changes in the high-level API as well so again there is no gain there except more work with packaging new schemas.
My main point here is that trying to package the schemas to provide some high-level objects seems to be redundant. I think lots of people would just welcome to work something really simple, which is already provided in the language standard library.
For python, If I had to install and import just a single messaging library, say to what hub, topic, and scheme ID I want to listen and then consume the incoming messages immediately as munches, I would be super happy.
Actually, it might be the case the scheme ID is redundant as well and it can be just made part of the topic somehow, in which case the producer would probably just produce the content twice on a scheme change at least for some time. "Deprecated by <topic>" flag on an incoming message would be nice then. Of course, the producer would need to register the two schemas and mark one of them as deprecated. The framework would then send two messages simultaneously for him. This might be even easier solution to the problem. The exact publisher (producer) interface would need to be thought through.
The big problem is that right now the majority of messages are not
formatted in a way that makes sense and really need to be changed to be simple, flat structures that contain the information services need and nothing they don't. I'd like to get those fixed in a way that doesn't require massive coordinated changes in apps.
In Copr, for example, we take this as an opportunity to change our format. If the messaging framework will support format deprecation, we might go that way as well to avoid sudden change. But we don't currently have many (or maybe any) consumers so I am not sure it is necessary for us.
I am not familiar with protocol buffers but to me that thing seems rather useful, if you want to send the content in a compact binary form to save as much space as possible. If we will send content, which can be interpreted as json already, then to make some higher-level classes and objects on that seems already unnecessary.
I think we could really just take that already existing generic framework you were talking about (RabbitMQ?) and just make sure we can check the content against message schemas on producer side (which is great for catching little bugs) and that we know how a message format can get deprecated (e.g. by adding "deprecated_by: <topic>" field into each message by the messaging framework, which should somehow log warnings on consumer side), also the framework could automatically transform the messages into some language-native structures: in python, the munches would probably be the most sexy ones.
The whole "let's package schemas" thing seems like something we would typically do (because we are packagers) but not as something that would solve the actual problems you have mentioned. Rather it makes them more difficult to deal with if I am correct.
I think what you are doing is good but I think most people will welcome less dependencies and simpler language-native structures. So if we could make the framework go more into that direction, that would be great.
clime
On Tue, Aug 14, 2018 at 10:55 AM Jeremy Cline jeremy@jcline.org wrote:
On 08/13/2018 10:20 PM, Michal Novotny wrote:
So I got to know on the flock that fedmsg is going to be replaced?
Anyway, it seems that there is an idea to create schemas for the
messages
and distribute them in packages? And those python packages need to be present on producer as well as consumer?
JSON schemas
Message bodies are JSON objects, that adhere to a schema. Message
schemas
live in their own Python package, so they can be installed on the
producer
and on the consumer.
Could we instead just send the message schemas together with the message content always?
I considered this early on, but it seemed to me it didn't solve all the problems I wanted solved. Those problems are:
- Make catching accidental schema changes as a publisher easy.
- Make catching mis-behaving publishers on the consuming side easy.
- Make changing the schema a painless process for publishers and consumers.
Doing this would solve #1, but #2 and #3 are still a problem. As a consumer, I can validate the JSON in a message matches the JSON schema in the same message, but what does that get me? It doesn't seem any different (on the consumer side) than just parsing the JSON outright and trying to access whatever deserialized object I get.
In the current proposal, consumers don't interact with the JSON at all, but with a higher-level Python API that gives publishers flexibility when altering their on-the-wire format.
I would like to be able to parse any message I receive without some additional packages installed. If I am about to start listening to a new message type, I don't want to spend time to be looking up what i should install to make it work. It should just work. Requiring to have some packages with schemas installed on consumer and having to maintain them
by
the producer does not seem that great idea. Mainly because one of the raising requirements for fedmsg was that it should be made a generic messaging framework easily usable outside of Fedora Infrastructure. We should make it easy for anyone outside to be able to listen and
understand
our messages so that they can react to them. Needing to have some python packages installed (how are they going to be distributed PyPI + fedora
?)
seems to be just an unnecessary hassle. So can we send a schema with
each
message as documentation and validation of the message itself?
You can parse any message you receive without anything beyond a JSON parsing library. You can do that now and you'll be able to do that after the move. The problem with that is the JSON format might change. The schema alone doesn't solve the problem of changing formats, it just clearly documents what the message used to be and what it is now.
I'd love for this to just work and I'm up for any suggestions to make it easier, but I do think we need to make sure any solution covers the three problems stated above.
Finally, I do not want to create a generic messaging framework. I want something small that makes a generic messaging framework very easy to use for Fedora infrastructure specifically. I'm happy to help develop a generic framework (like Pika) when necessary, but I don't want to be in the business of authoring and maintaining a generic framework.
a) it will make our life easier
b) it will allow people outside of Fedora (who, e.g., don't tend to use Python) to consume our messages easily
c) what if I am writing a Ruby app, not a Python app? Do I then need to provide a Ruby schema as well as a Python schema? What if a consumer is a Ruby app? We should only need to write the consumer and producer parts in different languages. The message schemas should not be bound to a particular language, otherwise we are just adding more work for ourselves when somebody wants to use the messaging system in a language other than Python.
I agree, and that's why I chose json-schema. A different language just needs to wrap the schema in accessor functions. An alternative (and something I wanted to propose longer term, after the ZMQ->AMQP transition) is to use something like protocol buffers rather than JSON. The advantage there is a simplified schema format, it generally pushes you into a pattern of backwards compatibility (thus reducing the need for a higher-level API), and it auto-generates an object wrapper in many languages. You still potentially need to implement wrappers for access if you change the schema in a way that isn't additive, though.
You may notice (and it's not an accident) that the recommended implementation of a Message produces an API that is very similar to the one produced by a Python object generated by protocol buffers. This makes it possible to quietly change to protocol buffers without breaking consumers, assuming they're not digging into the JSON. I'm not saying we'll definitely do that, but it is still on the table and a transition _should_ be easy.
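A rough sketch of that accessor-style API (class and field names hypothetical): consumers touch properties, not the raw JSON, much like the objects protocol buffers generate.

```python
# Minimal base class holding a JSON body:
class Message:
    def __init__(self, body=None):
        self.body = body or {}

# A schema-specific subclass exposes the fields as properties:
class BuildCompleteMessage(Message):
    @property
    def package(self):
        return self.body["package"]

    @property
    def status(self):
        # Accessors can paper over missing or renamed fields.
        return self.body.get("status", "unknown")

msg = BuildCompleteMessage({"package": "copr", "status": "succeeded"})
print(msg.package, msg.status)
```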
The big problem is that right now the majority of messages are not formatted in a way that makes sense and really need to be changed to be simple, flat structures that contain the information services need and nothing they don't. I'd like to get those fixed in a way that doesn't require massive coordinated changes in apps.
Anyway, to summarize, I really really want this to be super easy to use and just work. I hope we can improve it further and I'd love to hear your thoughts. Do you think my problem statements and design goals are reasonable? Given those, do you still feel like sending the schema along is worthwhile?
-- Jeremy Cline XMPP: jeremy@jcline.org IRC: jcline
On 08/15/2018 01:53 PM, Michal Novotny wrote:

1. Make catching accidental schema changes as a publisher easy.
So we can solve this by registering the schema with the publisher before any content gets published; based on the schema, the publisher instance can check that the content about to be sent conforms to it, which could catch some bugs before the content is actually sent. If we require this on the publisher side, then there is actually no reason to send the schema alongside the content, because the check has already been done, so the consumer already knows the message is alright when it is received. What should be sent, however, is a schema ID, e.g. just a natural number. The schema ID can then be used to version the schema, which would be available somewhere publicly, e.g. in the service docs, the same way GitHub/GitLab/etc. publish the structures of their webhook messages. It would basically be part of the public API of a service.
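A sketch of that proposal (all names hypothetical): register the schema with the publisher, validate before sending, and ship only a numeric schema ID with the message instead of the whole schema.

```python
import json

# Hypothetical publisher-side schema registry:
SCHEMAS = {}

def register_schema(schema_id, required_fields):
    SCHEMAS[schema_id] = required_fields

def publish(schema_id, body):
    # Validate before anything leaves the publisher.
    missing = [f for f in SCHEMAS[schema_id] if f not in body]
    if missing:
        raise ValueError(
            "message does not conform to schema %d, missing: %s"
            % (schema_id, missing))
    # Stand-in for the real broker publish; return the wire payload.
    return json.dumps({"schema_id": schema_id, "body": body})

register_schema(1, ["package", "status"])
wire = publish(1, {"package": "copr", "status": "succeeded"})
```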
Yes, the schema needs a unique identifier. This is currently provided by the full Python path of the class, but could be done a different way, as you point out.
- Make catching mis-behaving publishers on the consuming side easy.
By checking against the schema on the publisher side, this shouldn't be necessary. If someone somehow bypasses the publisher check, at worst the message won't be parsable, depending on how the message is being parsed. If someone really wants to make sure the message is what it is supposed to be, they can integrate the schema published on the service site into the parsing logic, but I don't think that's a necessary thing to do (I personally wouldn't do it in my code).
So most of the work in fedora-messaging is to make this as painless as possible. Checking the message on both the publisher side and the consumer side is helpful for a few reasons. For example, what if the publisher changes the schema and fails to change its unique identifier (however that's being defined)?
- Make changing the schema a painless process for publishers and consumers.

I think the only way to do this is to send both content types simultaneously for some time, each message being marked with its schema ID. It would be good if the consumer always specified what schema ID it wants to consume. If a higher schema ID is available in the message, a warning could be printed, maybe even to syslog, so that consumers get the information. At the same time it should be communicated on the service site or by other means. I don't think it is possible to make it any better than this.
It's not the only way to do it, but it is one way to do it. It's all a matter of complexity and where you want to put that complexity. You can, for example, have a scheme where routing key (topic) includes the schema identity. With this approach, apps need to publish both for a while, and then drop the old topic at some point. That's fine.
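A sketch of the topic-versioning option (the topic names below are made up): publish under both the old and the new routing key during the transition, then drop the old topic.

```python
# Stand-in for a real AMQP publish call; returns what would be sent.
def publish(topic, body):
    return (topic, body)

# Old wire format, old topic:
old = publish("org.example.copr.build.end.v1", {"pkg": "copr"})
# New wire format, new topic, sent alongside the old one for a while:
new = publish("org.example.copr.build.end.v2", {"package": "copr"})
```

Consumers bind to whichever topic version they understand, and the producer stops publishing v1 once everyone has moved.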
You can run an intermediate service that knows the various schema and publishes every version. That's fine, too.
You can do what we opted to do and produce a schema, wrap it in a high-level API, and then change it without breaking that high-level API. This, of course, requires distributing that high-level API, thus the packaging. If this _isn't_ the route we go, the rule basically becomes "no changing your message schema ever, just make a new type".
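A sketch of how the high-level API absorbs a wire-format change: suppose v1 messages used "pkg" and v2 renamed it to "package" (field and class names hypothetical). Only the accessor changes, not the consumers.

```python
class BuildMessage:
    def __init__(self, body):
        self.body = body

    @property
    def package(self):
        # The accessor knows both wire formats; callers never do.
        if "package" in self.body:   # new wire format
            return self.body["package"]
        return self.body["pkg"]      # old wire format

# Consumers written against .package keep working across the change:
assert BuildMessage({"pkg": "copr"}).package == "copr"
assert BuildMessage({"package": "copr"}).package == "copr"
```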
It all comes to the same thing in the end; it's just a matter of where you deal with compatibility changes. What I'm advocating for is not making the wire format (some JSON dictionary) the public API, because that has worked very poorly for fedmsg. If the wire format were something like a protocol buffer, it would have some of these ideas built in and be more reasonable to use directly (although in some cases higher-level APIs can still be useful).
I fail to see the point of packaging the schemas. If the message content is JSON, then after receiving the message I would like to be able to just call json.loads(msg) and work with the resulting structure as I am used to.
Actually, what I would do in Python is make it a munch and then work with that. Needing to install an additional package and instantiate some high-level objects just seems clumsy to me in comparison.
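The workflow described above, sketched with the stdlib `SimpleNamespace` as a stand-in for munch: one `json.loads` call, then plain attribute access.

```python
import json
from types import SimpleNamespace

raw = '{"package": "copr", "status": "succeeded"}'
# object_hook turns every JSON object into an attribute-access namespace:
body = json.loads(raw, object_hook=lambda d: SimpleNamespace(**d))
print(body.package, body.status)
```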
It's certainly more work, yes. What you get in exchange is freedom to change your wire format. Maybe that's not something you need or want. The point of packaging the schema is to distribute the Python API without doing something crazy like pickle.
And for the record, I'm all in favor of running a PyPI mirror and deploying our apps to OpenShift with s2i, thus skipping RPM entirely. I'm fine with automatically converting them to RPM with a tool, too. The Python packaging is trivial (5 minutes, there's a template, then just upload it to PyPI).
In other programming languages, this procedure would be pretty much the same, I believe, as they all probably provide some JSON implementation.
You mentioned:
In the current proposal, consumers don't interact with the JSON at all, but with a higher-level Python API that gives publishers flexibility when altering their on-the-wire format.
Yes, but with the current proposal, if I change the on-the-wire API, I need to make a new version of the schema, package it, somehow get it to consumers, and make them use the correct version that parses the new on-the-wire format and translates it to what the consumers are used to consuming? That seems like something very difficult to get done.
Yes, you need to do that. If you don't do that, your alternative is a flag day where you update the producer and consumers at the exact same time and make sure no messages linger in queues. It's what we do now and it doesn't work.
The big problem is that right now the majority of messages are not formatted in a way that makes sense and really need to be changed to be simple, flat structures that contain the information services need and nothing they don't. I'd like to get those fixed in a way that doesn't require massive coordinated changes in apps.
In Copr, for example, we are taking this as an opportunity to change our format. If the messaging framework supports format deprecation, we might go that way as well to avoid a sudden change. But we don't currently have many (or maybe any) consumers, so I am not sure it is necessary for us.
I am not familiar with protocol buffers, but that thing seems rather useful to me if you want to send the content in a compact binary form to save as much space as possible. If we send content that can already be interpreted as JSON, then building higher-level classes and objects on top of that seems unnecessary.
I think we could really just take that already existing generic framework you were talking about (RabbitMQ?) and make sure we can check the content against message schemas on the producer side (which is great for catching little bugs), and that we know how a message format can get deprecated (e.g. by the messaging framework adding a "deprecated_by: <topic>" field into each message, which should somehow log warnings on the consumer side). The framework could also automatically transform the messages into language-native structures: in Python, munches would probably be the sexiest option.
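A sketch of that "deprecated_by" idea (the field name comes from the paragraph above; the topic is made up): the framework could emit a warning on the consumer side whenever the field is present.

```python
import warnings

def handle(message):
    # Warn the consumer that this message format is on the way out.
    if "deprecated_by" in message:
        warnings.warn("this topic is deprecated, switch to %s"
                      % message["deprecated_by"], DeprecationWarning)
    return message["body"]

body = handle({
    "deprecated_by": "org.example.copr.build.end.v2",  # hypothetical topic
    "body": {"package": "copr"},
})
```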
The whole "let's package schemas" thing seems like something we would typically do (because we are packagers), but not like something that solves the actual problems you mentioned. Rather, it makes them more difficult to deal with, if I am correct.
As many people will tell you, I am not a big believer in the "let's turn everything into RPMs manually" idea. I'm fine, as I mentioned, with just a Python package.
However, I think you're underestimating the power of a high-level API. Consider, for example, message notifications (FMN and whatever its successor looks like).
There needs to be a consistent way to take that message, whatever its wire format, and turn it into a human-readable message. There needs to be a consistent way to extract the users associated with a message. There needs to be a way to extract what packages are affected by the message.
The current solution is fedmsg-meta-fedora-infrastructure, a central Python module where schemas are poorly encoded as a series of if/else statements. It also regularly breaks when message schemas change. For example, I have 2200 emails from the notifications system about how some Copr and Bodhi messages are unparsable. No one remembers to update the package, and it ultimately means their messages are dropped or arrive to users as incomprehensible JSON.
With the current approach, you can just implement a __str__ method on a class you keep in the same Git repo you use for your project. You can write documentation on the classes so users can see what messages your projects send. You can release it whenever you see fit, not when whoever maintains fedmsg-meta has time to make a release.
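A sketch of that __str__ approach (class and field names hypothetical): the class lives in the app's own repository, so a schema change and its human-readable rendering ship together.

```python
class BuildEndMessage:
    def __init__(self, body):
        self.body = body

    def __str__(self):
        # The human-readable form used by notification services.
        return "Copr build of %s finished: %s" % (
            self.body["package"], self.body["status"])

print(BuildEndMessage({"package": "foo", "status": "succeeded"}))
```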
It seems like your main objection is the Python package. Personally, I think making a Python package is a trivial amount of work for the benefit of being able to define an arbitrary Python API to work with your messages, but maybe that's not a widely-shared sentiment. If it's not and we decide the only thing we really want in addition to the message is a human-readable string, maybe we could include that in the message in a standard way. Things like i18n notifications might no longer be as easy, though.
On Thu, Aug 16, 2018 at 11:43 AM Jeremy Cline jeremy@jcline.org wrote:
Yes, you need to do that. If you don't do that, your alternative is a flag day where you update the producer and consumers at the exact same time and make sure no messages linger in queues. It's what we do now and it doesn't work.
Well, or send both messages simultaneously for some time.
The current solution is fedmsg-meta-fedora-infrastructure, a central Python module where schema are poorly encoded by a series of if/else statements. It also regularly breaks when message schema change. For example, I have 2200 emails from the notifications system about how some Copr and Bodhi messages are unparsable. No one remembers to update the package, and it ultimately means their messages are dropped or arrive to users as incomprehensible JSON.
Yup, on behalf of Copr, I am sorry for that. This was caused by some bugs in our code. But these things would be caught by the publisher validation in the new framework. By the way, we would also like to have validators like "NEVRA" available, maybe in a library, or maybe we can implement it ourselves. In one of the instances, we weren't sending the release (I think) and it broke the fedmsg-meta service. That service is kind of sensitive.
It seems like your main objection is the Python package. Personally, I think making a Python package is a trivial amount of work for the benefit of being able to define an arbitrary Python API to work with your messages, but maybe that's not a widely-shared sentiment. If it's not and we decide the only thing we really want in addition to the message is a human-readable string, maybe we could include that in the message in a standard way.
That might also be a way.
Things like i18n notifications might no longer be as easy, though.
-- Jeremy Cline XMPP: jeremy@jcline.org IRC: jcline
On 08/16/2018 11:53 AM, Michal Novotny wrote:
On Thu, Aug 16, 2018 at 11:43 AM Jeremy Cline jeremy@jcline.org wrote:
<snip>
Yup, on behalf of Copr, I am sorry for that. This was caused by some bugs in our code. But these things would be captured by the publisher validation in the new framework. By the way, we would also like to have validators like "NEVRA" available, maybe in a library, maybe we can implement it ourselves. In one of the instances, we weren't sending release (I think) and it broke the fedmsg-meta service. That service is kind of sensitive.
Yes, it is sensitive. And to be clear, I'm not pointing fingers here. It's just a good example of how what we're doing now doesn't work. I want to put the ability (and responsibility) to making a message readable and documented in the hands of app maintainers.
So, am I right in saying your main objection is the Python package? Or do you object to then packaging that as an RPM?
On Thu, Aug 16, 2018 at 4:34 PM Jeremy Cline jeremy@jcline.org wrote:
So, am I right in saying your main objection is the Python package? Or do you object to then packaging that as an RPM?
I don't really have any objections. I would just like to be able to read messages as simple language-native structures and not depend on anything other than the base messaging framework when publishing or receiving messages.
-- Jeremy Cline XMPP: jeremy@jcline.org IRC: jcline
On 08/16/2018 06:22 PM, Michal Novotny wrote:
I don't really have any objections. I would just like to be able to read messages as simple language-native structures and not depend on anything other than the base messaging framework when publishing or receiving messages.
Okay. You can do that now. There's a base Message class whose only schema restriction is that the message is a JSON object. You're free to use that and access the JSON directly, or to use an AMQP client directly (as long as you follow the same on-the-wire format). Just be aware that any messages you send that way will integrate poorly (or not at all) with services like notifications.
I really don't recommend this approach since we've been down this path before and it's worked rather poorly, but I also don't want these tools to be a burden.
On Fri, Aug 17, 2018 at 9:40 AM Jeremy Cline jeremy@jcline.org wrote:
Okay. You can do that now. There's a base Message class whose only schema restriction is that the message is a JSON object. You're free to use that and access the JSON directly, or to use an AMQP client directly (as long as you follow the same on-the-wire format).
Will the framework provide me with a way to automatically validate against a schema that I have locally defined, e.g.

    my_locally_defined_schema_that_i_can_then_publish_in_docs = {
        "type": "object",
        "properties": {
            "price": {"type": "number"},
            "name": {"type": "string"},
        },
    }

if I want to use just the base Message? It would be good if I could pass the body_schema to the Message __init__ method and have it validated by api.publish, which would then invoke e.g. message.validate(). Would that be possible?
Just be aware that any messages you send that way will integrate poorly (or not at all) with services like notifications.
I would like it to integrate well if possible...
I really don't recommend this approach since we've been down this path before and it's worked rather poorly
Okay, but the question is why it was working poorly. From what I've seen, the problems with fedmsg-meta would be solved just by the explicit schema validation on the publisher side, which is a really cool thing you are introducing and it will help a lot.
Another thing: it is quite unclear why that service got completely stuck when it couldn't process a message. Normally, you would just collect what you can from the incoming message and send what you have collected (i.e. with some fields not filled in). It only constructed human-readable strings for notifications, right? I don't think that's so mission-critical, even though it is important.
But anyway, if we have publisher-side validation built into the framework, that's enough to solve that problem. Why would we need anything else?
Schema are Immutable

Message schema should be treated as immutable. Once defined, they should not be altered. Instead, define a new schema class, mark the old one as deprecated, and remove it after an appropriate transition period.

I think that adding new fields into the JSON should simply be allowed. That's not a backward-incompatible change. The consumer doesn't use those fields, because they have only just been introduced, so you can't break a consuming script that way.
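Why adding a field is backward compatible, as argued above: an existing consumer only reads the fields it knows about (the field names below are made up for illustration).

```python
old_message = {"package": "copr"}
new_message = {"package": "copr", "chroot": "fedora-rawhide-x86_64"}

def consumer(body):
    # Written before "chroot" existed; unknown keys are simply ignored.
    return body["package"]

assert consumer(old_message) == consumer(new_message)
```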
Finally, you must distribute your schema to clients. It is recommended that you maintain your message schema in your application's git repository in a separate Python package. The package name should be <your-app-name>_schema.
What I was thinking about is having the .json schema in a separate file, with the recommendation that it contain a URL to itself as its ID (https://pagure.io/copr/copr/raw/master/f/dist-git/my_schema.json). Given that the schema can be self-descriptive (because there are the 'description' fields), this should be completely enough to even provide the documentation, and at the same time that schema will be the actual schema the publisher uses for validation.
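A sketch of such a self-describing schema file, using the URL from the paragraph above as the "$id" and 'description' fields as documentation (the properties and descriptions below are made up for illustration):

```python
import json

schema = {
    "$id": "https://pagure.io/copr/copr/raw/master/f/dist-git/my_schema.json",
    "type": "object",
    "description": "Message sent when a dist-git import finishes.",
    "properties": {
        "package": {"type": "string", "description": "The package name."},
    },
    "required": ["package"],
}
# This one document serves as docs, public API, and validation input:
print(json.dumps(schema, indent=2))
```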
I think this is a much more relaxed approach which, at the same time, solves the problem of a publisher sending something other than what they think they are sending. I think that was the main problem we were having, no?
-- Jeremy Cline XMPP: jeremy@jcline.org IRC: jcline
On 08/17/2018 04:18 PM, Michal Novotny wrote:
if I want to use just the base Message? It would be good if I could pass the body_scheme to Message __init__ method and have that validated by api.publish, which would then invoke e.g. message.validate(). Would it be possible?
If this is all you want, just call ``jsonschema.validate`` yourself, then send it as a base Message.
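For illustration, what "just call ``jsonschema.validate`` yourself" looks like with the schema from the earlier mail (``jsonschema`` is the third-party library; the publish line is only a hedged sketch of how it might then be sent with fedora-messaging):

```python
import jsonschema  # third-party: pip install jsonschema

schema = {
    "type": "object",
    "properties": {
        "price": {"type": "number"},
        "name": {"type": "string"},
    },
}

body = {"price": 3.5, "name": "example"}
jsonschema.validate(body, schema)  # raises ValidationError on mismatch
# Then publish it as a plain base Message, e.g.:
#   api.publish(message.Message(body=body))
```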
I would like it to integrate well if possible...
Well... If FMN doesn't have a way to understand the messages, it can't really do nice integration. I'm not sure what else to say here other than if you want nice integration, you'll need to spend the ~15 minutes required to publish your schema + minimal Python class.
Seriously, it's just:
    $ python setup.py sdist bdist_wheel
    $ twine upload <dists>
It's *really* easy. The time we've spent on this is vastly more than the amount of work required.
I really don't recommend this approach since we've been down this path before and it's worked rather poorly
Okay, but the question is why it was working poorly. From what I've seen, the problems with fedmsg-meta would be solved just by the explicit schema validation on the publisher side, which is a really cool thing you are introducing and will help a lot.
Another thing: it is quite unclear why that service got completely stuck when it couldn't process a message. Normally, you would just collect what you can out of the incoming message and send what you have collected (i.e. with some fields left unfilled). It only constructed human-readable strings for notifications, right? I don't think that's mission critical, even though it is important.
People do care about notifications some, but they become much less useful if they're incomprehensible JSON. FMN has plenty of problems beyond the ever-changing messages that I won't burden you with, but yes, a reasonable service handles bad input gracefully. This is why all unparsable messages are now dropped by the notification service.
But anyway, if we have validation on the publisher side built into the framework, that's enough to solve that problem. Why would we need anything else?
Defining an API above a JSON object can be very useful. If you don't need it, that's great, but it doesn't mean others don't want to use it.
Schema are Immutable
Message schema should be treated as immutable. Once defined, they should
not be altered. Instead, define a new schema class, mark the old one as deprecated, and remove it after an appropriate transition period.

I think that adding new fields into the JSON should simply be allowed. That's not a backward-incompatible change: the consumer doesn't use those fields because they have only just been introduced, so you can't break a consuming script that way.
I agree the documentation here should be changed. Schema should not change in a backwards-incompatible way, but additional keys in an object are (by default) okay in JSON schema.
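That default is easy to demonstrate: unless a schema explicitly sets ``additionalProperties: false``, instances carrying extra keys still validate. A sketch with the ``jsonschema`` package (the field names are illustrative):

```python
import jsonschema

schema = {
    "type": "object",
    "properties": {"name": {"type": "string"}},
}

# A message that grew a new field still validates against the old schema,
# because JSON schema permits additional properties by default.
jsonschema.validate({"name": "copr", "new_field": 42}, schema)

# Only an explicit additionalProperties: false makes extra keys an error.
strict = dict(schema, additionalProperties=False)
try:
    jsonschema.validate({"name": "copr", "new_field": 42}, strict)
except jsonschema.ValidationError:
    print("extra key rejected by the strict schema")
```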
Finally, you must distribute your schema to clients. It is recommended
that you maintain your message schema in your application’s git repository in a separate Python package. The package name should be <your-app-name>_schema.
What I was thinking about is keeping the .json schema in a separate file, with the recommendation that it contain its own URL as its ID (https://pagure.io/copr/copr/raw/master/f/dist-git/my_schema.json). Given that the schema can be self-descriptive (because of the 'description' fields), this should be entirely enough to provide the documentation, and at the same time that schema will be the actual schema the publisher uses for validation.
I was looking into a Sphinx plugin to render documentation from the JSON schema using its 'description' fields, so storing the docs in it makes sense. Storing them online is also useful for cross-language tooling and is a feature of JSON schema.
I think this is a much more chilled-out approach which, at the same time, solves the problem of the publisher sending something other than what he/she thinks he/she is sending. I think that was the main problem we were having, no?
It's *a* problem we're having, but there are other problems, as I've enumerated in this thread. If you are not having those problems and you're okay with minimal integration with other services like notifications, then there are innumerable ways to send messages. This library is completely optional; you just need an AMQP client library and a few basic headers so that clients using the fedora-messaging library don't ignore your messages.
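For a non-Python producer, that boils down to serializing the body as JSON and attaching a couple of headers before handing the bytes to whatever AMQP client the language provides. A hedged sketch using only the standard library: the header names and severity value below are my assumptions about what fedora-messaging consumers look for, not something stated in this thread, so check the library's documentation before relying on them.

```python
import json

body = {"name": "python-requests", "price": 0}

# Headers fedora-messaging clients reportedly inspect; names and values
# here are assumptions, included only to show the shape of the payload.
headers = {
    "fedora_messaging_schema": "base.message",
    "fedora_messaging_severity": 20,  # hypothetical INFO-level value
}

# The wire body is just UTF-8 JSON.
payload = json.dumps(body).encode("utf-8")

# With any AMQP client you would then publish `payload` with content type
# application/json and `headers` attached, e.g. with pika (untested sketch):
#   channel.basic_publish("amq.topic", "my.topic", payload, properties)
print(json.loads(payload.decode("utf-8")) == body)
```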
infrastructure@lists.fedoraproject.org