Hey, just a heads up
the package Django has been deprecated on EPEL5 and EPEL6. On EPEL6, you can use Django14 instead, if you don't require Django 1.3. Please note, the latter no longer receives security updates and contains at least one known weakness.
For EL5 it's not that simple. Django-1.1 is far older; I don't know how many known security issues exist. Newer Django versions require a newer Python there.
On 06/06/2013 10:21 AM, Matthias Runge wrote:
[...]
Is there any way we can get usage statistics or poll the EPEL 5 users, to determine whether there is any value in trying to build Django15 atop the python26 stack? I expect this would be quite an undertaking, given all the dependencies...
On 7 June 2013 05:53, Stephen Gallagher sgallagh@redhat.com wrote:
[...]
Not that I know of. The only statistics we have are which repo people are looking for, not what they are looking for in a repo.
On 06/07/2013 09:19 AM, Stephen John Smoogen wrote:
[...]
Just to continue down this path for a moment, what are the ways we can track or get ideas of what is important to EPEL users? If we could improve what we have already, how would we do so? (Let's pretend for a moment that we can get a developer to help build and maintain something ... what would we have that person do?)
- Karsten
On 7 June 2013 10:48, Karsten 'quaid' Wade kwade@redhat.com wrote:
[...]
Most of the methods would require users to run something that tells us what they have installed, which usually violates various business rules and/or needs approval up a management chain. Since anything like a census application would require all kinds of approval, it would only get installed on volunteer systems, which tend to be atypical in usage.
The easiest way I can see is to get a better sampling method, which would be to fund a mirror that we then put into MirrorManager, so that we know this is a sample rather than a request for info. (Basically, we would see which packages are downloaded directly, and then extrapolate from that sample to the 500,000 systems that check in via MirrorManager.) The problems involved are paying for the systems, storage, and bandwidth.
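(A minimal sketch of that extrapolation in Python; all numbers, including the mirror's share of MirrorManager checkins, are invented placeholders rather than real figures:)

    # Hypothetical back-of-the-envelope scaling from one sampled mirror
    # to the whole MirrorManager population. All numbers are invented.
    TOTAL_CHECKIN_SYSTEMS = 500000   # systems checking in via MirrorManager
    MIRROR_CHECKIN_SYSTEMS = 5000    # systems we observe on our own mirror

    def estimate_total_downloads(sampled_downloads):
        """Scale downloads seen on our mirror up to the full population,
        assuming our mirror's clients are a representative sample."""
        sample_fraction = MIRROR_CHECKIN_SYSTEMS / TOTAL_CHECKIN_SYSTEMS
        return round(sampled_downloads / sample_fraction)

    # e.g. 120 Django downloads seen here -> ~12,000 estimated overall
    print(estimate_total_downloads(120))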
On Fri, Jun 07, 2013 at 01:31:36PM -0600, Stephen John Smoogen wrote:
[...]
Maybe one of the mirrors would be able to provide logs?
On 7 June 2013 13:48, Matthew Miller mattdm@fedoraproject.org wrote:
[...]
Possibly. In the past, mirror admins have not wanted to do so, for many reasons: they can't keep logs longer than 24 hours for policy reasons; they can't hand over logs without a formal agreement, and then only with as much redacted as possible; and "if we do it for X then we have to do it for everyone, so no thank you." When I was at my university gig, the request had to go up four levels of management before I gave up at the sub-CIO level.
I have tried looking at the top-level mirrors, but most of the data is swamped by other sites mirroring us and by lots of people doing development work and pointing at repos directly. This led to some strange statistics: even after pulling out most of the noise, various packages would "stand out" until I realized they were pulled in for cross-compiles and such (or by the site that likes to do partial mirrors every couple of hours but always pulls in the same 4 packages each time, even when it pulls in others). I expect other mirrors will run into the same thing, which means the data most sites could actually give out (just the URLs per day, rather than IP address plus URL) would carry a lot of weird noise. That noise could make, say, zvbi show up high because it is both the last package mirrored on the server and a dependency of 8 other packages (not literally true, but I can't remember the package that actually showed up a ton).
In either case, it is what got me to realize that a dedicated mirror is needed to allow for better statistics of this sort, because the data can be cleaned as needed rather than pre-cleaned and then reconstructed.
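(To illustrate the kind of cleaning involved, a rough sketch, assuming we had per-client logs; the half-the-repo cutoff for spotting bulk mirror syncs is an arbitrary assumption, not anything EPEL actually uses:)

    # Drop clients that fetch a very large fraction of the repository;
    # they are probably mirrors or bulk syncs, not real installs.
    from collections import Counter, defaultdict

    def count_real_downloads(events, repo_size, bulk_fraction=0.5):
        """events: iterable of (client_id, package_name) pairs."""
        per_client = defaultdict(set)
        for client, pkg in events:
            per_client[client].add(pkg)
        counts = Counter()
        for client, pkgs in per_client.items():
            if len(pkgs) / repo_size >= bulk_fraction:
                continue  # looks like a full/partial mirror sync; skip
            counts.update(pkgs)
        return counts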
On 06/07/2013 05:08 PM, Stephen John Smoogen wrote:
[...]
Compelling information, thanks. I might still want to pursue improving data collection across the existing mirror network, but for now I like your idea of inserting a tracking-mirror into the system.
I've been doing a lot of thinking lately about mirroring, logs, and anonymity, because I think we want more data about EPEL usage without raising privacy or other legal concerns. My impetus is simple: EPEL is an enormously important and popular part of the Fedora Project for all of us, and my job is helping make such projects wildly successful. :) To figure out what wild success means and to track our progress, we need a better handle on usage.
A tracking-mirror could go something like this:
* Logs are rotated out to the trash regularly, e.g. every 24 hours.[1]
* Data is gathered from the logs in real time in an anonymous fashion, so nothing non-anonymous is inserted into the database. No connection is retained between the data in the database and the logs not yet thrown away.
* The log data gathering process attempts to cleanse in real time, before writing to the database. (This aligns with your idea, yes? See the sketch after this list.)
* Work closely with the cleansing tools for a period of time to get a handle on the sorts of confusion you've experienced; see if programmatic predictions can help keep watch in the future (e.g. alert on unusual spikes in traffic to a small package set with certain patterns, such as packages near each other alphabetically or often used together as dependencies).
* We use statistical analysis to extrapolate wider conclusions.
* We make it possible to grow this tracking-mirror network within the existing mirror network to improve the dataset.
* Throughout, code and configurations are handled transparently, so it is clear to community members not only that a better quality of tracking is happening, but also what the results of that tracking are (the analysis itself, and the actions taken from it that benefit users), and that all the details are there showing how privacy is protected.
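(A minimal sketch of the gathering step, assuming Apache-style access logs; the regex, filenames, and schema are illustrative assumptions, not a worked-out design:)

    # Read mirror access-log lines, keep only (day, package) for
    # successful RPM downloads, and never let the client address
    # reach the database.
    import re
    import sqlite3

    LINE_RE = re.compile(
        r'\S+ \S+ \S+ \[(?P<day>[^:]+):[^\]]+\] '
        r'"GET (?P<path>\S+) HTTP/[\d.]+" (?P<status>\d{3})'
    )

    def ingest(log_lines, db_path="downloads.sqlite"):
        db = sqlite3.connect(db_path)
        db.execute("CREATE TABLE IF NOT EXISTS hits (day TEXT, package TEXT)")
        for line in log_lines:
            m = LINE_RE.match(line)
            if not m or m.group("status") != "200":
                continue
            path = m.group("path")
            if not path.endswith(".rpm"):
                continue
            # Only the day and the package name are stored; the IP
            # address at the start of the raw line is dropped here.
            db.execute("INSERT INTO hits VALUES (?, ?)",
                       (m.group("day"), path.rsplit("/", 1)[-1]))
        db.commit()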
I'm interested in championing this idea to get the resources (server, bandwidth, people, code, etc.) to make at least the initial mirror happen. With the right plan, I could see getting things in place pretty quickly, e.g. by September.
- Karsten
[1] We could consider sending logs directly to /dev/null after data collection, if we felt the data collection was sufficient. The main risk there is reducing our ability to troubleshoot. It's an interesting thought exercise, at least, to find a way toward dropping non-anonymous information without even a millisecond of retention: pulling the anonymous data into the dataset, then cleansing the stream toward privacy before writing it to the log. For example, it might be sufficient for troubleshooting to know the class C IP block but drop the specific IP address.
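(A tiny sketch of that last idea, /24 truncation of IPv4 addresses, written as an illustration rather than a proposal:)

    # Truncate a client IPv4 address to its /24 ("class C") block
    # before the request line is ever written to the log.
    import ipaddress

    def truncate_to_block(ip_str):
        net = ipaddress.ip_network(ip_str + "/24", strict=False)
        return str(net)

    print(truncate_to_block("203.0.113.57"))  # -> "203.0.113.0/24"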
On 7 June 2013 19:21, Karsten 'quaid' Wade kwade@redhat.com wrote:
[...]
I have been trying to come up with a better way of saying the following but haven't been able to.
Please do not use the phrase "anonymous data." Trying to make data truly anonymous takes a LOT of work, with nebulous gain. You have to do more than just swap out IP addresses for something else: you have to remove timestamps, shuffle data around, drop some data and duplicate other data, and all other kinds of things which, done wrong, can either fail to really anonymize the data or make the data worthless for determining what is going on in it. PhDs come up with new methods all the time that fall apart in reality because of some assumption that was forgotten.
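(A small illustration of the point: salted hashing of IPs, a common first instinct, is pseudonymization rather than anonymization; the salt name here is hypothetical:)

    # Replacing IPs with a salted hash keeps every request from one
    # host linkable under the same token, so timing and access
    # patterns can still re-identify the host.
    import hashlib

    SALT = b"rotate-me-sometime"  # hypothetical salt; while it stays
                                  # fixed, so does the IP -> token map

    def pseudonymize(ip):
        return hashlib.sha256(SALT + ip.encode()).hexdigest()[:12]

    # Same host, same token -> records remain linkable.
    assert pseudonymize("203.0.113.57") == pseudonymize("203.0.113.57")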
We cannot promise anonymity, and trying to is not something I can see happening in a volunteer organization.
Two: throwing away logs gets you into trouble, because the first thing you find is that you have a new question you can't answer with your old data, because you weren't logging it. At that point you need 6 months of new data before you can answer the question. Logs are also useful when you run into other issues, like "Hey, look, someone broke into the system; how did they do that?" Cross-referencing http/ftp/rsync logs against the break-in usually shows where the attacker was really starting from, which can help others. I would say that any logs we keep are kept for X time, where X is longer than 6 months and less than 2 years.
If a mirror is set up, it is set up. Data is collected, stored, and analyzed following the laws and the rules of conduct established for the people who can view and analyze that data. What is published from it follows those laws and rules of conduct as well. Going beyond that without a staff of trained and knowledgeable statisticians who have done this sort of thing before is a recipe for disaster.
On 06/08/2013 09:23 AM, Stephen John Smoogen wrote:
[...]
OK, I get what you are saying; you make good sense.
Let me go back a few steps to see if I'm trying to solve a problem that doesn't need solving.
We as sysadmins know that the Internet is not designed to be an anonymous place. People may not think about it much, but their daily journeys across the Internet are easily tracked back to them. We can call that a not-well-known fact.
So, thinking about that fact, what I said (and you tore down) doesn't really make sense: it's trying to anonymize information that people aren't intending to be anonymous, by the simple fact that they are connected to the public Internet. Even if they aren't aware of how easy it is to backtrack the IP connections they make, that ease is the nature of the network.
Privacy policies, then, are just ways of saying what one is or is not going to do with collected non-anonymous data. Perhaps we should just have a robust, clear, and well-known privacy policy?
One aspect of anonymity we can't easily ignore is the spectre of a court order coming to open up data protected by that privacy policy. Once we have collected and retained data, our responsibilities around that data seem to go up greatly. Therefore there is a temptation from a certain mindset to retain nothing. What's the best compromise?
In terms of the goal of collecting data to help EPEL, I presume anything we did with analysis would want to include making available the analyzed dataset. Is that possible to do while protecting privacy?
Maybe privacy is the goal more than anonymity? And can we make datasets available by obfuscating certain details to protect privacy? Maybe there is an "anonymous enough" position we can take?
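(One possible "anonymous enough" sketch for a published dataset: release only aggregated per-package daily counts and suppress small ones; the threshold of 5 is an illustrative assumption, not an established policy:)

    # Publish per-(day, package) counts, suppressing rare rows so a
    # single unusual request can't single anyone out.
    from collections import Counter

    def publishable_counts(hits, min_count=5):
        """hits: iterable of (day, package) pairs from the private DB."""
        counts = Counter(hits)
        return {key: n for key, n in counts.items() if n >= min_count}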
- Karsten
On 8 June 2013 11:27, Karsten 'quaid' Wade kwade@redhat.com wrote:
In terms of the goal of collecting data to help EPEL, I presume anything we did with analysis would want to include making available the analyzed dataset. Is that possible to do while protecting privacy?
That is the rub. Once you get to the point of sharing the data, it falls under a whole different set of rules than if you keep it, and that is where protecting the privacy of individuals starts coming up and penalties for not doing so become a problem. At this point I would say that any mirror we set up would probably keep its data private, unless it was set up by an organization whose job is to publish such data. [E.g., a nonprofit set up to take the pulse of the internet doing it is different from Red Hat, or something paid by Red Hat, doing so.]
Maybe privacy is the goal more than anonymity? And can we make datasets available by obfuscating certain details to protect privacy? Maybe there is an "anonymous enough" position we can take?
I do not know enough to comment, beyond that when this has been brought up in the past, most legal answers amounted to asking whether I had been smoking pot in the last 20 minutes.