On Mon, Apr 27, 2009 at 07:43:01PM -0800, Jeff Spaleta wrote:
On Mon, Apr 27, 2009 at 6:30 PM, Jeff Spaleta jspaleta@gmail.com wrote:
On Mon, Apr 27, 2009 at 6:26 PM, Paul W. Frields stickster@gmail.com wrote:
I did two counts, one for DVD and Live without uniq'ing the IP addresses doing the retrievals (because there could be multiple downloads from people behind firewalls), and one with. In both cases I think the numbers are very significantly higher than our current stats show. The raw numbers per day since F10 release are attached.
All those numbers in the txt use the "corrected" approach?
Just to be clear: what were the stats showing before? Just the unique DVD counts?
The download numbers before were using the direct download command shown on this wiki page: https://fedoraproject.org/wiki/Statistics/Commands
There was at least one major problem with that command: the Fedora 10 Live ISO filenames don't start with "Fedora-10"; they start with "F10", meaning we weren't counting them *at all*. So I've included two counts, one for the DVD and one for the Live ISO as clicked from the get-fedora page. (Other spins, as far as I can tell, are done via torrent and we're capturing those statistics from the tracker elsewhere.)
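A minimal sketch of the filename problem, in Python rather than the shell commands on the wiki page: a pattern that only accepts names starting with "Fedora-10" misses the Live image, while one that accepts either prefix catches both. The two patterns and the DVD path are assumptions for illustration, not the actual wiki command.

```python
import re

# Pattern in the spirit of the old count: only names starting with "Fedora-10".
OLD_PATTERN = re.compile(r"/Fedora-10[^/]*\.iso$")
# Broader pattern: accept either the "Fedora-10" or the "F10" prefix.
NEW_PATTERN = re.compile(r"/(?:Fedora-10|F10)[^/]*\.iso$")

paths = [
    # Assumed DVD path, for illustration only.
    "/pub/fedora/linux/releases/10/Fedora/i386/iso/Fedora-10-i386-DVD.iso",
    # Live ISO path.
    "/pub/fedora/linux/releases/10/Live/i686/F10-i686-Live.iso",
]

old_hits = [p for p in paths if OLD_PATTERN.search(p)]
new_hits = [p for p in paths if NEW_PATTERN.search(p)]
print(len(old_hits), len(new_hits))
```

With these two paths the old pattern matches only the DVD image, and the broader one matches both.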
The second problem may not be a real problem -- it's that we are uniq-ing the IP addresses doing the downloading. That method has the potential to cut out legitimate, repetitive downloads from inside a firewall. I'd feel better cutting those ticks out if they were separated by a very short timeframe. Then we could be reasonably certain they were caused by repeated clicks, rather than actual separate downloads. In the interest of a conservative approach, I'm willing to stick with uniq-ing the stats, but I produced both sets of numbers anyway.
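To make the trade-off concrete, here is a rough Python sketch comparing three counts: raw hits, unique IPs, and hits where repeats from the same IP are dropped only if they follow within a short window. The sample hits and the 10-minute window are invented for illustration; nothing here is taken from the real logs.

```python
from datetime import datetime, timedelta

# (ip, timestamp) pairs for hits on one ISO; the values are invented.
hits = [
    ("192.0.2.1", datetime(2009, 3, 22, 10, 0, 0)),
    ("192.0.2.1", datetime(2009, 3, 22, 10, 0, 5)),   # repeated click, 5 s later
    ("192.0.2.1", datetime(2009, 3, 22, 15, 30, 0)),  # possibly another user behind the same firewall
    ("192.0.2.7", datetime(2009, 3, 22, 11, 0, 0)),
]

def count_with_window(hits, window):
    """Count hits, skipping any hit that follows the previous hit
    from the same IP by less than `window`."""
    last_seen = {}
    count = 0
    for ip, ts in sorted(hits, key=lambda h: h[1]):
        previous = last_seen.get(ip)
        if previous is None or ts - previous >= window:
            count += 1
        last_seen[ip] = ts
    return count

raw = len(hits)
unique = len({ip for ip, _ in hits})
windowed = count_with_window(hits, timedelta(minutes=10))  # assumed window
print(raw, unique, windowed)
```

On this toy data the windowed count sits between the raw and uniq'ed counts, which is the middle ground described above.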
On Tue, Apr 28, 2009 at 8:29 AM, Paul W. Frields stickster@gmail.com wrote:
uniq-ing the IP addresses doing the downloading. That method has the potential to cut out legitimate, repetitive downloads from inside a firewall. I'd feel better cutting those ticks out if they were
I'm not sure if you can see this in our logs or not (you might have to have the individual mirrors' logs :( ), but if the response code is a 206, that means it was a RANGE request - to download part of a file. It's not at all uncommon for a download manager to open 20-30 connections to download the same file for the same user.
So I'd opt for the conservative approach of uniques as well.
On Tue, Apr 28, 2009 at 08:44:15AM -0400, Jon Stanley wrote:
I'm not sure if you can see this in our logs or not (you might have to have the individual mirrors' logs :( ), but if the response code is a 206, that means it was a RANGE request - to download part of a file. It's not at all uncommon for a download manager to open 20-30 connections to download the same file for the same user.
So I'd opt for the conservative approach of uniques as well.
To clarify, I'm already filtering these out on a 302 code. How would that change your opinion, if at all?
On Tue, Apr 28, 2009 at 8:57 AM, Paul W. Frields stickster@gmail.com wrote:
To clarify, I'm already filtering these out on a 302 code. How would that change your opinion, if at all?
302s are just redirects, which is all we do. What I'm not sure of is whether the download managers would be intelligent enough to hit the URL that they're redirected to 20-30 times, or whether they hit us 20-30 times. I guess it probably depends on which download manager is in use.
On Tue, Apr 28, 2009 at 08:57:50AM -0400, Paul W. Frields wrote:
To clarify, I'm already filtering these out on a 302 code. How would that change your opinion, if at all?
Isn't 302 a (temp) relocation? Why would filtering out 302 also filter out 206?
Maybe the cleanest solution is to count downloaded bytes and divide by image size. That way you properly count ranged downloads.
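A back-of-the-envelope sketch of the bytes-divided-by-size idea, in Python. It assumes logs that actually record the bytes sent for 200 and 206 responses (as a mirror's logs would); the image size and the request list are invented.

```python
IMAGE_SIZE = 700 * 1024 * 1024  # assumed image size in bytes; use the real ISO size

# (status, bytes_sent) for requests on one ISO; invented values.
requests = [
    (200, IMAGE_SIZE),       # one full download
    (206, IMAGE_SIZE // 4),  # four ranged pieces of a second download
    (206, IMAGE_SIZE // 4),
    (206, IMAGE_SIZE // 4),
    (206, IMAGE_SIZE // 4),
    (206, IMAGE_SIZE // 2),  # an aborted half download
]

# Sum the bytes actually served, then express them in units of one image.
total_bytes = sum(size for status, size in requests if status in (200, 206))
equivalent_downloads = total_bytes / IMAGE_SIZE
print(equivalent_downloads)
```

Ranged pieces add up to whole downloads, and aborted transfers count only fractionally, so 20-30 parallel connections from one download manager no longer inflate the total.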
On Tue, Apr 28, 2009 at 04:52:20PM +0300, Axel Thimm wrote:
Isn't 302 a (temp) relocation? Why would filtering out 302 also filter out 206?
I misused a pronoun -- I am specifically *grepping for 302*, which I did not state clearly. Did anyone look at the commands page I linked earlier?
https://fedoraproject.org/wiki/Statistics/Commands
Maybe the cleanest solution is to count downloaded bytes and divide by image size. That way you properly count ranged downloads.
Command suggestions are very welcome. I really don't have time to delve into this incredibly deeply at the moment.
On Tue, Apr 28, 2009 at 11:17:55AM -0400, Paul W. Frields wrote:
Maybe the cleanest solution is to count downloaded bytes and divide by image size. That way you properly count ranged downloads.
Command suggestions are very welcome. I really don't have time to delve into this incredibly deeply at the moment.
Could you post a small fragment of the raw logs? But it seems like all we see are the primary redirects, possibly without the ranges being noted in the logs. For better statistics, the logs would need to capture the ranges from the request headers as well.
On 04/29/2009 08:20 PM, Axel Thimm wrote:
Could you post a small fragment of the raw logs? But it seems like all we see are the primary redirects, possibly without the ranges being noted in the logs. For better statistics, the logs would need to capture the ranges from the request headers as well.
I think that information is not available, because no one (almost) downloads directly from the Fedora servers, only from the mirrors. Therefore, the real data, like range requests, etc., would only be on the mirror servers themselves, and not on the primary ones.
On Wed, Apr 29, 2009 at 09:09:27PM +0800, Basil Mohamed Gohar wrote:
I think that information is not available, because no one (almost) downloads directly from the Fedora servers, only from the mirrors. Therefore, the real data, like range requests, etc., would only be on the mirror servers themselves, and not on the primary ones.
Isn't a range request sent in the header of the HTTP request which would hit the Fedora servers before being redirected?
On 04/30/2009 02:40 AM, Axel Thimm wrote:
Isn't a range request sent in the header of the HTTP request which would hit the Fedora servers before being redirected?
Not if the range request was made after the redirect. I think that, within the same download, the redirect is only processed once, even if it is a temporary one. Therefore, unless the first request was itself a partial-content request, the client has already been redirected to the mirror server by the time it sends range requests, and the main Fedora server will not see them.
I'd draw a diagram, but I'm terrible at doing so.
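In place of a diagram, a toy Python model of the two client behaviours under discussion. It is purely illustrative: the function, its parameters, and the log strings are hypothetical, and no real download manager was examined.

```python
def simulate(connections, follow_redirect_once):
    """Return (requests seen by the redirector, requests seen by the mirror)."""
    redirector_log = []
    mirror_log = []
    if follow_redirect_once:
        # Client resolves the 302 once, then opens every ranged connection on the mirror.
        redirector_log.append("GET iso -> 302")
        for part in range(connections):
            mirror_log.append(f"GET iso part {part} -> 206")
    else:
        # Client sends every ranged connection through the redirector first.
        for part in range(connections):
            redirector_log.append(f"GET iso part {part} -> 302")
            mirror_log.append(f"GET iso part {part} -> 206")
    return redirector_log, mirror_log

once_redirector, once_mirror = simulate(20, follow_redirect_once=True)
every_redirector, every_mirror = simulate(20, follow_redirect_once=False)
print(len(once_redirector), len(once_mirror))
print(len(every_redirector), len(every_mirror))
```

In the first case the main server logs a single 302 per download; in the second it logs one per connection, which is the case where uniq-ing matters.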
On Wed, Apr 29, 2009 at 09:40:40PM +0300, Axel Thimm wrote:
Isn't a range request sent in the header of the HTTP request which would hit the Fedora servers before being redirected?
Can someone on the Infrastructure guru team help me pull some relevant lines from the logs, expurgating the IP address and any other identifying information so we're not running afoul of any privacy concerns?
On Wed, 29 Apr 2009, Paul W. Frields wrote:
Can someone on the Infrastructure guru team help me pull some relevant lines from the logs, expurgating the IP address and any other identifying information so we're not running afoul of any privacy concerns?
255.255.255.255 - - [22/Mar/2009:23:59:44 +0000] "GET /pub/fedora/linux/releases/10/Live/i686/F10-i686-Live.iso HTTP/1.1" 302 - "http://fedoraproject.org/en/get-fedora" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)"
Bam!
-Mike
On Wed, Apr 29, 2009 at 06:59:17PM -0500, Mike McGrath wrote:
255.255.255.255 - - [22/Mar/2009:23:59:44 +0000] "GET /pub/fedora/linux/releases/10/Live/i686/F10-i686-Live.iso HTTP/1.1" 302 - "http://fedoraproject.org/en/get-fedora" "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1; .NET CLR 1.1.4322)"
Bam!
Is there one that includes a range request of the kind Axel talks about? Sorry to be dense.
On Thu, 30 Apr 2009, Paul W. Frields wrote:
Is there one that includes a range request of the kind Axel talks about? Sorry to be dense.
Nope.
-Mike
On Thu, Apr 30, 2009 at 10:15 AM, Mike McGrath mmcgrath@redhat.com wrote:
Nope.
To be more specific, this is all we have in the logs for yesterday from proxy1:
[jstanley@log1 http]$ cat download.fedoraproject.org-access.log | awk '{print $9}' | sort -n | uniq -c
  23872 302
  24961 404
     31 503
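For anyone who prefers to script this, a rough Python equivalent of the pipeline above: it tallies the ninth whitespace-separated field of each line, which is the status code in this log format. The second sample line is invented; the function name is hypothetical.

```python
from collections import Counter

def status_counts(lines):
    """Tally the 9th whitespace-separated field, as awk's $9 does."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if len(fields) >= 9:
            counts[fields[8]] += 1
    return counts

sample = [
    '255.255.255.255 - - [22/Mar/2009:23:59:44 +0000] '
    '"GET /pub/fedora/linux/releases/10/Live/i686/F10-i686-Live.iso HTTP/1.1" '
    '302 - "http://fedoraproject.org/en/get-fedora" "Mozilla/4.0"',
    # Invented line, for illustration only.
    '255.255.255.255 - - [22/Mar/2009:23:59:45 +0000] '
    '"GET /missing HTTP/1.1" 404 - "-" "Mozilla/4.0"',
]
counts = status_counts(sample)
print(sorted(counts.items()))
```

To run it against a real file, pass the open file object instead of the sample list. No 200 or 206 codes appear in the real tally, which is consistent with the server only issuing redirects.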
On Thu, Apr 30, 2009 at 09:15:32AM -0500, Mike McGrath wrote:
Nope.
Which doesn't mean you can't put them there: http://httpd.apache.org/docs/2.2/mod/mod_log_config.html#formats
E.g. a %{Range}i and %{If-Range}i in a CustomLog would log the Range headers sent with the non-redirected request.
Next, one would need to examine the behaviour of popular download accelerators and check for their pattern in these fields, most prominently whether the last HTTP request logged has them or not.
You wouldn't be able to recover past information, of course (although one could extrapolate the percentage of download accelerators back in time).
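A sketch of how such a log could then be read, in Python. It assumes the combined format with one extra quoted field for the Range header appended; the format nickname, the regular expression, and both sample lines are hypothetical, and the pattern does not handle escaped quotes inside a field.

```python
import re

# Assumed Apache format (the nickname "combined_range" is hypothetical):
#   LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{Range}i\"" combined_range
LINE_RE = re.compile(
    r'^(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<bytes>\S+) "(?P<referer>[^"]*)" '
    r'"(?P<agent>[^"]*)" "(?P<range>[^"]*)"$'
)

def is_ranged(line):
    """True if the line parses and its Range field is not the "-" placeholder."""
    match = LINE_RE.match(line)
    return bool(match) and match.group("range") != "-"

# Invented sample lines in the assumed format.
plain = ('255.255.255.255 - - [30/Apr/2009:10:00:00 +0000] '
         '"GET /F10-i686-Live.iso HTTP/1.1" 302 - "-" "Mozilla/4.0" "-"')
ranged = ('255.255.255.255 - - [30/Apr/2009:10:00:01 +0000] '
          '"GET /F10-i686-Live.iso HTTP/1.1" 302 - "-" "Mozilla/4.0" '
          '"bytes=1048576-2097151"')
print(is_ranged(plain), is_ranged(ranged))
```

Counting the ranged versus plain 302s per client would give a first estimate of how many redirects come from download accelerators.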
infrastructure@lists.fedoraproject.org