Hey,
Recently I’ve been talking to Hannes (cc'ed) about whether Fedora would be interested in having the equivalent of http://codesearch.debian.net/¹
The project came to life as my Bachelor of Science Thesis² and aims to provide fast regular expression search over a big corpus, in this case 140 GB of source code of all software included in the Debian main distribution (as opposed to non-free or contrib, which we excluded because of licensing concerns). It is based on the work Russ Cox published, which in turn resembles the work he did on Google Code Search when he was an intern there in 2006.
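For context, a minimal sketch of the trigram idea at the heart of Cox's design (a hypothetical illustration, not the actual DCS code): files are indexed by their 3-byte substrings, a query's literal fragments are reduced to trigrams, and only files containing all of those trigrams are then grepped with the full regular expression.

```go
package main

import "fmt"

// trigrams returns the distinct 3-byte substrings of s, in order of
// first appearance. This is the unit the posting-list index is keyed
// on: files not containing every trigram of a query's literal part
// can be skipped without running the regexp over them.
func trigrams(s string) []string {
	var out []string
	seen := map[string]bool{}
	for i := 0; i+3 <= len(s); i++ {
		t := s[i : i+3]
		if !seen[t] {
			seen[t] = true
			out = append(out, t)
		}
	}
	return out
}

func main() {
	fmt.Println(trigrams("Google")) // [Goo oog ogl gle]
}
```

A query for `Google` thus only needs to grep files whose posting lists contain all four of those trigrams.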
So, what’s this discussion about?
What I’m offering is setting up/running a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).
My motivation comes from multiple places:
1) I’m fairly sure Fedora packages a slightly different set of software than Debian, so running both DCS (Debian Code Search) and FCS (Fedora Code Search) would enlarge the amount of searchable software.
2) I’m interested in my work having a positive effect on the world (or at least the open source community), and running multiple instances of Code Search reduces its dependency on any single distribution, thereby increasing its reliability and scope.
3) Last but not least, I intend to try Fedora on one of my computers to broaden my horizons. I figured getting in contact with some of you while working on this project may be a good way to set foot in the community and see whether I like it around here.
In terms of what I’d need in order to make this project a success, there are some hardware requirements (aside from, of course, time and motivation):
The in-memory index and searchable source code can be sharded across an almost arbitrary number of different computers, which is necessary to some extent, since the index of a single shard is limited to < 2 GB. At the moment, we are running 6 different index-backend VMs, each serving 1.8G of in-memory indexes and about 40G of source code (including partial indexes). In order to grep through the source quickly, the source is stored on local SSDs (as opposed to a network block storage volume, or even regular HDDs).
In addition to the actual data, we also need a web frontend to serve and combine this data, and we have one more VM which scrapes monitoring information and shows nice graphs about how the whole system behaves.
So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G of RAM (for 2G of index + 2G page cache for grepping files) and 40G SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and also an SSD for caching entire query results. The monitoring VM needs just one core and 2G of RAM.
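To illustrate the frontend/backend split described above, here is a hypothetical sketch (made-up shard functions standing in for DCS's real index-backend RPC interface): the frontend fans each query out to all index backends concurrently, then merges the per-shard results and sorts them by ranking.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Result is a single match as returned by one index backend.
type Result struct {
	Path    string  // file within the source corpus
	Ranking float64 // higher is better
}

// queryShards sends the query to every shard concurrently and merges
// the per-shard results into one list, sorted by descending ranking.
func queryShards(shards []func(query string) []Result, query string) []Result {
	var (
		mu     sync.Mutex
		wg     sync.WaitGroup
		merged []Result
	)
	for _, shard := range shards {
		wg.Add(1)
		go func(s func(string) []Result) {
			defer wg.Done()
			res := s(query) // in DCS this would be an RPC to an index-backend VM
			mu.Lock()
			merged = append(merged, res...)
			mu.Unlock()
		}(shard)
	}
	wg.Wait()
	sort.Slice(merged, func(i, j int) bool { return merged[i].Ranking > merged[j].Ranking })
	return merged
}

func main() {
	// Two fake shards standing in for index-backend VMs.
	shards := []func(string) []Result{
		func(q string) []Result { return []Result{{"pkg-a/main.c", 0.9}, {"pkg-b/util.c", 0.4}} },
		func(q string) []Result { return []Result{{"pkg-c/lib.go", 0.7}} },
	}
	for _, r := range queryShards(shards, `fopen\(`) {
		fmt.Printf("%.1f %s\n", r.Ranking, r.Path)
	}
}
```

Because each shard is independent, adding capacity is mostly a matter of adding backends and re-partitioning the corpus.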
Does that sound reasonable and feasible? I’m not sure what kind of hardware you have available for projects like this one, and currently we’re sponsored by Rackspace because Debian doesn’t have that sort of hardware easily available.
I feel like this email is long enough already, so I’ll just ask in general: what do you think? Do you need any more information? Please just ask, and keep me CC'ed, since I’m not subscribed to this list.
Thanks in advance,
Best regards,
Michael Stapelberg
¹ Note that there is a rather big redesign in progress, both architecturally and visually: https://people.debian.org/~stapelberg//2014/11/09/upcoming-debian-codesearch...
So, in case you browse around on the current version and conclude that it sucks, just wait for the update and everything will be awesome ;).
On Tue, 18 Nov 2014 13:00:22 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Hey,
Greetings.
Recently I’ve been talking to Hannes (cc'ed) about whether Fedora would be interested in having the equivalent of http://codesearch.debian.net/¹
The project came to life as my Bachelor of Science Thesis² and aims to provide fast regular expression search over a big corpus, in this case 140 GB of source code of all software included in the Debian main distribution (as opposed to non-free or contrib, which we excluded because of licensing concerns). It is based on the work Russ Cox published, which in turn resembles the work he did on Google Code Search when he was an intern there in 2006.
So, what’s this discussion about?
What I’m offering is setting up/running a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).
We have talked about a code search type application several times in the past, but never got as far as coding.
Some things to note about our infrastructure:
Everything we use must be under a free license: https://fedoraproject.org/wiki/Infrastructure_Licensing (which I don't think will be a problem, just noting it. ;)
We have a process for bringing up new applications, called "Request For Resources": https://fedoraproject.org/wiki/Request_For_Resources?rd=Infrastructure/RFR
Through this process we make sure there's more than one person that knows how the application works and can fix it, it's monitored right, etc.
My motivation comes from multiple places:
- I’m fairly sure Fedora packages a slightly different set of
software than Debian, so running both DCS (Debian Code Search) and FCS (Fedora Code Search) would enlarge the amount of searchable software.
Probably true. Also, possibly differing versions...
- I’m interested in my work having a positive effect on the world (or
at least the open source community), and running multiple instances of Code Search reduces its dependency on any single distribution, thereby increasing its reliability and scope.
Reasonable.
- Last but not least, I intend to try Fedora on one of my computers
to broaden my horizons. I figured getting in contact with some of you while working on this project may be a good way to set foot in the community and see whether I like it around here.
Welcome. :) Hope you like it
In terms of what I’d need in order to make this project a success, there are some hardware requirements (aside from, of course, time and motivation):
The in-memory index and searchable source code can be sharded across an almost arbitrary number of different computers, which is necessary to some extent, since the index of a single shard is limited to < 2 GB. At the moment, we are running 6 different index-backend VMs, each serving 1.8G of in-memory indexes and about 40G of source code (including partial indexes). In order to grep through the source quickly, the source is stored on local SSDs (as opposed to a network block storage volume, or even regular HDDs).
We currently don't have any SSDs. ;(
In addition to the actual data, we also need a web frontend to serve and combine this data, and we have one more VM which scrapes monitoring information and shows nice graphs about how the whole system behaves.
So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G of RAM (for 2G of index + 2G page cache for grepping files) and 40G SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and also an SSD for caching entire query results. The monitoring VM needs just one core and 2G of RAM.
Does that sound reasonable and feasible? I’m not sure what kind of hardware you have available for projects like this one, and currently we’re sponsored by Rackspace because Debian doesn’t have that sort of hardware easily available.
Well, we don't have any virthosts with SSDs currently, so that could be a hangup. We do have virthosts and memory/SAS disks.
I feel like this email is long enough already, so I’ll just ask in general: what do you think? Do you need any more information? Please just ask, and keep me CC'ed, since I’m not subscribed to this list.
I think before we go looking into hardware requirements, we should discuss the software? What's it written in? Is there a bunch of people who work on it, or just you?
We would want it packaged up as rpms for deployment, preferably for epel7 (to work on rhel7 hosts).
Would you be open to changes in code/architecture to meet our setup better?
Again, welcome...
kevin
Thanks for your quick reply!
On Tue, Nov 18, 2014 at 1:16 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 13:00:22 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Hey,
Greetings.
Recently I’ve been talking to Hannes (cc'ed) about whether Fedora would be interested in having the equivalent of http://codesearch.debian.net/¹
The project came to life as my Bachelor of Science Thesis² and aims to provide fast regular expression search over a big corpus, in this case 140 GB of source code of all software included in the Debian main distribution (as opposed to non-free or contrib, which we excluded because of licensing concerns). It is based on the work Russ Cox published, which in turn resembles the work he did on Google Code Search when he was an intern there in 2006.
So, what’s this discussion about?
What I’m offering is setting up/running a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).
We have talked about a code search type application several times in the past, but never got as far as coding.
Some things to note about our infrastructure:
Everything we use must be under a free license: https://fedoraproject.org/wiki/Infrastructure_Licensing (which I don't think will be a problem, just noting it. ;)
Yep, that’s certainly the case. See https://github.com/Debian/dcs/blob/master/LICENSE
We have a process for bringing up new applications, called "Request For Resources": https://fedoraproject.org/wiki/Request_For_Resources?rd=Infrastructure/RFR
Through this process we make sure there's more than one person that knows how the application works and can fix it, it's monitored right, etc.
I’ve only had a very quick glance so far, but the general idea sounds reasonable. I’m not sure who’d want to work with me on the project, but perhaps we can find someone who’s interested.
My motivation comes from multiple places:
- I’m fairly sure Fedora packages a slightly different set of
software than Debian, so running both DCS (Debian Code Search) and FCS (Fedora Code Search) would enlarge the amount of searchable software.
Probably true. Also, possibly differing versions...
- I’m interested in my work having a positive effect on the world (or
at least the open source community), and running multiple instances of Code Search reduces its dependency on any single distribution, thereby increasing its reliability and scope.
Reasonable.
- Last but not least, I intend to try Fedora on one of my computers
to broaden my horizons. I figured getting in contact with some of you while working on this project may be a good way to set foot in the community and see whether I like it around here.
Welcome. :) Hope you like it
In terms of what I’d need in order to make this project a success, there are some hardware requirements (aside from, of course, time and motivation):
The in-memory index and searchable source code can be sharded across an almost arbitrary number of different computers, which is necessary to some extent, since the index of a single shard is limited to < 2 GB. At the moment, we are running 6 different index-backend VMs, each serving 1.8G of in-memory indexes and about 40G of source code (including partial indexes). In order to grep through the source quickly, the source is stored on local SSDs (as opposed to a network block storage volume, or even regular HDDs).
We currently don't have any SSDs. ;(
In addition to the actual data, we also need a web frontend to serve and combine this data, and we have one more VM which scrapes monitoring information and shows nice graphs about how the whole system behaves.
So, in total, we run 8 VMs, of which 6 are equipped with 4 cores, 4G of RAM (for 2G of index + 2G page cache for grepping files) and 40G SSD volumes each. The web frontend uses 4 cores and 2G of RAM, and also an SSD for caching entire query results. The monitoring VM needs just one core and 2G of RAM.
Does that sound reasonable and feasible? I’m not sure what kind of hardware you have available for projects like this one, and currently we’re sponsored by Rackspace because Debian doesn’t have that sort of hardware easily available.
Well, we don't have any virthosts with SSDs currently, so that could be a hangup. We do have virthosts and memory/SAS disks.
That’s a bummer. How many IOPS do your SAS disks provide? Is there any chance that you could get some SSDs in the near- to mid-term future?
I feel like this email is long enough already, so I’ll just ask in general: what do you think? Do you need any more information? Please just ask, and keep me CC'ed, since I’m not subscribed to this list.
I think before we go looking into hardware requirements, we should discuss the software? What's it written in? Is there a bunch of people who work on it, or just you?
It’s written in Go, and mostly I’m working on it, with a few random contributions from other people from time to time.
We would want it packaged up as rpms for deployment, preferably for epel7 (to work on rhel7 hosts).
Yeah, I’ve heard about that, and it shouldn’t be a problem, I think. I assume the Go compiler is in EPEL7.
Would you be open to changes in code/architecture to meet our setup better?
Of course, yeah.
On Tue, 18 Nov 2014 13:24:02 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Thanks for your quick reply!
No problem. ;)
...snip...
I’ve only had a very quick glance so far, but the general idea sounds reasonable. I’m not sure who’d want to work with me on the project, but perhaps we can find someone who’s interested.
Yeah, we just want to make sure at least a few people know how to debug and fix things. Then it's not all on one person's shoulders.
...snip...
That’s a bummer. How many IOPS do your SAS disks provide? Is there any chance that you could get some SSDs in the near- to mid-term future?
I'd have to test them. ;)
It's possible, but not sure how possible. We are putting together budgets for next year now, but not sure if we would be able to get SSDs for this.
I'm also not sure how big things are, i.e., all the unpacked package sources. We could find out, but it will take some investigation.
I think before we go looking into hardware requirements, we should discuss the software? What's it written in? Is there a bunch of people who work on it, or just you?
It’s written in Go, and mostly I’m working on it, with a few random contributions from other people from time to time.
We are very heavily a Python shop here. I don't know if any of our applications folks have really done much with Go, but I'm sure they will chime in if they have. ;)
We would want it packaged up as rpms for deployment, preferably for epel7 (to work on rhel7 hosts).
Yeah, I’ve heard about that, and it shouldn’t be a problem, I think. I assume the Go compiler is in EPEL7.
Yeah, I think so.
Would you be open to changes in code/architecture to meet our setup better?
Of course, yeah.
There would also be questions around processing the source... in Debian, do you unpack all the source debs? or?
Do you follow just some branches of packages? Or all of them? Or just some arches?
Is this just upstream source? or the source + any local patches?
Lots of questions... hope I'm not asking too many dumb ones. ;)
kevin
On Tue, Nov 18, 2014 at 2:12 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 13:24:02 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Thanks for your quick reply!
No problem. ;)
...snip...
I’ve only had a very quick glance so far, but the general idea sounds reasonable. I’m not sure who’d want to work with me on the project, but perhaps we can find someone who’s interested.
Yeah, we just want to make sure at least a few people know how to debug and fix things. Then it's not all on one person's shoulders.
...snip...
That’s a bummer. How many IOPS do your SAS disks provide? Is there any chance that you could get some SSDs in the near- to mid-term future?
I'd have to test them. ;)
You could install fio(1) and run it with this config file:
[global]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randread
direct=1
blocksize=4k
size=500m
ioengine=libaio
iodepth=1
#write_bw_log
#write_lat_log
numjobs=1

[sda]
filename=/dev/xvdb
Then, compare the results to http://p.nnev.de/4763 — if they are roughly comparable, that might work.
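For concreteness, a hypothetical way to run this (the job file name is arbitrary, the fio package must be installed, and filename= must point at the disk you actually want to measure):

```shell
# Write the fio job file from the message above.
cat > randread.fio <<'EOF'
[global]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randread
direct=1
blocksize=4k
size=500m
ioengine=libaio
iodepth=1
numjobs=1

[sda]
filename=/dev/xvdb
EOF

# Then, on the machine under test, run it and compare the reported
# IOPS/latency figures (commented out here; needs fio and the device):
# fio randread.fio
```

The randread + direct=1 combination is what makes this a reasonable proxy for the index backends' grep workload: lots of small, uncached, random reads.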
It's possible, but not sure how possible. We are putting together budgets for next year now, but not sure if we would be able to get SSDs for this.
I'm also not sure how big things are, i.e., all the unpacked package sources. We could find out, but it will take some investigation.
Agreed.
I think before we go looking into hardware requirements, we should discuss the software? What's it written in? Is there a bunch of people who work on it, or just you?
It’s written in Go, and mostly I’m working on it, with a few random contributions from other people from time to time.
We are very heavily a Python shop here. I don't know if any of our applications folks have really done much with Go, but I'm sure they will chime in if they have. ;)
I see. Go is pretty easy to pick up, and the vast majority of people in my filter bubble who gave it a try found it pleasant to work with.
We would want it packaged up as rpms for deployment, preferably for epel7 (to work on rhel7 hosts).
Yeah, I’ve heard about that, and it shouldn’t be a problem, I think. I assume the Go compiler is in EPEL7.
Yeah, I think so.
Would you be open to changes in code/architecture to meet our setup better?
Of course, yeah.
There would also be questions around processing the source... in Debian, do you unpack all the source debs? or?
In Debian we operate on all source packages of the main distribution’s unstable (“sid”) version.
Do you follow just some branches of packages? Or all of them?
We only use sid, mostly because it was simple to implement Code Search with just a single corpus. I’ve been asked once whether I could also provide other suites, but that’s a project for the future, if at all necessary. The rationale so far is that we’re mostly interested in code that’s in the development version, because code that is _only_ in the older versions is likely obsolete in some way.
Or just some arches?
Since we’re talking about source packages, they are not architecture-specific at all :).
Is this just upstream source? or the source + any local patches?
We’re unpacking the upstream source and then applying the Debian diff so that we have the package indexed the same way our build daemons see them. So, yes, this includes applying local patches.
Lots of questions... hope I'm not asking too many dumb ones. ;)
Perfectly reasonable questions so far :). I hope my reply makes things clearer, but don’t hesitate to ask if you have further questions.
On Tue, 18 Nov 2014 23:58:33 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
On Tue, Nov 18, 2014 at 2:12 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 13:24:02 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Thanks for your quick reply!
No problem. ;)
...snip...
I’ve only had a very quick glance so far, but the general idea sounds reasonable. I’m not sure who’d want to work with me on the project, but perhaps we can find someone who’s interested.
Yeah, we just want to make sure at least a few people know how to debug and fix things. Then it's not all on one person's shoulders.
...snip...
That’s a bummer. How many IOPS do your SAS disks provide? Is there any chance that you could get some SSDs in the near- to mid-term future?
I'd have to test them. ;)
You could install fio(1) and run it with this config file:
[global]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randread
direct=1
blocksize=4k
size=500m
ioengine=libaio
iodepth=1
#write_bw_log
#write_lat_log
numjobs=1

[sda]
filename=/dev/xvdb
Then, compare the results to http://p.nnev.de/4763 — if they are roughly comparable, that might work.
I can try... will see if I can find a host that's not in production right now, don't want to cause problems. ;)
...snip...
There would also be questions around processing the source... in Debian, do you unpack all the source debs? or?
In Debian we operate on all source packages of the main distribution’s unstable (“sid”) version.
ok.
Do you follow just some branches of packages? Or all of them?
We only use sid, mostly because it was simple to implement Code Search with just a single corpus. I’ve been asked once whether I could also provide other suites, but that’s a project for the future, if at all necessary. The rationale so far is that we’re mostly interested in code that’s in the development version, because code that is _only_ in the older versions is likely obsolete in some way.
Fair enough.
Or just some arches?
Since we’re talking about source packages, they are not architecture-specific at all :).
Well, one fun thing is that rpm srpms sorta can be. ;) Depending on what arch they were generated on, they could be different if there are different arch sections in the spec.
Is this just upstream source? or the source + any local patches?
We’re unpacking the upstream source and then applying the Debian diff so that we have the package indexed the same way our build daemons see them. So, yes, this includes applying local patches.
ok. Is there a distinction between the upstream part and the patched part? Or is it all just source to index at that point...
Lots of questions... hope I'm not asking too many dumb ones. ;)
Perfectly reasonable questions so far :). I hope my reply makes things clearer, but don’t hesitate to ask if you have further questions.
Sure. thanks.
kevin
On Wed, Nov 19, 2014 at 2:29 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 23:58:33 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
On Tue, Nov 18, 2014 at 2:12 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 13:24:02 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Thanks for your quick reply!
No problem. ;)
...snip...
I’ve only had a very quick glance so far, but the general idea sounds reasonable. I’m not sure who’d want to work with me on the project, but perhaps we can find someone who’s interested.
Yeah, we just want to make sure at least a few people know how to debug and fix things. Then it's not all on one person's shoulders.
...snip...
That’s a bummer. How many IOPS do your SAS disks provide? Is there any chance that you could get some SSDs in the near- to mid-term future?
I'd have to test them. ;)
You could install fio(1) and run it with this config file:
[global]
bssplit=512/10:1k/5:2k/5:4k/60:8k/2:16k/4:32k/4:64k/10
rw=randread
direct=1
blocksize=4k
size=500m
ioengine=libaio
iodepth=1
#write_bw_log
#write_lat_log
numjobs=1

[sda]
filename=/dev/xvdb
Then, compare the results to http://p.nnev.de/4763 — if they are roughly comparable, that might work.
I can try... will see if I can find a host that's not in production right now, don't want to cause problems. ;)
...snip...
There would also be questions around processing the source... in Debian, do you unpack all the source debs? or?
In Debian we operate on all source packages of the main distribution’s unstable (“sid”) version.
ok.
Do you follow just some branches of packages? Or all of them?
We only use sid, mostly because it was simple to implement Code Search with just a single corpus. I’ve been asked once whether I could also provide other suites, but that’s a project for the future, if at all necessary. The rationale so far is that we’re mostly interested in code that’s in the development version, because code that is _only_ in the older versions is likely obsolete in some way.
Fair enough.
Or just some arches?
Since we’re talking about source packages, they are not architecture-specific at all :).
Well, one fun thing is that rpm srpms sorta can be. ;) Depending on what arch they were generated on, they could be different if there are different arch sections in the spec.
Interesting. Do you have a pointer to an example (and/or more documentation) for this happening in the wild?
Is this just upstream source? or the source + any local patches?
We’re unpacking the upstream source and then applying the Debian diff so that we have the package indexed the same way our build daemons see them. So, yes, this includes applying local patches.
ok. Is there a distinction between the upstream part and the patched part? Or is it all just source to index at that point...
There is no distinction. Applying the patches happens before we index the source.
Lots of questions... hope I'm not asking too many dumb ones. ;)
Perfectly reasonable questions so far :). I hope my reply makes things clearer, but don’t hesitate to ask if you have further questions.
Sure. thanks.
kevin
* Michael Stapelberg:
Interesting. Do you have a pointer to an example (and/or more documentation) for this happening in the wild?
glibc has architecture-specific build dependencies.
However, many packages use non-trivial things for unpacking the sources (the %prep stage), and you have to install the build dependencies before you can complete the %prep stage. I have written a mock wrapper to do this, but it's not completely reliable due to dangling symbolic links:
On Sat, 22 Nov 2014 21:51:37 +0100 Florian Weimer fw@deneb.enyo.de wrote:
- Michael Stapelberg:
Interesting. Do you have a pointer to an example (and/or more documentation) for this happening in the wild?
glibc has architecture-specific build dependencies.
However, many packages use non-trivial things for unpacking the sources (the %prep stage), and you have to install the build dependencies before you can complete the %prep stage. I have written a mock wrapper to do this, but it's not completely reliable due to dangling symbolic links:
A classic case is a package where some optional BuildRequires isn't available on some arch, like armv7. So, the spec conditionalizes this BuildRequires to not be there on armv7. Now if you make the src.rpm on an armv7 platform that dep won't be there, but if you make it on x86 it will.
kevin
On 11/19/2014 05:58 PM, Michael Stapelberg wrote:
On Tue, Nov 18, 2014 at 2:12 PM, Kevin Fenzi kevin@scrye.com wrote:
On Tue, 18 Nov 2014 13:24:02 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
I think before we go looking into hardware requirements, we should discuss the software? What's it written in? Is there a bunch of people who work on it, or just you?
It’s written in Go, and mostly I’m working on it, with a few random contributions from other people from time to time.
We are very heavily a python shop here, I don't know if any of our applications folks have really done much with go, but I'm sure they will chime in if they have. ;)
I see. Go is pretty easy to pick up, and the vast majority of people in my filter bubble who gave it a try found it pleasant to work with.
The various Python folks I know that have done work in Go have all considered it a pleasant language for writing network services.
Another point worth noting is that because Go is precompiled (a la C/C++), working with deployed systems written in it should be more like working with the system components written in C/C++ than with the Python web services.
There's also the fact that given Docker & Kubernetes are both written in Go, it seems likely that it will eventually make an appearance somewhere in Fedora's infrastructure, even independent of this proposal.
Cheers,
Nick.
On Tue, Nov 18, 2014 at 01:00:22PM -0800, Michael Stapelberg wrote:
What I’m offering is setting up/running a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).
Cool! Without getting into the resourcing issues, in theory this seems really great. There have been a number of times when I've wanted to know if and where some bit of code exists in Fedora, and I've used the Debian search and then tried to map to our packages, which is inexact and time-consuming.
On Tue, 18 Nov 2014 13:00:22 -0800 Michael Stapelberg michael+fedora@stapelberg.ch wrote:
Hey,
Recently I’ve been talking to Hannes (cc'ed) about whether Fedora would be interested in having the equivalent of http://codesearch.debian.net/¹
The project came to life as my Bachelor of Science Thesis² and aims to provide fast regular expression search over a big corpus, in this case 140 GB of source code of all software included in the Debian main distribution (as opposed to non-free or contrib, which we excluded because of licensing concerns). It is based on the work Russ Cox published, which in turn resembles the work he did on Google Code Search when he was an intern there in 2006.
So, what’s this discussion about?
What I’m offering is setting up/running a public version of Code Search for Fedora. It needs to be public because I want the open source community as a whole to profit from it, and also I’m told you have somewhat comparable tools internally anyway :).
Thanks for starting the conversation.
<snip>
I feel like this email is long enough already, so I’ll just ask in general: what do you think? Do you need any more information? Please just ask, and keep me CC'ed, since I’m not subscribed to this list.
I'm a little late to the discussion, but I think that code search sounds like a cool idea if we can find the human/machine resources to do it. I've only glanced through all the docs so far but I have a couple of concerns (some of which have already been raised). I hope this doesn't sound like I'm completely against the idea, though - I wouldn't have spent the time to go through your thesis and respond to the discussion if that were the case.
Tim
Single points of (human) failure
--------------------------------
Kevin already brought this up but I'm a little worried about supporting a large, complex application like this with only one person familiar with it and few people around familiar with the language that its core is written in. Speaking as someone who has been the single point of failure in an application deployment before, I'd strongly suggest finding someone to help. Finding out that something went down when you were/are on vacation and there's nobody else that can fix it is not fun :)
Node Failure Behavior
---------------------
I'm not clear from the docs I went through how node failure is handled. I don't see any explicit mention of it, so I'm assuming that the index shards all have single copies of the index. How does the system handle failure in one of the index nodes? My first guess is that there would be missing results from queries but I haven't gotten into the actual code yet.
Code resiliency
---------------
I think that this is lessened a bit since the code has been running in production but it sounds like the base code for indexing was meant somewhat as a "proof of concept" or small-scale deployment. It sounds like you've made quite a few enhancements on top of the released google codesearch but tried to leave the core code alone as much as possible. Have you seen many problems in the index nodes for DCS?
Indexing
--------
Have there been any complaints/comments about your chosen update delta of 3 days? You assert that 3 days is a good balance between indexing load and keeping fresh code in the index but I don't see a justification in your thesis. How did you come to the conclusion that 3 days was an optimal choice?
Is the indexing process automated or does it need to be kicked off by a human? The way it's described in your thesis ("after verifying that no human mistake was made by confirming that Debian Code Search still delivers results ..."), it sounds somewhat manual.
Is there a downtime when updating the inverted trigram index? If so, how long is that? Does it happen for every re-index? It sounds like the resources for the index nodes would be almost 100% utilized after indexing, leaving no additional resources to handle the re-indexing load. Am I misunderstanding something in the architecture?
Ranking
-------
Is the result ranking code compatible with non-Debian sources? We don't have an equivalent to popcon, and I assume that the reverse dependency factor would need different code for Fedora than for Debian. Or is this part of the modifications you were planning for already?
infrastructure@lists.fedoraproject.org