Search Engine project - infrastructure - Fedora mailing-lists

15 Feb 2012


For some time, we've wanted to explore search engine options, so that you
could go to one site and search all of the Fedora sites that we run.
Examples include: Wiki (wiki search is not that great), docs.fp.o, pkgdb,
etc.
Relevant information from the last time this was discussed:
https://fedorahosted.org/fedora-infrastructure/ticket/1055
http://fedoraproject.org/wiki/Infrastructure/Search
I've been playing with various options on one of our junkXX boxes, and
seeing what works well.
- I tried Sphinx, but it seems this is really just a database fulltext
search, not a full-out search engine and crawler solution.
- I tried Xapian, but getting it crawling required a lot of hacking and
conversion from an external crawler (e.g. htdig), and htdig kept throwing
traces and dying on https sites.
- I tried mnoGoSearch, but its CGI would not work at all. It would simply
time out when I tried to go to it.
- Lastly, I tried Datapark Search, which seems like our best bet:
    - I ran into an issue where the crawler would randomly throw traces
about libcrypto. I reported the issue upstream and they put out a snapshot
release two days later that seems to have fixed the issue. So upstream is
active.
    - I played with some styling ideas, and tried to incorporate search
results into the standard Fedorahosted/people/wiki template. Needs some
work to finish this, but it's getting there.
    - The default CGI template had horrid HTML, but I worked with that and
got it reasonable (going to finish it up today or tomorrow and try to get
it passing as valid HTML5).
Out of the options I tried, Datapark seems like the best one available. It
is a fork of mnoGoSearch. It has a lot of options to customize it and
shape it into what we want it to do.
That said, I am more than open to trying other options before we decide to
move forward with Datapark. If nobody screams over the next few days, I
will work on moving forward. We need to package it, and it looks like we'll
have to package the snapshot version.
Anyway, I am just throwing this out to update everyone on my findings, and
to see if anyone has ideas for other options.
-re