For some time, we've wanted to explore search engine options, so that you could go to one site and search all of the Fedora sites that we run. Examples include: Wiki (wiki search is not that great), docs.fp.o, pkgdb, etc.
Relevant information from the last time this was discussed:
https://fedorahosted.org/fedora-infrastructure/ticket/1055
http://fedoraproject.org/wiki/Infrastructure/Search
I've been playing with various options on one of our junkXX boxes, and seeing what works well.
- I tried Sphinx, but it seems this is really just a database fulltext search, not a full-out search engine and crawler solution.
- I tried Xapian, but getting it crawling required a lot of hacking and conversion from an external crawler (e.g. htdig), and htdig kept throwing traces and dying on https sites.
- I tried mnoGoSearch, but its CGI would not work at all. It would simply time out when I tried to go to it.
- Lastly, I tried Datapark Search, which seems like our best bet:
  - I ran into an issue where the crawler would randomly throw traces about libcrypto. I reported the issue upstream and they released a snapshot release two days later that seems to have fixed it. So upstream is active.
  - I played with some styling ideas, and tried to incorporate search results into the standard Fedorahosted/people/wiki template. This needs some work to finish, but it's getting there.
  - The default CGI template had horrid HTML, but I worked with that and got it reasonable (going to finish it up today or tomorrow and try to get it passing as valid HTML5).
But out of the options I tried, this seems like the best one available. It is a fork of mnoGoSearch. It has a lot of options to customize it, and shape it into what we want it to do.
That said, I am more than open to trying other options before we decide to move forward with Datapark. If nobody screams over the next few days, I will work on moving forward. We need to package it, and it looks like we'll have to package the snapshot version.
Anyway I am just throwing this out to update everyone on my findings, and see if anyone has ideas for other options.
-re
...snip...
I'm personally happy moving ahead with Datapark. The rest of them fail for various (sometimes multiple) reasons. ;)
If it turns out that it doesn't work too well or otherwise sucks, we can just retire it. I think it's well worth pursuing for now.
Once we have a package, it should be easy to set up an instance, get a full crawl done, and see how well it works out.
Steps I see:
- Package it and get that reviewed (I'm happy to review).
- Set up a test instance
- Identify all the resources we want it to crawl and crawl them. (will need to adjust threads and such here, also may need to adjust robots.txt to allow our crawler to crawl more). Ideally after a full crawl, it can do checks pretty quickly.
- Adjust results
  * May need to look at tagging pages or resources so they are better described.
  * May need to fix it so csrf tokens aren't saved in results.
  * May need to teach it what LANG some things are and favor things from your current LANG.
  * May need to drop some results/sites out.
- Theme search page (Sounds like there's a good start/possibly done version already).
- Change search fields/add them
  * Change the wiki to call this.
  * Possibly add search field to all apps?
- Profit
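
For the robots.txt and csrf token items above, here is a rough Python sketch of the kind of checks I have in mind. It is not tied to Datapark at all; it is just something to sanity-check a crawl list with. The parameter names in CSRF_PARAMS, the crawler name "ExampleCrawler", and the example.org URLs are all made up for illustration, so adjust them to whatever our apps and the crawler really use.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# Hypothetical query parameter names that carry csrf tokens.
# Our apps may name theirs differently.
CSRF_PARAMS = {"_csrf_token", "csrf_token"}


def strip_csrf(url):
    """Return url with any query parameter listed in CSRF_PARAMS removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key.lower() not in CSRF_PARAMS
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def crawl_allowed(robots_txt, agent, url):
    """Check a robots.txt body (passed in as a string) for agent and url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Made-up robots.txt content and URLs, only to show the calls.
    robots = "User-agent: *\nDisallow: /private/\n"
    print(strip_csrf("https://example.org/app/list?_csrf_token=abc123&page=2"))
    print(crawl_allowed(robots, "ExampleCrawler", "https://example.org/private/x"))
    print(crawl_allowed(robots, "ExampleCrawler", "https://example.org/wiki/Search"))
```

Passing the robots.txt body in as a string keeps this testable without hitting the network. Against a live site you would fetch the file first and hand its text to crawl_allowed.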
kevin
infrastructure@lists.fedoraproject.org