Hello,
Today desktop search tools are killer apps, and the hype about Beagle and Spotlight is really big. I think that Fedora needs a search tool to be competitive.
As you might know, Beagle uses a C# port of Lucene (http://lucene.apache.org/). Lucene is an open source search engine written in Java. It runs with gcj and has also been ported to other languages such as C++ and Python. There is already a search tool based on Lucene called regain (http://regain.sourceforge.net/). However, I don't think it runs on a free Java implementation, and it has no live indexing and no application integration, for example with Evolution.
It shouldn't be too hard to write a tool based on Lucene or a port of it. The hard parts are live indexing (gamin?) and application integration.
It would be great to see a desktop search tool in Fedora.
Marco Meyerhofer marco_meyerhofer@freesurf.ch wrote:
Today desktop search tools are killer apps, and the hype about Beagle and Spotlight is really big. I think that Fedora needs a search tool to be competitive.
If you say so...
[...]
It shouldn't be too hard to write a tool based on Lucene or a port of it. The hard parts are live indexing (gamin?) and application integration.
It would be great to see a desktop search tool in Fedora.
Congratulations! You've just found your very own open source project to work on.
On Sun, 2005-06-26 at 16:52 -0400, Horst von Brand wrote:
Marco Meyerhofer marco_meyerhofer@freesurf.ch wrote:
Today desktop search tools are killer apps, and the hype about Beagle and Spotlight is really big. I think that Fedora needs a search tool to be competitive.
They (meaning engineers at Red Hat) are discussing this. The solution won't use Lucene, as Lucene treats all file content as equal - i.e., it doesn't know about headings being different from body text and so on.
Mike
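To make the heading-versus-body distinction concrete, here is a minimal sketch of field-weighted scoring, in which a match in a heading counts for more than a match in body text. The field names, the weights and the scoring formula are assumptions chosen for illustration; they are not taken from Lucene, Xapian or any other tool mentioned in this thread.

```python
import re
from collections import Counter

# Hypothetical per-field weights: a hit in a heading counts three times
# as much as a hit in body text. The numbers are illustrative only.
FIELD_WEIGHTS = {"heading": 3.0, "body": 1.0}


def tokenize(text):
    """Lowercase the text and split it into alphanumeric words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def score(document, query):
    """Sum the weighted term frequencies of the query words over all fields."""
    terms = tokenize(query)
    total = 0.0
    for field, text in document.items():
        counts = Counter(tokenize(text))
        weight = FIELD_WEIGHTS.get(field, 1.0)
        total += weight * sum(counts[term] for term in terms)
    return total


# Two made-up documents: the query words appear in the heading of doc_a
# but only in the body of doc_b.
doc_a = {"heading": "Desktop search", "body": "Notes on indexing mail."}
doc_b = {"heading": "Meeting notes", "body": "We discussed desktop search briefly."}
```

With these weights, `score(doc_a, "desktop search")` comes out higher than `score(doc_b, "desktop search")`, although both documents contain each query word exactly once.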
Mike MacCana wrote:
They (meaning engineers at Red Hat) are discussing this. The solution won't use Lucene, as Lucene treats all file content as equal - i.e., it doesn't know about headings being different from body text and so on.
Mike
Also, Lucene suffers from the Java UCS-2 scandal: they chose a 16-bit character encoding which is good for Japanese, but bulks up European languages by a factor of two and doesn't support enough characters to do a good job with Chinese.
Because of this, Lucene loses a factor of two in performance compared to C++ competitors such as Xapian, which is a minus for those who care about performance on computers that aren't monster servers with 8 gigs of RAM and Ultra 320 disks. (Funnily enough, we're not all that happy with Lucene performance on such a machine... But we've got a lot of text...)
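The size argument above can be checked with a few lines of Python. This sketch only compares encoded byte counts for UTF-8 and UTF-16; it says nothing about search performance, and the sample strings are arbitrary choices.

```python
def encoded_sizes(text):
    """Return the byte length of text in UTF-8 and in UTF-16 (without BOM)."""
    return {
        "utf-8": len(text.encode("utf-8")),
        "utf-16": len(text.encode("utf-16-le")),
    }


english = "desktop search"   # 14 ASCII characters
german = "Größe"             # 5 characters, two of them non-ASCII
japanese = "検索"             # 2 characters inside the Basic Multilingual Plane
rare_han = "\U00020000"      # 1 Han character outside the Basic Multilingual Plane

for sample in (english, german, japanese, rare_han):
    print(repr(sample), encoded_sizes(sample))
```

ASCII-only text doubles in size under the 16-bit encoding, the Japanese sample is smaller than in UTF-8, and the character outside the Basic Multilingual Plane does not fit in a single 16-bit unit (UTF-16 stores it as a surrogate pair).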
On Mon, Jun 27, 2005 at 03:58:29PM +1000, Mike MacCana wrote:
They (meaning engineers at Red Hat) are discussing this. The solution won't use Lucene, as Lucene treats all file content as equal - i.e., it doesn't know about headings being different from body text and so on.
One possibility is the Muscat engine - that's open source and could probably do the job well. It's a little weak on revoking content from the index without a rebuild.
On Tue, 2005-06-28 at 12:18 -0400, Alan Cox wrote:
On Mon, Jun 27, 2005 at 03:58:29PM +1000, Mike MacCana wrote:
They (meaning engineers at Red Hat) are discussing this. The solution won't use Lucene, as Lucene treats all file content as equal - i.e., it doesn't know about headings being different from body text and so on.
One possibility is the Muscat engine - that's open source and could probably do the job well. It's a little weak on revoking content from the index without a rebuild.
I think it's the Xapian engine you're talking about? http://xapian.org/ GPL, written in C++. Apparently it's being tried out for use as the Gmane search engine.
/Per
Not that it's terribly fast, but it is pretty easy to use and it's written in Python:
http://svn.zope.org/Zope3/trunk/src/zope/app/catalog/
The indexing code that this employs has been in use within Zope for many years.
Interesting code examples include:
http://svn.zope.org/Zope3/trunk/src/zope/index/text/textindex.txt?rev=28610&...
http://svn.zope.org/Zope3/trunk/src/zope/index/text/tests/mhindex.py?rev=297...
- C
On Tue, Jun 28, 2005 at 06:29:03PM -0700, Per Bjornsson wrote:
I think it's the Xapian engine you're talking about?
Yes
http://xapian.org/ GPL, written in C++. Apparently it's being tried out for use as the Gmane search engine.
Cool
Alan Cox <alan <at> redhat.com> writes:
One possibility is the Muscat engine - that's open source and could probably do the job well. It's a little weak on revoking content from the index without a rebuild.
Can you elaborate on "weak"? Nobody's mentioned this to me (the main Xapian developer), and it's really hard to address problems I don't know about...
Xapian's certainly better than some - for example swish-e can only remove documents by rebuilding the index (unless you build it with the experimental --enable-incremental option). Not meaning to pick on swish-e particularly - it was just the first example to come to mind.
Cheers, Olly
On Sat, Jul 02, 2005 at 12:57:18AM +0000, Olly Betts wrote:
Xapian's certainly better than some - for example swish-e can only remove documents by rebuilding the index (unless you build it with the experimental --enable-incremental option). Not meaning to pick on swish-e particularly - it was just the first example to come to mind.
Xapian didn't seem to be returning disk space without a rebuild. Not sure if that is an index property or not?
Alan Cox alan@redhat.com writes:
Xapian didn't seem to be returning disk space without a rebuild. Not sure if that is an index property or not?
Ah yes, that's a feature of the current Btree manager. The disk space isn't "leaked", so if you add more documents it'll get reused, but even if you delete all the documents the index size won't decrease!
However, a full rebuild isn't required to recover the space. Instead you can run the index through "quartzcompact" which will reduce it to minimal size. Because "quartzcompact" works on the inverted file structure, it's much faster than a full rebuild would be (it also avoids having to reread and reparse all the documents). For example, the approx. 28 million document Gmane index takes about 45 minutes to compact. Rebuilding that takes more like 45 *hours*.
I'm currently working on a new-and-improved backend, having learned a lot from watching and tinkering with the current one. Currently it's using the same Btree manager, but I'm planning to replace that and I'm intending to allow the file size to shrink in the new one.
Cheers, Olly
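The space behaviour described in this message (freed space is reused but the file never shrinks, until a compaction pass rewrites it) can be illustrated with a toy model. This is a sketch under simplifying assumptions: the `BlockFile` class, its free list and its `compact` method are hypothetical and do not reflect how Xapian's Btree manager or "quartzcompact" are actually implemented.

```python
class BlockFile:
    """Toy model of an index file made of fixed-size blocks."""

    def __init__(self):
        self.blocks = []   # one entry per block in the file; None means free
        self.free = []     # positions of blocks that can be reused

    def add(self, doc):
        """Store doc, reusing a free block before growing the file."""
        if self.free:
            slot = self.free.pop()
            self.blocks[slot] = doc
        else:
            slot = len(self.blocks)
            self.blocks.append(doc)
        return slot

    def delete(self, slot):
        """Mark a block as free; the file itself does not get shorter."""
        self.blocks[slot] = None
        self.free.append(slot)

    def size(self):
        """File size measured in blocks, free blocks included."""
        return len(self.blocks)

    def compact(self):
        """Copy the live blocks into a new, minimal file."""
        fresh = BlockFile()
        for doc in self.blocks:
            if doc is not None:
                fresh.add(doc)
        return fresh


index = BlockFile()
slots = [index.add(f"doc{i}") for i in range(4)]
index.delete(slots[1])
index.delete(slots[2])
size_after_delete = index.size()   # deletions leave the file size unchanged
index.add("doc4")
size_after_reuse = index.size()    # the new document went into a freed block
compacted = index.compact()        # only the live blocks are copied
```

Note that `compact` copies stored blocks and never goes back to the source documents, which is the same reason the message gives for compaction being much cheaper than a full rebuild.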
On 6/26/05, Marco Meyerhofer marco_meyerhofer@freesurf.ch wrote:
It would be great to see a desktop search tool in Fedora.
You might want to touch base with Mr. Tromey, who has commented on this issue in his blog: http://www.peakpeak.com/~tromey/blog/2005/06/22#regain
-jef" as seen on http://fedora.linux.duke.edu/fedorapeople/ "spaleta
Seems to me the reason why gcj is a problem is the UI and not the underlying engine. It amazes me how much effort is wasted in producing substandard UIs to otherwise excellent software. I'm biased because of what I'm used to (and not used to), but IMHO developing web based UIs is much, much easier than the same thing in C++ or Swing.
Traditionally web based UIs were limited, but nowadays most of the limitations have been removed thanks to CSS styling and 'AJAX' techniques (example: gmail).
How about writing this software as a process serving HTTP requests to a local web browser?
Joe.
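A minimal sketch of that idea, using only the Python standard library: a small process that answers search queries over HTTP for a browser on the same machine. The `/search` path, the `q` parameter, the port and the hard-coded `INDEX` contents are all assumptions made up for this example; a real tool would query an actual index instead.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer
from urllib.parse import parse_qs, urlparse

# Hypothetical in-memory "index": word -> list of file paths.
INDEX = {
    "lucene": ["/home/user/notes/search.txt"],
    "fedora": ["/home/user/mail/devel.mbox", "/home/user/notes/search.txt"],
}


def search(query):
    """Return the sorted paths that contain every word of the query."""
    words = query.lower().split()
    if not words:
        return []
    hits = set(INDEX.get(words[0], []))
    for word in words[1:]:
        hits &= set(INDEX.get(word, []))
    return sorted(hits)


class SearchHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        """Answer GET /search?q=... with a JSON list of matching paths."""
        url = urlparse(self.path)
        if url.path != "/search":
            self.send_error(404)
            return
        query = parse_qs(url.query).get("q", [""])[0]
        payload = {"query": query, "results": search(query)}
        body = json.dumps(payload).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port=8080):
    """Listen on the loopback interface only, so the index stays local."""
    HTTPServer(("127.0.0.1", port), SearchHandler).serve_forever()
```

Calling `serve()` and opening `http://127.0.0.1:8080/search?q=fedora` in a browser would return the matching paths as JSON. Binding to 127.0.0.1 rather than all interfaces is a deliberate choice here, since a desktop index typically contains private files.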
On 6/26/05, Jeff Spaleta jspaleta@gmail.com wrote:
On 6/26/05, Marco Meyerhofer marco_meyerhofer@freesurf.ch wrote:
It would be great to see a desktop search tool in Fedora.
You might want to touch base with Mr. Tromey, who has commented on this issue in his blog: http://www.peakpeak.com/~tromey/blog/2005/06/22#regain
-jef" as seen on http://fedora.linux.duke.edu/fedorapeople/ "spaleta
Joe Desbonnet wrote:
Traditionally web based UIs were limited, but nowadays most of the limitations have been removed thanks to CSS styling and 'AJAX' techniques (example: gmail).
Yeah, until you actually try it.
When I do JavaScript projects I spend about 30% of the time getting the app working and then the other 70% worrying about browser compatibility and workarounds for funkiness in the browser. For instance, if you're doing a drag-and-drop interface in Mozilla, you don't get notification when the cursor leaves the browser window, and mouse move events don't tell you if the buttons are down, so you can't really 'do the right thing' in these cases. The best thing I've figured out is to extrapolate the motion of the mouse, and assume that the mouse went out of the window if it was heading for the edge of the window and we don't see any mouse events after a time delay.
Like GUI applications, it's easy to make an AJAX application work 80% of the time (like the GUI crapplets that come with Fedora) but getting the right behavior the rest of the time takes a big investment of time and energy, something most open source authors, never mind commercial entities, aren't willing to do.
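The extrapolation workaround described above can be sketched as a small pure function. It is written in Python here rather than JavaScript to keep the logic easy to test outside a browser; the function name, the `(time, x, y)` sample format and the 0.3 second timeout are assumptions for this example, not details from the original message.

```python
def probably_left_window(samples, now, width, height, timeout=0.3):
    """Guess whether the pointer has left a width x height window.

    samples is a list of (t, x, y) mouse-move events, oldest first,
    with times in seconds and coordinates in pixels.
    """
    if len(samples) < 2:
        return False
    (t0, x0, y0), (t1, x1, y1) = samples[-2], samples[-1]
    silence = now - t1
    # Events are still arriving (or timestamps are unusable): assume inside.
    if silence < timeout or t1 <= t0:
        return False
    # Velocity over the last two events, in pixels per second.
    vx = (x1 - x0) / (t1 - t0)
    vy = (y1 - y0) / (t1 - t0)
    # Linear extrapolation of the pointer position across the silent period.
    px = x1 + vx * silence
    py = y1 + vy * silence
    return not (0 <= px < width and 0 <= py < height)


# Pointer racing towards the right edge of an 800x600 window, then silence.
heading_out = [(0.00, 700, 300), (0.05, 780, 300)]
# Pointer resting in the middle of the window.
resting = [(0.00, 400, 300), (0.05, 400, 300)]
```

`probably_left_window(heading_out, now=0.5, width=800, height=600)` reports that the pointer has probably left, while the same call with `resting` does not. It remains a guess: a pointer that stops just short of the edge is indistinguishable from one that crossed it.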