Content indexing of uploaded files

hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Content indexing of uploaded files

Postby hansvd » Tue Dec 09, 2008 8:19 pm

I'd like to contribute to the development of a simple "deep search" in uploaded files.

The concept is fairly simple:
- convert the content of binary files using catdoc, pdftotext, xls2csv and others
- load the extracted plain text into a database table, storing ACLs etc. alongside it -- similar to the cache_search_result() currently being used for comments, filename and filetype
- have the search function also go over this table when performing a search
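The conversion step above could be sketched roughly as follows. This is only an illustration in Python, not code from the prototype: the extension mapping and the function names `build_command` and `extract_text` are made up for the example, and it assumes catdoc, xls2csv and pdftotext are installed and on the PATH.

```python
import subprocess
from pathlib import Path

# Hypothetical mapping from file extension to the converter command line.
# catdoc and xls2csv write to stdout by default; pdftotext needs "-" as
# the output file name to do the same.
CONVERTERS = {
    ".doc": lambda p: ["catdoc", p],
    ".xls": lambda p: ["xls2csv", p],
    ".pdf": lambda p: ["pdftotext", p, "-"],
}


def build_command(path):
    """Return the converter argv for this file, or None if unsupported."""
    make = CONVERTERS.get(Path(path).suffix.lower())
    return make(str(path)) if make else None


def extract_text(path):
    """Run the converter and return its output as text ('' on any failure)."""
    cmd = build_command(path)
    if cmd is None:
        return ""
    try:
        result = subprocess.run(cmd, capture_output=True, timeout=60, check=True)
    except (OSError, subprocess.SubprocessError):
        # Converter missing, timed out, or exited with an error.
        return ""
    return result.stdout.decode("utf-8", errors="replace")
```

Returning an empty string on failure keeps an upload from breaking when a converter is missing; whether that is the right policy is a design choice.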

Anyone willing to help?

Ideas?
mschering
Site Admin
Posts: 8332
Joined: Tue Apr 20, 2004 1:06 pm
Location: The Netherlands - Den Bosch

Re: Content indexing of uploaded files

Postby mschering » Wed Dec 10, 2008 7:49 am

Yes, I would implement a file search separate from the Group-Office global search function for performance reasons. The index will have to hold a lot of data. You can also tailor the search function to files, for example by searching on file types and dates.
Best regards,

Merijn Schering
Intermesh
mschering
Site Admin
Posts: 8332
Joined: Tue Apr 20, 2004 1:06 pm
Location: The Netherlands - Den Bosch

Re: Content indexing of uploaded files

Postby mschering » Fri Jan 02, 2009 9:20 am

Hi Hans,

I would create a folder structure like this:

modules/filesearch/
modules/filesearch/classes/
modules/filesearch/bin

Then, instead of indexer.class.inc.php, call it modules/filesearch/filesearch.class.inc.php, because then it's picked up by GO for event handling if this is ever needed. It would be nice if it indexed a file at the point where it is uploaded, but it should also be possible to reindex an entire folder structure.

I would also advise including the file timestamps in the index table so you can search on those times later too.
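To illustrate the timestamp idea, here is a rough sketch of an index table that keeps both the file's modification time and the time it was indexed. It uses Python with an in-memory SQLite database purely as a stand-in for the real database; the table name `fs_index` and all column and function names are invented for the example.

```python
import sqlite3


def create_index(conn):
    """Create a hypothetical index table with two timestamp columns."""
    conn.execute(
        """CREATE TABLE IF NOT EXISTS fs_index (
               path       TEXT PRIMARY KEY,
               acl_id     INTEGER NOT NULL,
               mtime      INTEGER NOT NULL,  -- file modification time
               indexed_at INTEGER NOT NULL,  -- when the content was indexed
               content    TEXT NOT NULL
           )"""
    )


def add_entry(conn, path, acl_id, mtime, indexed_at, content):
    """Insert a row, replacing any earlier entry for the same path."""
    conn.execute(
        "INSERT OR REPLACE INTO fs_index VALUES (?, ?, ?, ?, ?)",
        (path, acl_id, mtime, indexed_at, content),
    )


def modified_between(conn, start, end):
    """Return paths whose file timestamp lies in [start, end], oldest first."""
    rows = conn.execute(
        "SELECT path FROM fs_index WHERE mtime BETWEEN ? AND ? ORDER BY mtime",
        (start, end),
    )
    return [row[0] for row in rows]
```

With both columns stored, a search can filter on either the file's own date or the indexing date.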
Best regards,

Merijn Schering
Intermesh
hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Re: Content indexing of uploaded files

Postby hansvd » Fri Jan 02, 2009 10:36 am

The current version is completely integrated in the files module. A file is indexed upon upload and update, the index is deleted when the file is deleted. I have also made a simple (non-ajax) search page which I call from a hyperlink in an announcement.
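The lifecycle described here (index on upload and update, drop the entry on delete) might look roughly like the sketch below. It is a simplified Python illustration with an in-memory dictionary, not the actual integration in the files module; the class `FileIndex` and its method names are hypothetical.

```python
class FileIndex:
    """Toy in-memory index that follows a file's upload/update/delete events."""

    def __init__(self, extract):
        # `extract` is any callable that turns a path into plain text.
        self._extract = extract
        self._entries = {}

    def on_upload(self, path):
        """Index (or reindex) the file's content."""
        self._entries[path] = self._extract(path)

    def on_update(self, path):
        """An update simply reindexes the file."""
        self.on_upload(path)

    def on_delete(self, path):
        """Remove the entry; deleting an unindexed file is not an error."""
        self._entries.pop(path, None)

    def search(self, term):
        """Return the paths whose text contains `term`, case-insensitively."""
        needle = term.lower()
        return sorted(
            path for path, text in self._entries.items() if needle in text.lower()
        )
```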

Which timestamps should be stored: the timestamp of the uploaded file or the moment at which the indexer indexed its content?

At the moment, I am finishing debugging the Linux part of the indexer, and once this seems to be running, I plan to install this patched version of GO on a public server. But if Intermesh prefers to host this test version, that is fine with me as well.
mschering
Site Admin
Posts: 8332
Joined: Tue Apr 20, 2004 1:06 pm
Location: The Netherlands - Den Bosch

Re: Content indexing of uploaded files

Postby mschering » Mon Jan 05, 2009 9:34 am

How did you integrate it in the files module? Can you post the code?
Best regards,

Merijn Schering
Intermesh
hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Re: Content indexing of uploaded files

Postby hansvd » Mon Jan 05, 2009 10:16 am

Over the weekend, I reorganized the code as per your suggestion. I also cleaned up some odd comments and variable names. The Windows version is running fine, but I get errors on Linux. I assume it has to do with uppercase/lowercase issues or with access rights. I plan to send you the code as soon as the bugs are straightened out -- or when I am desperate. :P
hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Re: Content indexing of uploaded files

Postby hansvd » Mon Jan 05, 2009 9:34 pm

I have sent the first usable version of the filesearch module to Merijn. I leave it up to him to decide how/if/when the module is released for testing by others.

The main feature of the module is that the text content of the most popular binary formats (doc, docx, xls, xlsx, ppt, pptx, ods, sxc, odt, sxw, odp, sxi, pdf, mp3, jpg) is extracted from files as they are uploaded and stored in the database. MySQL's full-text indexer takes care of indexing and serving search results.
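For readers unfamiliar with MySQL full-text search, the general shape is a FULLTEXT index on the text column plus a MATCH ... AGAINST query. The sketch below only builds the SQL in Python and does not connect to a server; the table `fs_index`, its columns and the helper `build_search` are assumptions for illustration, not the module's real schema. The `%s` placeholders follow the style used by common Python MySQL drivers.

```python
# Hypothetical table definition. In MySQL versions of that era, FULLTEXT
# indexes were only available on MyISAM tables.
CREATE_TABLE_SQL = """
CREATE TABLE fs_index (
    file_id INT NOT NULL PRIMARY KEY,
    acl_id  INT NOT NULL,
    content MEDIUMTEXT NOT NULL,
    FULLTEXT KEY ft_content (content)
)
"""


def build_search(terms, limit=20):
    """Return (sql, params) for a relevance-ranked full-text search."""
    sql = (
        "SELECT file_id, MATCH(content) AGAINST (%s) AS score "
        "FROM fs_index "
        "WHERE MATCH(content) AGAINST (%s) "
        "ORDER BY score DESC LIMIT %s"
    )
    # The search terms are passed as parameters, never pasted into the SQL.
    return sql, (terms, terms, int(limit))
```

A real search would also need to join against the ACL data so users only see files they are allowed to read.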

This operation seems to be very fast. On my development systems, it does not cause a noticeable lag during file upload, but stress tests could, of course, still reveal performance issues.

I am looking forward to your comments.
hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Re: Content indexing of uploaded files

Postby hansvd » Sat Jan 31, 2009 7:58 pm

I tried to install the prototype in a hosted account and noticed that the tools catdoc and xls2csv no longer work at all. Neither can find the charsets it needs to operate. The only solution I found on the internet was hardcoding some paths in the source code and recompiling. Of course, this is not an option for this module.

Does anyone have experience with making catdoc and xls2csv work in a hosted environment?

Does anyone have experience with other tools (available on Win32 and on Linux) for converting binary file formats into readable text?
don
Posts: 149
Joined: Thu Mar 12, 2009 12:38 pm

Re: Content indexing of uploaded files

Postby don » Mon Mar 23, 2009 12:59 pm

Hi hansvd

Do you have any new results in the meantime?

Kind regards.

Don
Best regards,
Don
hansvd
Posts: 24
Joined: Fri Oct 03, 2008 10:17 am

Re: Content indexing of uploaded files

Postby hansvd » Mon Mar 23, 2009 5:42 pm

I haven't made any progress since my previous post. I have stopped development for the time being. Of course, I am glad to share my prototype code with anyone who would like to work on it.
don
Posts: 149
Joined: Thu Mar 12, 2009 12:38 pm

Re: Content indexing of uploaded files

Postby don » Wed Mar 25, 2009 7:11 am

My PHP development skills are very poor. I just wanted to say that I'm interested in this feature and that there will be a need for it. So, if you have code that works, I will implement it on my system.

Don
Best regards,
Don
calcorn
Posts: 110
Joined: Wed Mar 25, 2009 4:56 pm

Re: Content indexing of uploaded files

Postby calcorn » Mon Jul 27, 2009 7:23 pm

Where can I download the prototype? And Merijn, are there plans to add this to GO? This is a desperately needed feature.
Last edited by calcorn on Tue Jul 28, 2009 3:12 pm, edited 1 time in total.
mschering
Site Admin
Posts: 8332
Joined: Tue Apr 20, 2004 1:06 pm
Location: The Netherlands - Den Bosch

Re: Content indexing of uploaded files

Postby mschering » Tue Jul 28, 2009 6:52 am

We will develop this in a couple of months.
Best regards,

Merijn Schering
Intermesh
