Google for your hard disk?
Posted: April 28th, 2004, 3:59 pm
by mikebell
I've come across an interesting problem... I have around 17 GB of PDF, Word, Excel, HTML, and PPT files on my hard disk, and even though they're somewhat neatly organized in around 30 folders, I still have a hard time finding what I want. I generally save anything I find or read that's interesting, since bookmarking is highly unreliable: sites and articles tend to disappear into for-pay archive sections or vanish completely. Unless you have a subscription to Lexis-Nexis, it's pretty much pointless to search spam-filled Google for old articles. The search implemented in today's OS file browsers looks at filenames/directories only. OS X's Finder has indexing built in, but it doesn't handle every filetype (no PPT support) and it's fairly inefficient.

I'm wondering if there's a (preferably free and preferably *nix) tool that can index all these files (go into PDF, PPT, etc. and index all of the relevant text) so that you can quickly search for strings? I came across this: X1, but it's Windows-only and still in beta.
Google for your hard disk?
Posted: April 28th, 2004, 8:09 pm
by linuxuser99
There are Perl modules for reading virtually anything - and implementing "grep" is easy once you have a text stream to work with (in any event, most file formats don't encrypt the text inside the file, they just pad it with crap) - so it would be easy enough to write your own. Alternatively, use "strings" to rip out the readable text, then pipe it into a DBM file on a word-by-word basis and search on that. It would be circa an hour's work to write a search utility for that.
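A minimal sketch of that strings-to-DBM idea, in Python rather than Perl (the tokenizer and the default plain-text reader are illustrative; for binary formats you'd swap in strings(1) or a format-specific filter):

```python
import dbm
import re

WORD = re.compile(rb"[A-Za-z]{3,}")  # crude tokenizer: runs of 3+ letters


def index_files(paths, db_path, read=lambda p: open(p, "rb").read()):
    """Map each word to a comma-separated list of the files containing it.

    `read` pulls a byte stream out of a file; this default just reads the
    raw bytes, which works for plain text. For PDF/PPT/etc. you would
    substitute something like the output of strings(1).
    """
    with dbm.open(db_path, "c") as db:
        for path in paths:
            for word in {w.lower() for w in WORD.findall(read(path))}:
                seen = db.get(word, b"")
                entry = path.encode()
                if entry not in seen.split(b","):
                    db[word] = seen + b"," + entry if seen else entry


def search(word, db_path):
    """Return the list of files whose text contains `word`."""
    with dbm.open(db_path, "r") as db:
        hit = db.get(word.lower().encode(), b"")
        return hit.decode().split(",") if hit else []
```

The DBM file gives you disk-backed constant-time word lookup for free, which is roughly what makes the "hour of work" estimate plausible for a first cut.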
Google for your hard disk?
Posted: April 28th, 2004, 8:21 pm
by mikebell
Quote: Originally posted by linuxuser99:
There are Perl modules for reading virtually anything - and implementing "grep" is easy once you have a text stream to work with - so it would be easy enough to write your own. Alternatively, use "strings" to rip out the readable text, then pipe it into a DBM file on a word-by-word basis and search on that. It would be circa an hour's work to write a search utility for that.

Thanks for the tip. However, it's not just an hour of work. Writing a good/efficient/fast clustering and hashing algorithm to search this index is not a trivial task. I have 17 GB of data, and a bad index could end up being 10+ GB.
Google for your hard disk?
Posted: April 28th, 2004, 8:52 pm
by linuxuser99
Quote: Originally posted by mikebell:
[...] Thanks for the tip. However, it's not just an hour of work. Writing a good/efficient/fast clustering and hashing algorithm to search this index is not a trivial task. I have 17 GB of data, and a bad index could end up being 10+ GB.

Ah - now if you're going to add complications like being fast and small <g>

I agree that using raw strings will end up with a huge file - but disk space is really, really cheap now. A naive approach (store each unique word in a hash with a comma-separated list of the file names it appears in) will be large for sure, but would be very fast and quick to develop. A more sophisticated approach - storing individual words in a hash with the files specified as a comma-separated list of individual numbers, plus an offset hash to de-reference the file names - would save a LOT of space, especially if you ran a quick filter to pull out words like "a", "the", "and", "but" that will be in all the files. I'll bet you wouldn't end up more than 50% bigger than a commercial product if you took that approach. You're right that getting that last 50% would be a bitch, though!
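The more compact scheme described above - small integer file IDs in the word table, an offset table to de-reference them, and a stopword filter - might be sketched like this (Python rather than Perl; the stopword list is illustrative, not exhaustive):

```python
import re

WORD = re.compile(r"[a-z]{2,}")
# Words present in virtually every document add bulk but no selectivity.
STOPWORDS = {"a", "an", "the", "and", "but", "or", "of", "to", "in", "is"}


def build_index(docs):
    """docs: {filename: text}. Returns (file_table, index).

    `index` maps each word to a comma-separated string of small integer
    file IDs; `file_table` (the "offset hash") de-references an ID back
    to its filename, so long paths are stored only once.
    """
    file_table = []   # ID -> filename
    index = {}        # word -> "0,3,7"
    for name, text in docs.items():
        fid = len(file_table)
        file_table.append(name)
        for word in set(WORD.findall(text.lower())) - STOPWORDS:
            ids = index.get(word)
            index[word] = f"{ids},{fid}" if ids else str(fid)
    return file_table, index


def lookup(word, file_table, index):
    """Resolve a word back to the filenames that contain it."""
    ids = index.get(word.lower(), "")
    return [file_table[int(i)] for i in ids.split(",")] if ids else []
```

Since an ID is a few digits instead of a full path, the per-word entries shrink dramatically when the same files recur across thousands of words - which is exactly where the naive comma-separated-filename approach bloats.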
Google for your hard disk?
Posted: April 28th, 2004, 9:37 pm
by mikebell
Agreed. Have you looked at this?
http://jakarta.apache.org/lucene/docs/index.html

I think that coupling Lucene with Perl filters would be the way to go. Also, there must be something like this out there already... there's no way I'm the first to run into this. I'd hate to duplicate work, especially when my free time is so short.
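The "filters" half of that pairing is just a dispatch table from file type to a text extractor. A sketch (the converter commands - pdftotext, antiword, lynx - are common Unix tools offered as examples, not a fixed requirement):

```python
import subprocess

# Map file extensions to external commands that emit plain text on stdout.
# These particular converters are assumptions -- substitute whatever is
# installed; the indexer only ever sees the resulting text stream.
EXTRACTORS = {
    ".pdf":  ["pdftotext", "{path}", "-"],
    ".doc":  ["antiword", "{path}"],
    ".html": ["lynx", "-dump", "{path}"],
}


def extractor_for(path):
    """Return the command line that converts `path` to plain text,
    or None for formats we can read directly."""
    for ext, cmd in EXTRACTORS.items():
        if path.lower().endswith(ext):
            return [arg.replace("{path}", path) for arg in cmd]
    return None


def extract_text(path):
    """Produce the text stream an indexer (Lucene or homegrown) would consume."""
    cmd = extractor_for(path)
    if cmd is None:  # treat as plain text
        with open(path, "rb") as f:
            return f.read().decode("latin-1")
    return subprocess.run(cmd, capture_output=True, text=True).stdout
```

The indexer never needs to know about file formats; adding support for a new type is one new dictionary entry.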
Google for your hard disk?
Posted: April 28th, 2004, 9:59 pm
by linuxuser99
Yes - that does look good. There are still copies of the old AltaVista personal search engine out there, if you're willing to run something on a PC pointed at a Unix drive. It was more or less exactly what you're looking for, but it sold like a dog and got discontinued a couple of years back.
Google for your hard disk?
Posted: April 29th, 2004, 5:29 am
by tonyc
Quote: Originally posted by linuxuser99:
There are still copies of the old AltaVista personal search engine out there, if you're willing to run something on a PC pointed at a Unix drive. It was more or less exactly what you're looking for, but it sold like a dog and got discontinued a couple of years back.

I tried to find it on the AltaVista web site, but only found references to a $500,000 "enterprise" solution... can you point me towards the workstation solution?
Google for your hard disk?
Posted: April 29th, 2004, 12:09 pm
by Baltazar
One possible solution would be to set up a personal web server with all these documents and ask Google to index it (I'm not sure how to do that, but I think it's possible).

Pro: Google stores and builds the index.
Con: you have to publish these documents long enough for Google to index them.

Probably an intractable solution.

B.
Google for your hard disk?
Posted: April 29th, 2004, 2:22 pm
by zer0snr
How about... Reference Manager?
Google for your hard disk?
Posted: April 29th, 2004, 8:20 pm
by mikebell
Quote: Originally posted by Baltazar:
One possible solution would be to set up a personal web server with all these documents and ask Google to index it. Pro: Google stores and builds the index. Con: you have to publish these documents long enough for Google to index them.

Huge con: anyone can then search my private docs. Even if I remove them, they're still cached by Google. That's simply unacceptable. I know Google sells an appliance for intranets, but those cost thousands of dollars.
Google for your hard disk?
Posted: May 1st, 2004, 11:18 am
by linuxuser99
http://aroundcny.com/technofile/texts/bit071298.html

I would definitely run a virus check on this before running it - but here it is. Sorry for the late reply - work intervened with my leisure time.
Google for your hard disk?
Posted: May 9th, 2004, 11:22 am
by kristj