**How to install dataparksearch onto Ubuntu Feisty (x86_64) LTS - command line only**
I have not tried it under Ubuntu, but here is a simpler approach that I used under Debian:
- Fetch a 4.49 snapshot of dpsearch.
- untar the tarball.
- go to the dpsearch-4.49-xxx directory
- edit debian/rules and change the configure line to match the options that you want (e.g. change --with-pgsql to --with-mysql if using MySQL).
- type "fakeroot dpkg-buildpackage -uc -us -sa" to build the Debian package (it will be one directory above the current one and called dpsearch-xxx.deb, where xxx depends on the version etc.)
- install the Debian package with "sudo dpkg -i dpsearch-xxxx.deb"
- the configuration files will be in /usr/etc/dpsearch, var directory in /usr/var/lib/dpsearch etc.
- If you are running a local repository then it can be added to using the "dpsearch-xxx.changes" file.
Keeps it nicely in the debian package management system.
Will improve this later to automatically build the version for each db.
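The build steps above might be scripted roughly as follows. This is only a sketch: the function names switch_db_backend and build_dpsearch_deb are made up for illustration, and it assumes GNU sed plus the fakeroot and dpkg-dev packages are installed. Nothing runs until you call build_dpsearch_deb yourself from inside the unpacked source directory.

```shell
#!/bin/sh
# Sketch only: these helper names are hypothetical, not part of dpsearch.

# Swap the database option on the configure line of a debian/rules file.
# $1 = path to the rules file (assumes GNU sed for the -i option).
switch_db_backend() {
    sed -i 's/--with-pgsql/--with-mysql/g' "$1"
}

# Build the package from inside the unpacked source directory.
# The resulting .deb lands one directory above the current one.
build_dpsearch_deb() {
    switch_db_backend debian/rules &&
    fakeroot dpkg-buildpackage -uc -us -sa
}
```

After the build finishes, install the package with "sudo dpkg -i" as described in the list.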
Amit
what about - aptitude install dataparksearch?
- sudo su (easier to do now... !)
- apt-get update
- apt-get install nano
- add new debs (see separate doc) to /etc/apt/sources.list (use nano /etc/apt/sources.list - make changes - ctrl-O to save - ctrl-X to exit)
- apt-get update
- apt-get install make
- apt-get install apache2
- apt-get install php5
- apt-get install libapache2-mod-perl2
- apt-get install libapache2-mod-perl2-dev
- apt-get install zlib1g-dev
- apt-get install zlib1g (might already be installed)
- apt-get install mysql-server
- apt-get install libmysqlclient15-dev (to get mysql.h)
- cd to toplevel
- mkdir downloaded_software
- cd downloaded_software
- wget http://www.dataparksearch.org/dpsearch-4.46.tar.gz
- tar -zxf dpsearch-x.x.tar.gz
- mysqladmin create search
- cd dpsearch-x.x
- apt-get install gcc
- apt-get install aspell
- apt-get install aspell-en (English dictionary)
- ./install.pl
- answer the questions! - there is a problem with aspell at the moment... it fails every time
- make
- make install (run as root!)
- cd /usr/local/dpsearch/etc
- cp indexer.conf-dist indexer.conf
- nano indexer.conf
- change DBAddr - make sure it's correct... if you haven't touched mysql you can use mysql://root@localhost...blah
- ctrl-O to save
- ctrl-X to exit
- cp langmap.conf-dist langmap.conf
- cp search.htm-dist search.htm
- cp stopwords.conf-dist stopwords.conf
- cp sections.conf-dist sections.conf
- nano sections.conf and remove # from the lines you wish to use; this is used by the spider to add weight to certain sections of a page.
- nano search.htm and make DBAddr the same as you put in indexer.conf, you can also edit any of the html if you wish (scroll down the file to find it.. )
- nano stopwords.conf - scroll down to the line "StopwordFile stopwords/ja.sl" and place a # at the start of that line..
- /usr/local/dpsearch/sbin/indexer -Ecreate - it SHOULD bring back something like "blah, blah, blah - 42 queries sent, 42 succeeded, 0 failed" - if so :)
- mkdir /var/www/cgi-bin
- chmod 777 /var/www/cgi-bin (again just to make it work)
- cp /usr/local/dpsearch/bin/search.cgi /var/www/cgi-bin/search.cgi
- chmod 777 /var/www/cgi-bin/search.cgi (just to make sure ;)
- cd /etc/apache2/sites-available/
- nano default
- find and change the following code:
    ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
    <Directory "/usr/lib/cgi-bin">
        AllowOverride None
        Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>
to this code:
    ScriptAlias /cgi-bin/ /var/www/cgi-bin/
    <Directory "/var/www/cgi-bin">
        AllowOverride None
        Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
        Order allow,deny
        Allow from all
    </Directory>
* save and exit nano
* /etc/init.d/apache2 restart
* fire up a web browser and point it to http://your-server/cgi-bin/search.cgi (use your own server's hostname or IP address)
* if you see a normal (but simple) web page with a search box.. then well done! that's the hardest part over :o)
* if you see a load of gobbledygook then go back over the /etc/apache2/sites-available/default bit again.. it means the cgi is being presented as plain text and not being executed as a program.. check the permissions on both the file and dir..
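If the page still comes back as plain text, a quick permissions check along these lines might save some head scratching. It is only a sketch: check_cgi is a made-up helper name, and the -x test is done as whoever runs the script, not as the Apache user, so a pass here does not guarantee Apache can execute the file.

```shell
#!/bin/sh
# Sketch only: check_cgi is a hypothetical helper, not part of dpsearch or Apache.

# Report whether a CGI program and its directory look usable.
# $1 = path to the CGI file, e.g. /var/www/cgi-bin/search.cgi
check_cgi() {
    cgi=$1
    dir=$(dirname "$cgi")
    if [ ! -d "$dir" ]; then
        echo "missing directory: $dir"
        return 1
    fi
    if [ ! -f "$cgi" ]; then
        echo "missing file: $cgi"
        return 1
    fi
    # -x is evaluated for the current user, not for the Apache user
    if [ ! -x "$cgi" ]; then
        echo "not executable: $cgi"
        return 1
    fi
    echo "ok: $cgi"
}
```

For example: check_cgi /var/www/cgi-bin/search.cgi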
now we have to bang in some URLs to search and off we go..
to do this
- cd /usr/local/dpsearch/etc/
- nano urls (make a file called urls to hold info about the urls...)
Then you need to add a URL and tell the search engine how to search it..
a simple one could be something like this:
(Copy and paste the following code if you wish..)
#############
#
# Simple URL List
#
#############
#
# scan the C4 news website index page only
# scan again in 7 days.
Period 7d
Server page http://www.channel4.com/news/
#
#
######################
- Ctrl-o to save and then ctrl-x to quit nano...
- Now you have to connect the urls file to indexer.conf by adding a link to it.. first open indexer.conf in nano, scroll down and find the section on Server - something like this:
##########################
#Server [Method] [Subsection..........
# This is the main command..........
#to describe web-space.......
#..........
#..........
#..........
############
just add at the end of that section the following line:
Include urls
(note there is no leading hash..!)
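If you would rather not edit the file by hand, the same change could be made from the shell with something like the sketch below. The helper name add_include is made up for illustration. Note that it appends the line at the very end of the file rather than at the end of the Server section; that should have the same effect but is an untested assumption.

```shell
#!/bin/sh
# Sketch only: add_include is a hypothetical helper, not a dpsearch tool.

# Append "Include <file>" to a config file unless that exact line is already there.
# $1 = config file (e.g. /usr/local/dpsearch/etc/indexer.conf)
# $2 = name of the file to include (e.g. urls)
add_include() {
    conf=$1
    line="Include $2"
    # -x matches the whole line, -F treats the text literally
    if grep -qxF "$line" "$conf"; then
        echo "already present: $line"
    else
        printf '%s\n' "$line" >> "$conf"
        echo "added: $line"
    fi
}
```

For example: add_include /usr/local/dpsearch/etc/indexer.conf urls
Running it twice is harmless, since the second run finds the line and leaves the file alone.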
* ctrl-o then ctrl-x to save and exit nano
* next you need to index the site..
in the shell type
/usr/local/dpsearch/sbin/indexer
This will show you in real time the page being scanned.. and should return some information once complete.
* Pop along to your web front end (that cgi-bin/search.cgi page) and try typing in something off the CH4 news website, hopefully it will return some results.
* Read some of the documents in the /docs/samples folder (in the datapark folder) and start by changing some of the elements within the URL
* Take a look at http://www.dataparksearch.org/dpsearch-indexcmd.en.html for more info on URL settings.. read the bit about urls in the instructions provided by dpsearch (see /usr/local/dpsearch/docs/ for more info)
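Once you start experimenting with the URL settings you will probably be re-running the indexer a lot, so it can help to keep a copy of what it prints. A minimal sketch, assuming the indexer path used earlier; run_indexer and the log file location are made-up names, and no indexer options are passed.

```shell
#!/bin/sh
# Sketch only: run_indexer is a hypothetical wrapper, not part of dpsearch.

# Run the indexer, show its output as it happens and keep a copy in a log file.
# $1 = indexer binary (e.g. /usr/local/dpsearch/sbin/indexer)
# $2 = log file to write
run_indexer() {
    indexer=$1
    log=$2
    if [ ! -x "$indexer" ]; then
        echo "indexer not found or not executable: $indexer" >&2
        return 1
    fi
    # tee's exit status is returned here, not the indexer's
    "$indexer" 2>&1 | tee "$log"
}
```

For example: run_indexer /usr/local/dpsearch/sbin/indexer /tmp/indexer.log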
Hopefully this has been useful in some way. I use it for specific search engines and because I like the technology around search engines.
Another search engine I have used and dabbled with is phpdig.net, and I have managed to get it to index around 300 websites (a huge index for a PHP-based system) with an average return time for a search request of around 5 seconds.
The dataparksearch engine is much quicker...
Next I will be looking at the java based search engine Nutch..
http://en.wikipedia.org/wiki/Nutch
Adam