How to install dataparksearch on Ubuntu Feisty (x86_64) LTS - command line only

I have not tried it under Ubuntu, but here is a simpler approach that I used under Debian:

  • Fetch a 4.49 snapshot of dpsearch.
  • Untar the tarball.
  • Go to the dpsearch-4.9-xxx directory.
  • Edit debian/rules and change the configure line to match the options you want (e.g. change --with-pgsql to --with-mysql if using MySQL).
  • Type "fakeroot dpkg-buildpackage -uc -us -sa" to build the Debian package (it will be one directory above the current one and called dpsearch-xxx.deb, where xxx depends on the version etc.).
  • Install the package with "sudo dpkg -i dpsearch-xxxx.deb".
  • The configuration files will be in /usr/etc/dpsearch, the var directory in /usr/var/lib/dpsearch, etc.
  • If you are running a local repository, the package can be added to it using the "dpsearch-xxx.changes" file. This keeps it nicely in the Debian package management system.
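The debian/rules edit can be scripted rather than done by hand. This is only a minimal sketch: it assumes the configure line contains the literal flag --with-pgsql, the function name switch_db_flag is made up for illustration, and the build commands stay in comments because the real directory and package names depend on the snapshot you fetched.

```shell
#!/bin/sh
# Hypothetical helper: swap the database backend flag in a debian/rules file.
# Assumes the configure line contains the literal text "--with-pgsql".
switch_db_flag() {
    rules_file="$1"
    tmp_file="$rules_file.tmp"
    # Edit into a temp file, then copy the contents back with cat so the
    # original file keeps its permissions (debian/rules must stay executable).
    sed 's/--with-pgsql/--with-mysql/g' "$rules_file" > "$tmp_file" &&
        cat "$tmp_file" > "$rules_file" &&
        rm -f "$tmp_file"
}

# Typical use from inside the unpacked source directory:
#   switch_db_flag debian/rules
#   fakeroot dpkg-buildpackage -uc -us -sa
#   sudo dpkg -i ../dpsearch-xxxx.deb
```

Run it once before building; running it a second time is harmless because there is nothing left to replace.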

Will improve this later to automatically build the version for each db.

Amit

What about "aptitude install dataparksearch"?

  1. sudo su (easier to do now… !)
  2. apt-get update
  3. apt-get install nano
  4. add new debs (see separate doc) to /etc/apt/sources.list (use nano /etc/apt/sources.list - make changes - ctrl-O to save - ctrl-X to exit)
  5. apt-get update
  6. apt-get install make
  7. apt-get install apache2
  8. apt-get install php5
  9. apt-get install libapache2-mod-perl2
  10. apt-get install libapache2-mod-perl2-dev
  11. apt-get install zlib1g-dev
  12. apt-get install zlib1g (might already be installed)
  13. apt-get install mysql-server
  14. apt-get install libmysqlclient15-dev (to get mysql.h)
  15. cd to toplevel
  16. mkdir downloaded_software
  17. tar -zxf dpsearch-x.x.tar.gz
  18. mysqladmin create search
  19. cd dpsearch-x.x
  20. apt-get install gcc
  21. apt-get install aspell
  22. apt-get install aspell-en (english dic)
  23. ./install.pl
  24. answer the questions! - there is a problem with aspell at the moment: it fails every time
  25. make
  26. make install (run as root!)
  27. cd /usr/local/dpsearch/etc
  28. cp indexer.conf-dist indexer.conf
  29. nano indexer.conf
  30. change DBAddr - make sure it's correct… if you haven't touched mysql you can use mysql:root@localhost…blah - ctrl-O to save - ctrl-X to exit
  • cp langmap.conf-dist langmap.conf
  • cp search.htm-dist search.htm
  • cp stopwords.conf-dist stopwords.conf
  • cp sections.conf-dist sections.conf
  • nano sections.conf and remove the # from the lines you wish to use; this is used by the spider to add weight to certain bits.
  • nano search.htm and make DBAddr the same as you put in indexer.conf; you can also edit any of the HTML if you wish (scroll down the file to find it)
  • nano stopwords.conf - scroll down to the line "StopwordFile stopwords/ja.sl" and place a # at the start of that line
  • /usr/local/dpsearch/sbin/indexer -Ecreate
  • it SHOULD bring back something like "blah, blah, blah - 42 queries sent, 42 succeeded, 0 failed" - if so :)
  • mkdir /var/www/cgi-bin
  • chmod 777 /var/www/cgi-bin (again just to make it work)
  • cp /usr/local/dpsearch/bin/search.cgi /var/www/cgi-bin/search.cgi
  • chmod 777 /var/www/cgi-bin/search.cgi (just to make sure ;)
  • cd /etc/apache2/sites-available/
  • nano default
  • find the following code:

     ScriptAlias /cgi-bin/ /usr/lib/cgi-bin/
     <Directory "/usr/lib/cgi-bin">
         AllowOverride None
         Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
         Order allow,deny
         Allow from all
     </Directory>

    and change it to this code:

     ScriptAlias /cgi-bin/ /var/www/cgi-bin/
     <Directory "/var/www/cgi-bin">
         AllowOverride None
         Options ExecCGI -MultiViews +SymLinksIfOwnerMatch
         Order allow,deny
         Allow from all
     </Directory>

  • save and exit nano
  • /etc/init.d/apache2 restart
  • fire up a web browser and point it to http://<your host>/cgi-bin/search.cgi
  • if you see a normal (but simple) web page with a search box, then well done! That's the hardest part over :o)
  • if you see a load of gobbledygook, then go back over the bit about /etc/apache2/sites-available/default again. It means the cgi is being presented as plain text and not being executed as a CGI program. Check the permissions on both the file and the dir.
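Two of the steps above are easy to get wrong by hand: copying the -dist templates, and keeping DBAddr identical in indexer.conf and search.htm. Here is a rough sketch of how both could be scripted. The function names are made up for illustration, and it assumes each DBAddr line has the form "DBAddr <address>" with the address as the second whitespace-separated field.

```shell
#!/bin/sh
# Hypothetical helpers; ETC defaults to the install prefix used in this guide.
ETC="${ETC:-/usr/local/dpsearch/etc}"

# Copy each shipped *-dist template to its live name, without
# overwriting a file that has already been edited.
copy_dist_configs() {
    for name in indexer.conf langmap.conf search.htm stopwords.conf sections.conf; do
        if [ -f "$ETC/$name-dist" ] && [ ! -e "$ETC/$name" ]; then
            cp "$ETC/$name-dist" "$ETC/$name"
        fi
    done
}

# Print the address of the first active (uncommented) DBAddr line in a file.
dbaddr_of() {
    awk '$1 == "DBAddr" { print $2; exit }' "$1"
}

# Succeed only if indexer.conf and search.htm use the same DBAddr.
check_dbaddr_match() {
    a=$(dbaddr_of "$ETC/indexer.conf")
    b=$(dbaddr_of "$ETC/search.htm")
    if [ -n "$a" ] && [ "$a" = "$b" ]; then
        echo "DBAddr matches: $a"
    else
        echo "DBAddr mismatch or missing" >&2
        return 1
    fi
}
```

Call copy_dist_configs once after "make install", edit the files, then run check_dbaddr_match before "indexer -Ecreate". A mismatch is a likely cause of a search page that loads but never returns results.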

Now we have to bang in some URLs to search, and off we go. To do this:

  1. cd /usr/local/dpsearch/etc/
  2. nano urls (make a file called urls to hold info about the urls…)

Then you need to add a URL and tell the search engine how to search it.. a simple one could be something like this: (Copy and paste the following code if you wish..)

     #############
     #
     # Simple URL List
     #
     #############
     #
     # scan the C4 news website index page only
     # scan again in 7 days
     Period 7d
     Server page http://www.channel4.com/news/
     #
     #
     ######################
  1. Ctrl-o to save and then ctrl-x to quit nano.
  2. Now you have to connect the urls file to indexer.conf by adding a link to it. First open indexer.conf in nano, scroll down and find the section on Server, which looks something like this:
     ##########################
     #Server [Method] [Subsection..........
     # This is the main command..........
     #to describe web-space.......
     #..........
     #..........
     #..........
     ############

just add at the end of that section the following line:

    Include urls

(note there is no leading hash..!)
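If you would rather not edit indexer.conf by hand, the line can be added from the shell. This is just a sketch with a made-up function name, add_include. Note that it appends the line at the end of the file rather than at the end of the Server section, which is an assumption on my part about what works, so check the result in nano afterwards.

```shell
#!/bin/sh
# Hypothetical helper: add "Include <file>" to a config file only if no
# active (uncommented) Include line for that file exists yet.
# NOTE: this appends at the end of the config, not inside the Server
# section as described in the text; that placement is an assumption.
add_include() {
    conf="$1"
    target="${2:-urls}"
    # awk exits 0 when a matching active line is found, 1 otherwise.
    if awk -v t="$target" '$1 == "Include" && $2 == t { found = 1 } END { exit !found }' "$conf"; then
        echo "already included: $target"
    else
        printf 'Include %s\n' "$target" >> "$conf"
        echo "added: Include $target"
    fi
}

# Typical use:
#   add_include /usr/local/dpsearch/etc/indexer.conf urls
```

Because it checks first, running it twice does not leave a duplicate Include line behind.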

  • ctrl-o then ctrl-x to save and exit nano
  • next you need to index the site..

in the shell type

      /usr/local/dpsearch/sbin/indexer

This will show you in real time the page being scanned.. and should return some information once complete.
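A small wrapper can save some head scratching if the indexer is not where you expect it. The sketch below assumes the binary lives in /usr/local/dpsearch/sbin, as in the -Ecreate step earlier. The function name run_indexer and the log file name are made up for illustration.

```shell
#!/bin/sh
# Path assumed from the "-Ecreate" step earlier in this guide; override
# INDEXER if your install prefix differs.
INDEXER="${INDEXER:-/usr/local/dpsearch/sbin/indexer}"

# Hypothetical wrapper: refuse to start if the binary is missing, and keep
# a copy of the indexer output in a log file for later inspection.
run_indexer() {
    log_file="${1:-indexer.log}"
    if [ ! -x "$INDEXER" ]; then
        echo "indexer not found or not executable: $INDEXER" >&2
        return 1
    fi
    # Output still appears on screen in real time; tee also writes it to
    # the log. The return status is that of tee, not of the indexer.
    "$INDEXER" 2>&1 | tee "$log_file"
}

# Typical use:
#   run_indexer /tmp/indexer.log
```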

  • Pop along to your web front end (that cgi-bin/search.cgi page) and try typing in something off the CH4 news website, hopefully it will return some results.
  • Read some of the documents in the /docs/samples folder (in the datapark folder) and start by changing some of the elements within the urls file
  • Take a look at http://www.dataparksearch.org/dpsearch-indexcmd.en.html for more info on URL settings.. read the bit about urls in the instructions provided by dpsearch (see /usr/local/dpsearch/docs/ for more info)

Hopefully this has been useful in some way. I use it for specific search engines and because I like the technology around search engines. Another search engine I have used and dabbled with is phpdig.net, and I have managed to get it to index around 300 websites (a huge index for a PHP-based system), with an average return time for a search request of around 5 seconds.

The dataparksearch engine is much quicker…

Next I will be looking at the Java-based search engine Nutch: http://en.wikipedia.org/wiki/Nutch

Adam