Crawling / screen scraping
I am going to build a job site where we crawl jobs from a range of sites, just as www.jobindex.dk does. I am therefore looking for a programmer who can do this.
Some more detail:
We want to crawl websites for job ads and show them on our website from our database. Our website is backed by a PostgreSQL database, but since its schema changes often, we may simply place the crawled data in a separate new database and then let our own system extract the data from it. Simple.
Some of the websites we want to crawl:
- eures.dk (http://ec.europa.eu/eures/main.jsp?lang=da&acro=job&catId=482&parentCategory=482)
- jobnet.dk
- etc.
What we want to scrape:
- Provider name and logo (e.g. Manpower)
- Job title
- Body text/job description
- Location
- Sector
- Publication date
The result should look like what others already do, e.g. www.jobindex.dk and www.simplyhired.com.
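To make the field list concrete, here is a minimal sketch of one crawled job ad and of writing it to a separate staging database. It assumes SQLite as a stand-in for the staging database; the names JobAd, open_staging and save_ad, and the extra source_url field, are hypothetical and only for illustration.

```python
import sqlite3
from dataclasses import dataclass, astuple


@dataclass
class JobAd:
    provider_name: str      # e.g. "Manpower"
    provider_logo_url: str
    title: str
    description: str        # body text of the ad
    location: str
    sector: str
    published: str          # ISO date string, e.g. "2008-05-01"
    source_url: str         # assumption: the page the ad was crawled from


def open_staging(path=":memory:"):
    """Open (or create) the staging database with one table for crawled ads."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS job_ads (
               provider_name TEXT, provider_logo_url TEXT, title TEXT,
               description TEXT, location TEXT, sector TEXT,
               published TEXT, source_url TEXT PRIMARY KEY)"""
    )
    return conn


def save_ad(conn, ad):
    """Insert the ad, or overwrite the existing row if the URL was crawled before."""
    conn.execute(
        "INSERT OR REPLACE INTO job_ads VALUES (?, ?, ?, ?, ?, ?, ?, ?)",
        astuple(ad),
    )
    conn.commit()
```

Keying the table on the source URL means that re-crawling a site updates existing rows instead of piling up copies; our main system can then read from this table on its own schedule.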
Requirements:
- It shall be possible to hide some of the information in our database, e.g. to hide the provider and its contact information and show our own contact information instead.
- It shall not be possible for the websites we crawl to track this (hide/change IP address).
- The crawled information shall be updated at least every second day.
- No duplicate jobs/data: if the same job is listed on e.g. 2 different websites, we do not want it twice, so the system shall be able to detect and remove duplicate jobs.
- No errors ;o)
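The duplicate requirement could be approached roughly like this. It is a minimal sketch, assuming that two ads are "the same job" when title, provider and location match after normalising case and punctuation; real ads will likely need fuzzier matching. The function names and the dictionary keys are hypothetical.

```python
import hashlib
import re


def fingerprint(title, provider, location):
    """Hash of the normalised title, provider and location of an ad."""
    def norm(s):
        # Lower-case and collapse punctuation/whitespace runs to single spaces.
        return re.sub(r"[^\w]+", " ", s.lower()).strip()

    key = "|".join(norm(part) for part in (title, provider, location))
    return hashlib.sha1(key.encode("utf-8")).hexdigest()


def drop_duplicates(ads):
    """Keep the first ad seen for each fingerprint, in the original order."""
    seen = set()
    unique = []
    for ad in ads:
        fp = fingerprint(ad["title"], ad["provider"], ad["location"])
        if fp not in seen:
            seen.add(fp)
            unique.append(ad)
    return unique
```

With this rule, the same job crawled from two different websites collapses into one entry even if one site writes the title in capitals or adds punctuation.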
Wishes:
- Translation of job ads into different languages (e.g. using the http://www.google.com/translate_t API, see http://googlified.com/2006unofficial-google-translate-api/)
- Possibly a search agent: users/job seekers can create a job agent that sends them relevant jobs as they appear in the database.
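The job agent wish could work along these lines. This is only a sketch under simple assumptions: an agent is a set of keywords plus an optional location, and an ad is relevant if any keyword occurs in its title or description. The names matches and jobs_for_agent and the dictionary keys are hypothetical, and actually sending the e-mail is left out.

```python
def matches(agent, ad):
    """True if the ad contains any agent keyword and fits the agent's location."""
    text = (ad["title"] + " " + ad["description"]).lower()
    keyword_ok = any(k.lower() in text for k in agent["keywords"])
    wanted = agent.get("location")
    # An agent without a location accepts jobs from anywhere.
    location_ok = not wanted or wanted.lower() == ad["location"].lower()
    return keyword_ok and location_ok


def jobs_for_agent(agent, new_ads):
    """Filter the newly crawled ads down to those relevant for one agent."""
    return [ad for ad in new_ads if matches(agent, ad)]
```

Running jobs_for_agent after each crawl, on only the ads added since the previous run, would give each job seeker the list to be mailed out.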
Possible software:
- E.g. standard software such as http://www.screen-scraper.com/, paying the programmer per robot (one crawler per website). This could be e.g. www.botcode.com in India (who can make robots for 60 USD/robot using standard software) or a Danish programmer.
- Many other standard software packages, such as http://lucene.apache.org/nutch/about.html, http://www.velocityscape.com/Products/WebScraperLite.aspx etc., could be used.
Please let me know if you are the right person, or if you can refer me to the right person.
Best regards
Simon