Developing a scraper server with Python and Elasticsearch
Note that scraping other people’s websites is not a legitimate practice, so this article is for the sake of the development challenge only.
The server should have the following RESTful services:
- Scrape a public webpage, given its URL, and store the result in a structured manner in a persistent layer.
- Search for pages by a property and value.
- Search for the top-scored pages. The score should be calculated from the fields of the page (with a trivial or a complex algorithm).
Furthermore, a scalable architecture for high volume should be implemented.
Looking for a scraper
Scraping websites is a time-consuming job, because websites of course don’t want you to “abuse” them with a server and will block your IP upon excessive use.
I used Selenium, which is commonly used for web testing, to get all the fields of the page. Selenium is slow, heavy and not very scalable (unless you also use Selenium Grid), but assuming pages will be added only a fairly small number of times, it is OK. The architecture and the chosen datastore do not depend on the scraper, so another Python scraper can easily be integrated.
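To illustrate how the scraper can stay decoupled from the rest of the server, here is a minimal sketch, not the code used for this project. It assumes a Selenium WebDriver object is passed in; the function name scrape_page and the output keys URL, TITLE and TEXT are made up for the example.

```python
def scrape_page(url, driver):
    """Load `url` with a Selenium WebDriver and return a few fields as a dict.

    `driver` is any object exposing the WebDriver API (get, title,
    find_element). The output keys are illustrative assumptions.
    """
    driver.get(url)
    # "tag name" is the locator strategy string behind selenium's By.TAG_NAME
    body = driver.find_element("tag name", "body")
    return {
        "URL": url,
        "TITLE": driver.title,
        "TEXT": body.text,
    }
```

With PhantomJS the driver would be created with webdriver.PhantomJS(); newer Selenium releases dropped PhantomJS support, so a headless Chrome or Firefox driver would be passed instead. Because the function returns a plain dict, swapping Selenium for another scraper would not touch the storage or search code.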
For novice Python developers, here are the detailed steps to use the scraper:
Download and install Python 2.7, PyCharm (or another IDE/text editor) and PhantomJS. Then, from the command line, use pip (Python’s package manager, installed by default with Python) to install the 3 necessary packages, which I will explain later:
pip install selenium flask elasticsearch
After feeding the scraper a URL, you’ll get JSON that looks somewhat like this:
"TITLE": "Some title..",
"DESCRIPTION": "Software projects architect...",
"DATE": "July 2012 \u2013 Present (4 years)",
"TITLE": "Software Engineer",
"DATE": "April 2011 \u2013 July 2012 (1 year 4 months)",
REST APIs with Python
Now to turn the scraping tool into a RESTful server. I recommend Flask. It is very easy to use:
from flask import Flask

app = Flask(__name__)

# Publish a REST service accessible from "localhost:5000/"
@app.route("/")
def hello():
    return "Hello World!"

if __name__ == "__main__":
    app.run()
Flask is relatively new; as of the date of publishing this post it is at version 0.11. You can read about its design here: http://flask.pocoo.org/docs/0.11/design/#design
We want to index a high volume of JSON documents, query them on multiple fields and score them by some mechanism. Without considering too many options, Elasticsearch fits like a glove. Elasticsearch is a JSON-based document store with dynamic mapping, so it can take a sample JSON LinkedIn profile and store it as is. This saves all the hassle of designing a relational schema and breaking a JSON document down into tables and foreign keys.
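As a rough illustration of the "search for pages by a property and value" service, the request body could be built with a small helper like the one below. This is a sketch under assumptions: the helper name build_property_query and the default result size are invented for the example, while the match query itself is standard Elasticsearch query DSL.

```python
import json


def build_property_query(prop, value, size=10):
    """Return an Elasticsearch search body matching `value` in field `prop`."""
    return {
        "size": size,                       # maximum number of hits to return
        "query": {"match": {prop: value}},  # full-text match on a single field
    }


if __name__ == "__main__":
    # Print the body so it can be pasted into a search request.
    print(json.dumps(build_property_query("TITLE", "Software Engineer"), indent=2))
```

The resulting dict can be sent as the body of a search request, either from a web console or from the Python client.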
One small change is needed to make sure that the scoring mechanism will work later: with dynamic mapping, Elasticsearch indexes numbers that appear as quoted strings in the JSON as string fields, so changing the mapping of the number fields is a good idea.
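The idea of explicitly mapping the number fields can be sketched as follows. This is not the mapping used in this project; the helper name numeric_properties and the field name RECOMMENDATIONS are placeholders chosen for the example.

```python
import json


def numeric_properties(field_names, es_type="integer"):
    """Build the `properties` block that maps each given field to a numeric type."""
    return {
        "properties": dict((name, {"type": es_type}) for name in field_names)
    }


if __name__ == "__main__":
    # RECOMMENDATIONS is a hypothetical field name, used only for illustration.
    print(json.dumps(numeric_properties(["RECOMMENDATIONS"]), indent=2))
```

Where this block goes in the PUT body depends on the Elasticsearch version: older releases nest it under a document type name inside "mappings", newer ones place it directly under "mappings".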
For those of you unfamiliar with Elasticsearch, just read the docs on their site step by step. It is dead simple to get it up and running: you just download it and run it. That’s it! Of course, making it operational on a cluster, for scalability and sharding, demands more work and knowledge, but its usability is truly amazing. I also recommend downloading and running Kibana to have a good web GUI for working with Elasticsearch. Install Kibana’s Sense plugin with this simple cmd command:
bin\kibana plugin --install elastic/sense
Sense is a web tool accessible at: http://localhost:5601/app/sense
Now run in Sense the mapping PUT found at: https://github.com/look4regev/Linkedin-Scraper-Server/blob/master/ElasticSearch-mapping-init.json
Scoring the profiles
The first thought that came to my mind when thinking about how to implement this was, of course, the straightforward “break the JSON profile into an OOP class, then calculate, then store the score as an additional field of the mapping in Elasticsearch”. That is hard to implement and maintain, since all the documents must be recalculated upon each change of the algorithm. The answer is to use the wonder of Elasticsearch: the scoring function!
Elasticsearch lets you interfere with the scoring of a query’s results. It is a very mature and versatile module in Elasticsearch. Function scoring is still very fast thanks to the virtues of the index. The following query averages the score of a string search for an academic degree with the number of recommendations:
'EDUCATION.NAME': 'degree university college academy'
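A query along these lines could be built as shown below. This is a sketch rather than the exact query from the project: it combines a match query on EDUCATION.NAME with a field_value_factor function and boost_mode "avg", which averages the text score with the function score. The helper name build_scored_query and the numeric field name RECOMMENDATIONS are assumptions.

```python
import json


def build_scored_query(terms, text_field="EDUCATION.NAME",
                       number_field="RECOMMENDATIONS"):
    """Return a function_score search body averaging text relevance with a numeric field."""
    return {
        "query": {
            "function_score": {
                # Relevance of the free-text search on the education field
                "query": {"match": {text_field: terms}},
                # Use the numeric field's value as the function score;
                # documents without the field count as 0
                "field_value_factor": {"field": number_field, "missing": 0},
                # Average the query score and the function score
                "boost_mode": "avg",
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_scored_query("degree university college academy"), indent=2))
```

Note that field_value_factor needs a numeric field, which is why the number fields have to be mapped as numbers rather than strings.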
There is a lot more to learn about the scoring options in Elasticsearch. An additional idea for scoring is to use the Boosting query to lower the score of a profile that shows unprofessional signs (a simple example would be a person’s name that is not in Camel Case).
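The Boosting query idea could look roughly like this. It is only a sketch under assumptions: the helper name build_boosting_query, the negative_boost value and the field NAME.raw (assumed to be a non-analyzed copy of the person’s name) are all invented for the example, and "starts with a lowercase letter" is a simplified stand-in for the capitalization check.

```python
import json


def build_boosting_query(positive, negative, negative_boost=0.2):
    """Return a search body that demotes, but does not exclude, hits matching `negative`."""
    return {
        "query": {
            "boosting": {
                "positive": positive,              # what we are searching for
                "negative": negative,              # the "unprofessional sign"
                "negative_boost": negative_boost,  # score multiplier for demoted hits
            }
        }
    }


# NAME.raw is a hypothetical non-analyzed field; the regexp matches
# values that start with a lowercase letter.
example_query = build_boosting_query(
    positive={"match": {"EDUCATION.NAME": "degree university college academy"}},
    negative={"regexp": {"NAME.raw": "[a-z].*"}},
)

if __name__ == "__main__":
    print(json.dumps(example_query, indent=2))
```

Profiles matching the negative clause still appear in the results, only with their score multiplied by negative_boost.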
Software is an art :)