The Call of the Web Scraper

Astrid, our Data Big Bang and Nektra content editor, is heading to Nepal on a birding and trekking quest. She needs bird sounds from xeno-canto and The Internet Bird Collection to identify the hundreds of species found in Nepal, but the sites do not offer batch downloads. We could not pass up the opportunity to offer a useful scraper for birders. We found a blog post with code to download batches of recordings for specific species (not specific countries): Web Scraping with BeautifulSoup and Python. Like most script developers, we want to do things our own way: our code allows simultaneous downloads of calls to speed up the process for especially diverse countries.

Web scraping is often associated with indecorous Internet behavior, but in fact it is also a way to automate tedious manual work. Imagine that you want the complete EasyJet schedule in order to choose a flight: scraping all the desired routes can take less than an hour. Right now there are no entry-level tools for scraping sites the way there are for photo editing. Fortunately, script developers share their scraping code on sites like ScraperWiki.


Web Scraping 101: Pulling Stories from Hacker News

This is a guest post by Hartley Brody, whose book “The Ultimate Guide to Web Scraping” goes into much more detail on web scraping best practices. You can follow him on Twitter; it’ll make his day! Thanks for contributing, Hartley!

Hacker News is a treasure trove of information on the hacker zeitgeist. There are all sorts of cool things you could do with the information once you pull it, but first you need to scrape a copy for yourself.

Hacker News is actually a bit tricky to scrape since the site’s markup isn’t all that semantic — meaning the HTML elements and attributes don’t do a great job of explaining the content they contain. Everything on the HN homepage is in two tables, and there aren’t that many classes or ids to help us home in on the particular HTML elements that hold stories. Instead, we’ll have to rely more on patterns and on counting elements as we go.

Pull up the web inspector in Chrome and try zooming up and down the DOM tree. You’ll see that the markup is pretty basic. There’s an outer table that’s basically just used to keep things centered (85% of the screen width) and then an inner table that holds the stories.

Debugging Hacker News Page

If you look inside the inner table, you’ll see that the rows come in groups of three: the first row in each group contains the headlines and story links, the second row contains the metadata about each story — like who posted it and how many points it has — and the third row is empty and adds a bit of padding between stories. This should be enough information for us to get started, so let’s dive into the code.

I’m going to try and avoid the religious tech wars and just say that I’m using Python and my trusty standby libraries — requests and BeautifulSoup — although there are many other great options out there. Feel free to use your HTTP requests library and HTML parsing library of choice.

In its purest form, web scraping is two simple steps: 1. Make a request to a website that generates HTML, and 2. Pull the content you want out of the HTML that’s returned.

As the programmer, all you need to do is a bit of pattern recognition to find the URLs to request and the DOM elements to parse, and then you can let your libraries do the heavy lifting. Our code will just glue the two functions together to pull out just what we need.

import requests
from BeautifulSoup import BeautifulSoup

# make a single request to the homepage
r = requests.get("https://news.ycombinator.com/")

# convert the plaintext HTML markup into a DOM-like structure that we can search
soup = BeautifulSoup(r.text)

# parse through the outer and inner tables, then find the rows
outer_table = soup.find("table")
inner_table = outer_table.findAll("table")[1]
rows = inner_table.findAll("tr")

# create an empty list for holding stories
stories = []

# stories come in groups of three rows; this helps us iterate over the table
rows_per_story = 3

for row_num in range(0, len(rows) - rows_per_story, rows_per_story):
    # grab the 1st & 2nd rows and create an array of their cells
    story_pieces = rows[row_num].findAll("td")
    meta_pieces = rows[row_num + 1].findAll("td")

    # create our story dictionary
    story = {
        "current_position": story_pieces[0].string,
        "link": story_pieces[2].find("a")["href"],
        "title": story_pieces[2].find("a").string,
    }

    try:
        story["posted_by"] = meta_pieces[1].findAll("a")[0].string
    except IndexError:
        continue  # this is a job posting, not a story

    stories.append(story)

import json
print json.dumps(stories, indent=1)

You’ll notice that inside the for loop, as we work through the table’s rows in groups of three, we’re parsing out the individual pieces of content (link, title, etc.) by skipping to a particular index in the list of <td> elements returned. Generally, you want to avoid using magic numbers in your code, but without more semantic markup, this is what we’re left to work with.

This obviously makes the scraping code brittle: if the site is ever redesigned, or the elements on the page move around at all, this code will no longer work as designed. But I’m guessing from the consistently minimalistic, retro look that HN isn’t getting a facelift any time soon. ;)
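If you would rather have the scraper fail gracefully when the layout shifts instead of silently misparsing it, one option is to add a few sanity checks before indexing into the cells. This is only a sketch layered on top of the loop above; the parse_story_rows helper is our own name, not part of the original script.

def parse_story_rows(story_row, meta_row):
    """Return a story dict, or None if the rows don't look like a story."""
    story_pieces = story_row.findAll("td")
    meta_pieces = meta_row.findAll("td")

    # a story row should have at least three cells and a link in the third one
    if len(story_pieces) < 3 or story_pieces[2].find("a") is None:
        return None

    link = story_pieces[2].find("a")
    story = {
        "current_position": story_pieces[0].string,
        "link": link["href"],
        "title": link.string,
    }

    # job postings have no "posted by" link, so skip them too
    if len(meta_pieces) < 2 or not meta_pieces[1].findAll("a"):
        return None

    story["posted_by"] = meta_pieces[1].findAll("a")[0].string
    return story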

Extension Ideas

Running this script top-to-bottom will print out a list of all the current stories on HN. But if you really want to do something interesting, you’ll probably want to grab snapshots of the homepage and the newest page fairly regularly. Maybe even every minute.

There are a number of projects that have already built cool extensions and visualizations from (I presume) scraped Hacker News data, such as:

  • http://hnrankings.info/
  • http://api.ihackernews.com/
  • https://www.hnsearch.com/

It’d be a good idea to set this up using crontab on your web server. Run crontab -e to pull up your machine’s cron jobs in an editor (typically vim), and add a line that looks like this:

* * * * * python /path/to/hn_scraper.py

Then save it and exit (<esc> + “:wq”) and you should be good to go. Obviously, printing things to the command line doesn’t do you much good from a cron job, so you’ll probably want to change the script to write each snapshot of stories into your database of choice for later retrieval.
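For example, a minimal way to persist each snapshot is the sqlite3 module from the standard library. This is just a sketch: the database file, table name, and columns below are our assumptions, not something the original script defines.

import sqlite3
import time

def save_snapshot(stories, db_path="hn_snapshots.db"):
    # store each snapshot as a set of timestamped rows, one per story
    conn = sqlite3.connect(db_path)
    conn.execute("""CREATE TABLE IF NOT EXISTS stories (
                        scraped_at INTEGER,
                        position TEXT,
                        title TEXT,
                        link TEXT,
                        posted_by TEXT)""")
    now = int(time.time())
    for story in stories:
        conn.execute("INSERT INTO stories VALUES (?, ?, ?, ?, ?)",
                     (now, story["current_position"], story["title"],
                      story["link"], story.get("posted_by")))
    conn.commit()
    conn.close()

# then, at the end of the scraper, call save_snapshot(stories) instead of printing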

Basic Web Scraping Etiquette

If you’re going to be scraping any site regularly, it’s important to be a good web scraping citizen so that your script doesn’t ruin the experience for the rest of us… aw who are we kidding, you’ll definitely get blocked before your script causes any noticeable site degradation for other users on Hacker News. But still, it’s good to keep these things in mind whenever you’re making frequent scrapes on the same site.

Your HTTP requests library probably lets you set headers like User-Agent and Accept-Encoding. You should set your user agent to something that identifies you and provides some contact information in case any site admins want to get in touch.

You also want to ensure you’re asking for the gzipped version of the site, so that you’re not hogging bandwidth with uncompressed page requests. Use the Accept-Encoding request header to tell the server your client can accept gzipped responses. The Python requests library automagically unzips those gzipped responses for you.

You might want to modify the requests.get() call in the scraper above to look more like this:

headers = {
    "User-Agent": "HN Scraper / Contact me: ",
    "Accept-Encoding": "gzip",
}
r = requests.get("https://news.ycombinator.com/", headers=headers)

Note that if you were doing the scraping with some sort of headless browser or something like Selenium which actually downloads all the resources on the page and renders them, you’d also want to make sure you’re caching the stylesheet and images to avoid unnecessary extra requests.

If you liked this article, you might also like:

  1. Scraping Web Sites which Dynamically Load Data
  2. Ideas and Execution Magic Chart (includes a Hacker News Search Hack)
  3. Running Your Own Anonymous Rotating Proxies

Scraping Web Sites which Dynamically Load Data

Preface

More and more sites are implementing dynamic updates of their content: new items are added as the user scrolls down. Twitter is one of these sites; it only displays a certain number of items initially and loads additional ones on demand. How can sites with this behavior be scraped?

In the previous article we played with Google Chrome extensions to scrape a forum that depends on Javascript and XMLHttpRequest. Here we use the same technique for retrieving a specific number of news items based on a specific search. A list of additional alternatives is available in the Web Scraping Ajax and Javascript Sites article.
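For readers who prefer to stay outside the browser-extension model, the same scroll-to-load idea can also be sketched with Selenium driving Chrome from Python. This is not the extension code from the repository, only an illustration of the technique; the URL and CSS selector you pass in are up to you.

import time
from selenium import webdriver

def scrape_with_scrolling(url, item_selector, wanted_items=100):
    # drive a real browser so Javascript and XMLHttpRequest run normally
    driver = webdriver.Chrome()
    driver.get(url)

    items = []
    while len(items) < wanted_items:
        previous_count = len(items)
        # scroll to the bottom to trigger the next batch of results
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(2)  # crude wait; give the new items time to load
        items = driver.find_elements_by_css_selector(item_selector)
        if len(items) == previous_count:
            break  # nothing new was loaded, stop scrolling

    driver.quit()
    return [item.text for item in items[:wanted_items]]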

Code

Instructions

  1. Download the code from github
  2. Load the extension in Google Chrome: settings => extensions => check “developer mode” => load unpacked extension
  3. An “eye” icon now appears on the Google Chrome bar
  4. Go to Twitter’s search page https://twitter.com/search-home and enter your search keywords
  5. Now press the “eye” and then the start button
  6. The scraping output is displayed on the console as JSON

Customization

  1. To modify the number of items to be scraped, open the file inject.js and change the scrollBottom(100); line to the number of items you would like (e.g., scrollBottom(200);)

Acknowledgments

This source code was written by Matias Palomera from Nektra Advanced Computing.


Precise Scraping with Google Chrome

Developers often search the vast corpus of scraping tools for one that is capable of simulating a full browser. The search is unnecessary: full browsers with extension capabilities are great scraping tools. Among extensions, Google Chrome’s are by far the easiest to develop, while Mozilla’s APIs are less restrictive. Google offers a second way to control Chrome, the Debugger protocol, but unfortunately it is pretty slow.

The Google Chrome extension API is an excellent choice for writing an up-to-date scraper that uses a full browser with the latest HTML5 features and performance improvements. In a previous article, we described how to scrape the Microsoft TechNet App-V forum. Now we will focus on VMware’s ThinApp. In this case, we develop a Google Chrome extension instead of a Python script.

Procedure

  1. You will need Google Chrome, Python 2.7, and lxml.html
  2. Download the code from github
  3. Install the Google Chrome extension
  4. Enter the VMware ThinApp: Discussion Forum
  5. The scraper starts automatically
  6. Once it stops, go to the Google Chrome console and copy and paste the JSON results into the thinapp.json file
  7. Run thinapp_parser.py to generate the thinapp.csv file with the results
  8. Open the thinapp.csv file with a spreadsheet
  9. To rank the results, add a column which divides the number of views by the number of days.

Our Results: Top Twenty Threads

  1. Registry Isolation…
  2. Thinapp Internet Explorer 10
  3. Process (ifrun60.exe) remains active (Taskmanager) after closing thinapp under windows7 (xp works)
  4. Google Chrome browser
  5. File association not passing file to thinapp package
  6. Adobe CS3 Design Premium and FlexNET woes…
  7. How to thinapp Office 2010?
  8. Size limit of .dat file?
  9. ThinApp Citrix Receiver 3.2
  10. Visio 2010 Thinapp – Licensing issue
  11. Thinapp Google Chrome
  12. Thinapp IE7 running on Windows 7
  13. Adobe CS 6
  14. Failed to open, find, or create Sandbox directory
  15. Microsoft Project and Office issues
  16. No thinapp in thinapp factory + unable to create workpool
  17. IE8 Thinapp crashing with IE 10 installed natively
  18. ThinApp MS project and MS Visio 2010
  19. Difference between ESXi and vSphere and VMware view ??
  20. ThinAPP with AppSense

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  • This approach can be successfully used to scrape heavy Javascript and AJAX sites
  • Instead of copying the JSON data from the Chrome console, you can use the FileSystem API to write the results to a file
  • You can also write the CSV directly from Chrome instead of using an extra script

If you liked this article, you might also be interested in:

  1. Scraping for Semi-automatic Market Research
  2. Application Virtualization Benchmarking: Microsoft App-V Vs. Symantec
  3. Web Scraping Ajax and Javascript Sites [using HTMLUnit]
  4. Distributed Scraping With Multiple Tor Circuits
  5. VMWare ThinApp vs. Symantec Workspace

Resources

  1. Application Virtualization Smackdown
  2. Application Virtualization Market Report

Web Scraping for Semi-automatic Market Research

It is easy to web scrape the Microsoft TechNet Forums (look at the XML output here: http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads?outputAs=xml) and normalize the resulting information to get a better idea of each thread’s rank based on views and initial publication date. Knowing how issues are ranked can help a company choose what to focus on.

This code was used to scrape Microsoft TechNet’s forums. In the example below we scraped the App-V forum, since App-V is one of the application virtualization market’s leaders along with VMware ThinApp and Symantec Workspace Virtualization.

These are the top ten threads for the App-V forum:

  1. “Exception has been thrown by the target of an invocation”
  2. Office 2010 KMS activation Error: 0xC004F074
  3. App-V 5 Hotfix 1
  4. Outlook 2010 Search Not Working
  5. Java 1.6 update 17 for Kronos (webapp)
  6. Word 2010 There was a problem sending the command to the program
  7. Utility to quickly install/remove App-V packages
  8. SAP GUI 7.1
  9. The dreaded: “The Application Virtualization Client could not launch the application”
  10. Sequencing Chrome with 4.6 SP1 on Windows 7 x64

The results show how frequently customers have issues with virtualizing Microsoft Office, Key Management Services (KMS), SAP, and Java. App-V competitors like Symantec Workspace Virtualization and VMWare ThinApp have similar problems. Researching markets this way gives you a good idea of areas where you can contribute solutions.

The scraper stores all the information in a SQLite database. The database can be exported to a UTF-8 CSV file using the csv_App-V.py script. We imported the file with Microsoft Excel and then normalized the ranking of the threads: we divided the number of views by the age of the thread, so threads with more views per day rank higher. Again, the scraper can be used on any Microsoft forum on Social TechNet. Try it out on your favorite forum.
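The normalization itself is a small calculation. Here is a sketch of it over the exported CSV; the column names are assumptions about the export, not the actual csv_App-V.py output, so adjust them to match your file.

import csv
from datetime import datetime

def rank_threads(csv_path):
    # assumed columns: title, views, created (as YYYY-MM-DD)
    with open(csv_path) as f:
        threads = list(csv.DictReader(f))

    today = datetime.utcnow()
    for thread in threads:
        created = datetime.strptime(thread["created"], "%Y-%m-%d")
        age_days = max((today - created).days, 1)  # avoid dividing by zero
        thread["views_per_day"] = float(thread["views"]) / age_days

    # threads with more views per day rank higher
    return sorted(threads, key=lambda t: t["views_per_day"], reverse=True)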

Code

Prerequisites: lxml.html

The code is available at microsoft-technet-forums-scraping [github]. It was written by Matias Palomera from Nektra Advanced Computing, who received valuable support from Victor Gonzalez.

Usage

  1. Run scrapper-App-V.py
  2. Then run csv_App-V.py
  3. The results are available in the App-V.csv file

Acknowledgments

Matias Palomera from Nektra Advanced Computing wrote the code.

Notes

  1. This is single-threaded code. You can take a look at our discovering web resources code to optimize it with multithreading.
  2. Microsoft has given scrapers a special gift: it is possible to use the outputAs variable in the URL to get the structured information as XML instead of parsing HTML web pages (see the sketch just after this list).
  3. Our articles Distributed Scraping With Multiple Tor Circuits and Running Your Own Anonymous Rotating Proxies show how to implement your own rotating proxy infrastructure with Tor.
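As a quick sketch of the second note, the XML output can be fetched and inspected with requests and lxml before writing any real parsing code; the element names in the feed are not documented here, so we only print the structure.

import requests
from lxml import etree

# the outputAs=xml shortcut mentioned above
url = ("http://social.technet.microsoft.com/Forums/en-US/mdopappv/threads"
       "?outputAs=xml")
response = requests.get(url)
root = etree.fromstring(response.content)

# print the tag names and attributes of the top-level elements to see
# what the feed actually contains
for element in root:
    print("%s %s" % (element.tag, element.attrib))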

If you liked this article, you might also like:

  1. Nektra and VMware are Collaborating to Simplify Application Virtualization Packaging
  2. Automated Discovery of Social Media Identities

Challenging Google’s Search Engine

Google is the undisputed search leader (88% market share in the US [1]). Google is not only ahead of competitors in quality of search results, infrastructure worthy of science fiction, and computer science research; it is also remarkably quick to apply its own research.

How can Google be dethroned? Sure, there are other search engines, and newcomers like Blekko and Duck Duck Go make headlines from time to time. However, when you look more closely at those other search engines, you find that they cannot seriously compete with Google.

Benchmarking Search Engines

A search for “reverse engineering” on Blekko returns a hundred thousand results, while the same search on Google returns approximately two million. If it is so difficult for Blekko to compete at the crawling level, imagine what happens in the rest of the search engine pipeline. Just looking at Google’s search quality reports tells you that PageRank was only the catalyst for much more sophisticated algorithms.

Duck Duck Go manages to attract a geeky audience with highlighted features like putting privacy first. If we search for “reverse engineering” on Duck Duck Go, the results seem wacky: the second result is http://reverseengineeringinc.com/, a content-poor site which just has the right domain name.

Google appears to be in a league by itself. It currently seems unlikely that they could lose significant market share due to an engineering weakness. In order to outdo Google, we must think holistically and try to guess how the web as a whole will evolve over the next ten years.

A Holistic Approach

Duck Duck Go created a two-level search engine for sites like Wikipedia or YouTube. DDG offers the DuckDuckGo Instant Answer API to incorporate the search engines of third parties. In order to take advantage of DDG and other two-tiered search engines, sites will have to improve their local search. Currently, if you search using the local site search on Stack Overflow, for example, the results are of much lower quality than the same query on Google restricted to stackoverflow.com. When each site understands its own data better than Google does, its internal search results will surpass Google’s. Google will no doubt continue to provide better global results, but two-tiered search would decentralize efforts to improve algorithms. It is important to note that this solution does not need to be distributed: sites can share their local indexes and ranking algorithms with the routing search engine.

The fact that a small number of sites receive the majority of Internet traffic means that optimizing the top sites for a two layer search would make a big difference.

Notes

  1. Search Engine Market Share

Additional Resources

  1. To Break Google’s Monopoly on Search, Make Its Index Public
  2. Can a small search engine take on Google?
  3. Google’s Weakness, AltaVista’s Strength [2002]
  4. Is the Internet driving competition or market monopolization?
  5. Media and Internet concentration in Canada

See Also

  1. Reverse Engineering and The Cloud
  2. Egont, A Web Orchestration Language
  3. Helping Search Engines to Find Content in the Invisible Web
  4. The Data Portability Fact Sheet

Parsing S-Expressions in C# using OMeta

It is easy to parse S-Expressions in C# with OMeta. Our code limits the grammar to lists, and to atoms of string, symbol, and number types. So it is not complete, but it can easily be expanded with OMeta. What motivated me to write this article was the lack of publicly available S-Expression parsers for C#/.NET.

Our parser converts the expression (+ (* 3 4 5 6) (- 7 1) ) to the following tree:

[Figure: parsed-s-expression tree]

Each vertex is represented by a C# class containing an ArrayList, Symbol, String, or Integer. Note that the expression (1) is different from the expression without parentheses: the first is a list with one atom and the other is just the atom.
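To make the tree structure concrete, here is a minimal, illustrative S-expression parser written in Python. It is not the OMeta# grammar from the repository; it handles the same shapes (nested lists plus string, symbol, and number atoms) and nothing more.

import re

TOKEN = re.compile(r'\s*(\(|\)|"[^"]*"|[^\s()]+)')

def parse(text):
    tokens = TOKEN.findall(text)
    tree, rest = read_expression(tokens)
    if rest:
        raise ValueError("trailing tokens: %r" % rest)
    return tree

def read_expression(tokens):
    token, rest = tokens[0], tokens[1:]
    if token == "(":
        items = []
        while rest and rest[0] != ")":
            item, rest = read_expression(rest)
            items.append(item)
        return items, rest[1:]          # drop the closing ")"
    if token.startswith('"'):
        return token[1:-1], rest        # string atom
    try:
        return int(token), rest         # number atom
    except ValueError:
        return token, rest              # symbol atom

# parse("(+ (* 3 4 5 6) (- 7 1))") returns ['+', ['*', 3, 4, 5, 6], ['-', 7, 1]]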

S-Expressions are a compact way to express programs and data structures. They were first defined for Lisp, but are used in a variety of areas including public key infrastructure. We use S-Expressions to define data flows in Egont, our web orchestration language. In Egont, each S-Expression produces a tree which is converted into a directed acyclic graph, the subject of a future post.

OMeta can be used from C# via the OMeta# project. This makes it more interesting, since classical lexical analyzer and parser generators such as Lex/Flex and Yacc/GNU Bison do not produce C# code. ANTLR is an interesting alternative, but at the time of this post the latest version, ANTLR 4, does not support C#. OMeta’s ability to deal with ambiguities makes it better suited to playing with grammars. However, there are performance penalties in OMeta which must be taken into account.

Code

The code is available as SExpression.NET [github.com].

  1. Compile the RebuildParser project first
  2. Run the Test project
  3. The SExpression project contains the SExpression.ometacs parser and its related C# classes

See Also

  1. Egont, a [Social] Web Orchestration Language
  2. Egont Part II

Additional Resources

  1. IronMeta: another OMeta implementation in C#
  2. YaYAML: a YAML parser written in OMeta#
  3. OMeta Performance
  4. Domain-Specific Languages: An Annotated Bibliography

Searching for Substrings in Streams: a Slight Modification of the Knuth-Morris-Pratt Algorithm in Haxe

It is odd that the base libraries of most programming languages do not allow you to search for regular expressions and substrings in streams or partial reads. We have modified the KMP algorithm so that it accepts a virtually infinite sequence of partial strings. The code is implemented in Haxe, so it can generate code for multiple programming languages.

Streams are important when working with data that does not fit in main memory, such as large files, or with data that is being transferred. There are a few implementations of regular expression and substring matching over streams. One is Jakarta Regexp, now retired and resting in the Apache Attic; the “match” method of its RE class takes a CharacterIterator as a parameter. In C++, Boost.Regex implements partial matches.
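To show the idea independently of any particular language, here is a sketch in Python of a KMP matcher that keeps its state between chunks, so a stream can be fed piece by piece and matches are reported by their absolute offsets. It illustrates the approach; it is not a port of the repository code.

class StreamMatcher(object):
    """Knuth-Morris-Pratt matching over a stream fed in arbitrary chunks."""

    def __init__(self, pattern):
        self.pattern = pattern
        self.failure = self._build_failure(pattern)
        self.state = 0        # how many pattern characters currently match
        self.offset = 0       # absolute position in the stream

    @staticmethod
    def _build_failure(pattern):
        failure = [0] * len(pattern)
        k = 0
        for i in range(1, len(pattern)):
            while k > 0 and pattern[i] != pattern[k]:
                k = failure[k - 1]
            if pattern[i] == pattern[k]:
                k += 1
            failure[i] = k
        return failure

    def feed(self, chunk):
        """Return the absolute start positions of matches completed in this chunk."""
        matches = []
        for ch in chunk:
            while self.state > 0 and ch != self.pattern[self.state]:
                self.state = self.failure[self.state - 1]
            if ch == self.pattern[self.state]:
                self.state += 1
            if self.state == len(self.pattern):
                matches.append(self.offset + 1 - len(self.pattern))
                self.state = self.failure[self.state - 1]
            self.offset += 1
        return matches

# matcher = StreamMatcher("abab")
# matcher.feed("xxab") + matcher.feed("abab")  ->  [2, 4]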

Our code is implemented in Haxe, so the same code can target Javascript, ActionScript, Flash SWF, NekoVM, PHP, C++, C#, and Java. We really like the concept of writing code once and deploying it to a variety of platforms with minimum effort. There are excellent libraries that exist only in specific environments but would work perfectly in others, and porting libraries from one programming language to another is tedious. For example, the amazing NetworkX graph library, implemented in Python, could be ported to C# to benefit a broader audience.

Code

Prerequisites

  1. Haxe (tested on version 2.10)
  2. For C++: hxcpp (run haxelib install hxcpp)
  3. For Java: hxjava (run haxelib install hxjava)
  4. For Mono/C#: hxcs (run haxelib install hxcs)

Source code available on github.

See Also

  1. Parsing S-Expressions in C# using OMeta
  2. Esoteric Queue Scheduling Disciplines

Resources

  1. Knuth-Morris-Pratt string matching
  2. Text Searching: Theory and Practice
  3. Boyer–Moore–Horspool algorithm
  4. Rabin–Karp algorithm
  5. Aho–Corasick string matching algorithm
  6. Lexicographically minimal string rotation
  7. Efficient way to search a stream for a string

Enriching a List of URLs with Google Page Rank

Dealing with a large body of web resources can be daunting. You make a list of hundreds of blogs, but how do you share or recall those resources later? You must somehow organize your list. Many people do this with tags, but tags are not necessarily the best option, and manual organization is tedious, so tools for enriching data automatically come in handy. The relevance of different resources also changes over time: what we originally tagged as a “breakthrough” may become insignificant.

Last week I saw a friend who had recently started a new job and wanted my opinion about current and future technological trends. I wanted to give him links to the thousands of resources that I have accumulated over the years, but organized in such a way that he would not have to view them one at a time. This triggered an avalanche of ideas about how to enrich lists of links. My first thought was to rank my list of sites about venture capital and data science using Google Page Rank. I also considered adding the number of tweets, likes, and “+1”s for each site, but these are generally awarded to individual articles, not whole sites. I ended up adding the Google Page Rank using the pagerank project.

The most interesting ideas to explore, though, are in another direction: how to boost items that are in the long tail. The best music may not make the Top 40, and so remains invisible. Algorithms better at recognizing value in the long tail would revolutionise the economy.

The code is available on github. Two examples of the output are available on data-science-bundle and venture-capital-bundle.

See Also

  1. Automated Discovery of Blog Feeds and Twitter, Facebook, LinkedIn Accounts Connected to Business Website
  2. Exporting StackOverflow users blogs to Excel
  3. Data Science Resources

Esoteric Queue Scheduling Disciplines

New Challenges Require New Tools

Big Data challenges current message oriented middleware (MOM) applications. MOM usually works with FIFO and priority scheduling disciplines. What happens if there is a large list of URLs ready to be crawled, but you want to give URLs at the end of the list a chance of being crawled earlier? This concept comes from genetics, where it is used in genetic algorithm selection schemes. The last URLs may contain interesting new resources in spite of their order or priority. With a FIFO queue it takes a long time to reach these URLs. Priority scheduling is more helpful, but it is not possible to know a priori how useful a URL will end up being in the quest for new Internet resources. Why not add a chance factor to URL selection by using roulette wheel scheduling and an efficient algorithm?

Data flows follow an order of execution based on task dependencies. One task cannot start until the preceding tasks have finished. This is the way a spreadsheet works. A change in a cell triggers a series of processes to be completed in topological order. Why not add task dependency to MOM applications? An item can be consumed from the queue only if its precedent tasks have been completed. We provide some data flow resources at the end of the Egont Part II article. However, a new queue scheduling discipline could be used in place of a separate framework. Ideally, the new queue discipline would include features such as persistence and transactions.

Roulette Wheel Scheduling Algorithm Design

To the best of our knowledge, there are currently no Internet resources about using a roulette wheel scheduling discipline for a queue.

The external interface of a roulette wheel queue is the same as that of a typical queue, with “get” and “put” methods, except that “put” takes an additional probabilistic parameter. Probabilities can be expressed as integers. When a consumer requests an item, a random number is generated to decide which item is selected. Items with higher probabilities have a greater chance of being retrieved, but even items with low probabilities can eventually be consumed.

The implementation of an efficient roulette wheel queue is not easy. Genetic algorithms use roulette wheel selection to choose between a small set of alternatives, but a queue used for crawling can contain a huge number of URLs, and the question is how to find, add, and remove URLs efficiently at that scale.

Finding an item in a roulette wheel data structure is O(n) with trivial traversal and O(log n) using binary search over the cumulative sums. Adding an item is trivial: it can be appended at the end, and the new total is the previous total of the probability parameters plus the new item’s probability parameter. Removing an item is more difficult. The trivial, but not the best, solution is to recalculate all the partial sums after the element being removed. A better solution is to use a heap tree data structure or one of its variants.

An alternative that merits further study is the use of Fenwick trees. In 1994, Peter M. Fenwick showed how to find, add, and modify items and compute their cumulative sums efficiently. Since a Fenwick tree works over a fixed range of items, item keys must be preallocated.
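As a sketch of how such a queue could look (a minimal Python illustration under the assumptions above, not a production MOM component): every preallocated slot holds an integer weight in a Fenwick tree, put() sets an item and its weight, and get() draws a random point in the cumulative total, descends the tree to find the owning slot, and zeroes that slot's weight to remove the item.

import random

class RouletteWheelQueue(object):
    """Weighted random-consumption queue backed by a Fenwick tree (a sketch)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.tree = [0] * (capacity + 1)     # Fenwick tree of weights, 1-based
        self.weights = [0] * (capacity + 1)
        self.items = [None] * (capacity + 1)

    def _add(self, slot, delta):
        # O(log n) point update of the cumulative weights
        while slot <= self.capacity:
            self.tree[slot] += delta
            slot += slot & (-slot)

    def _total(self):
        # O(log n) sum of all weights
        slot, total = self.capacity, 0
        while slot > 0:
            total += self.tree[slot]
            slot -= slot & (-slot)
        return total

    def put(self, slot, item, weight):
        """Place item in a 1-based slot with the given integer weight."""
        self._add(slot, weight - self.weights[slot])
        self.weights[slot] = weight
        self.items[slot] = item

    def get(self):
        """Consume one item at random; heavier items are more likely."""
        total = self._total()
        if total <= 0:
            raise IndexError("queue is empty")
        target = random.randint(1, total)
        # descend the implicit tree to find the slot that owns `target`: O(log n)
        slot, step = 0, 1
        while step * 2 <= self.capacity:
            step *= 2
        while step:
            nxt = slot + step
            if nxt <= self.capacity and self.tree[nxt] < target:
                slot = nxt
                target -= self.tree[nxt]
            step //= 2
        slot += 1
        item = self.items[slot]
        # "removing" the item is just dropping its weight back to zero
        self._add(slot, -self.weights[slot])
        self.weights[slot] = 0
        self.items[slot] = None
        return item

# q = RouletteWheelQueue(capacity=1000)
# q.put(1, "http://example.com/a", 10)
# q.put(2, "http://example.com/b", 1)
# q.get()   # returns the first URL roughly ten times more often than the second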

See Also

  1. Using Queues in Web Crawling and Analysis Infrastructure
  2. Persisting Native Python Queues
  3. Adding Acknowledgement Semantics to a Persistent Queue
  4. Ideas: Egont, A Web Orchestration Language

Resources

  1. A New Data Structure for Cumulative Frequency Tables
  2. Select random k elements from a list whose elements have weights and the roulette wheel answer
  3. A comparative analysis of selection schemes used in genetic algorithms
  4. A Framework for Alternate Queueing: Towards Traffic Management by PC-UNIX Based Routers
  5. Stack and Queue Layouts of Directed Acyclic Graphs
  6. Dynamic Data Structures for Taskgraph Scheduling Policies with Applications in OpenCL Accelerators

Photo taken by Kristofer Björkman