From blogging to log analysis and search engine optimisation (SEO), people are looking for scripts that can parse web pages and RSS feeds from other websites - to see where their traffic is coming from, among other things.
Parsing your own HTML is no problem - assuming you use consistent formatting - but once you set your sights on parsing other people's HTML the frustration really sets in. This page presents some regular expressions and a commentary that will hopefully point you in the right direction.
1. Simplest Case
Let's start with the simplest case - a well formatted link with no extra attributes:
/<a href=\"([^\"]*)\">(.*)<\/a>/iU
This, believe it or not, is a very simple regular expression (or "regexp" for short). It can be broken down as follows:
- starts with: <a href="
- a series of characters up to, but not including, the next double-quote (") - 1st capture
- the string: ">
- a series of any characters - 2nd capture
- ends with: </a>
We're also using two 'pattern modifiers':
- i - matches are 'caseless' (upper or lower case doesn't matter)
- U - matches are 'ungreedy'
The first modifier means that we're matching <A as well as <a. The 'ungreedy' modifier is necessary because otherwise the second captured string could (by being 'greedy') extend from the contents of one link all the way to the end of another link.
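As a rough illustration of why the U modifier matters, the following sketch runs the simplest-case pattern with and without it. The sample HTML string and the variable names are made up for this example.

```php
<?php
// Two links on one line, so a greedy (.*) has room to overrun.
$html = '<a href="one.html">One</a> and <A HREF="two.html">Two</A>';

// With U, each quantifier stops as early as it can: two separate matches.
preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/iU", $html, $ungreedy);

// Without U, (.*) runs on to the last closing tag: one oversized match.
preg_match_all("/<a href=\"([^\"]*)\">(.*)<\/a>/i", $html, $greedy);

print_r($ungreedy[2]); // link text of each match: One, Two
print_r($greedy[2]);   // a single entry spanning both links
```

The i modifier is what allows the upper-case second link to match in both runs.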
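One shortcoming of this regexp is that it won't match a link that contains a line break. Adding the s modifier lets '.' match line breaks, and replacing the space after <a with \s lets it match any whitespace character:
/<a\shref=\"([^\"]*)\">(.*)<\/a>/siU
For more information on pattern modifiers see the link at the bottom of this page.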
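To see what the s modifier changes, this small sketch tries a link that is broken across lines, once without s and once with it. The sample string is invented for the example.

```php
<?php
// A link whose tag and text both contain a line break.
$html = "<a\nhref=\"page.html\">Two\nlines</a>";

// Without s, '.' refuses to cross the line break inside the link text.
$without_s = preg_match("/<a\shref=\"([^\"]*)\">(.*)<\/a>/iU", $html, $m1);

// With s, '.' matches line breaks too, so the link is found.
$with_s = preg_match("/<a\shref=\"([^\"]*)\">(.*)<\/a>/siU", $html, $m2);

var_dump($without_s, $with_s); // 0 matches, then 1 match
```

In both patterns \s is what lets the line break between <a and href match.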
2. Room for Extra Attributes
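Most link tags carry other attributes (title, class, target and so on), so we allow any characters other than > on either side of the href attribute:
/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU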
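A quick check of this version against a tag with attributes both before and after href might look like the sketch below. The sample markup is an assumption made for illustration.

```php
<?php
// Attributes before and after href should no longer block the match.
$html = '<a class="nav" href="about.html" title="About us">About</a>';

$regexp = "<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>";
preg_match("/$regexp/siU", $html, $match);

echo $match[1], "\n"; // link address: about.html
echo $match[2], "\n"; // link text: About
```

Because [^>]* can never cross a >, the extra attributes are skipped without the match spilling into a neighbouring tag.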
3. Allow for Missing Quotes
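/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
Firstly, what's with those extra ?'s? Because the U modifier makes every quantifier ungreedy by default, following a ? or * with an extra ? switches that quantifier back to greedy. Here (\"??) captures the opening double-quote if there is one, ([^\" >]*?) captures the address up to the next double-quote, space or >, and \\1 requires the same closing quote (or nothing) as was captured at the start.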
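The sketch below runs this pattern over one unquoted and one quoted link to show what lands in each capture. The sample HTML is made up for the example.

```php
<?php
// One link without quotes around the address, one with.
$html = '<a href=plain.html>Unquoted</a> '
      . '<a href="quoted.html" class=x>Quoted</a>';

$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches);

// $matches[1] = the opening quote of each link, or '' when there was none
// $matches[2] = link addresses
// $matches[3] = link text
print_r($matches[2]);
```

Note that the link address has moved from the first capture to the second, since the optional quote now occupies the first.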
4. Refining the Regexp
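spaces around the = after href:
/<a\s[^>]*href\s*=\s*?(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
(The second \s* carries an extra ? to keep it greedy under the U modifier; left ungreedy it matches nothing, and an address that follows a space is captured as an empty string.)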
matching only links starting with http:
/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
single quotes around the link address:
/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU
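These refinements can also be combined. The pattern in the following sketch is one possible combination, not a pattern taken from this page: it allows spaces around =, accepts either quote style, and keeps only addresses starting with http. It also writes the whitespace after = as \s*? so that it stays greedy under U, and adds the single quote to the excluded characters; both tweaks are assumptions of this example.

```php
<?php
$html = "<a href = 'http://www.example.net/a.html'>Spaced</a> "
      . '<a href="/local.html">Local</a> '
      . '<a href="http://www.example.net/b.html">Quoted</a>';

// Hypothetical combined pattern (see the assumptions noted above).
$regexp = "<a\s[^>]*href\s*=\s*?([\"\']??)(http[^\"\' >]*?)\\1[^>]*>(.*)<\/a>";
preg_match_all("/$regexp/siU", $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
  // $match[2] = link address, $match[3] = link text
  echo $match[2], ' => ', $match[3], "\n";
}
```

The relative link /local.html is skipped because its address does not start with http.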
5. Using the Regular Expression to parse HTML
Using the default ordering for preg_match_all (PREG_PATTERN_ORDER), the array returned contains an array of the first 'capture', then an array of the second capture, and so forth. By capture we mean patterns contained in ():
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
# double quotes so that $url is interpolated into the error message
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
  # $matches[2] = array of link addresses
  # $matches[3] = array of link text - including HTML code
}
Using PREG_SET_ORDER, each link matched has its own array in the return value:
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
# double quotes so that $url is interpolated into the error message
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
  foreach($matches as $match) {
    # $match[2] = link address
    # $match[3] = link text
  }
}
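To compare the two return formats without fetching a page, the following sketch runs the same pattern over a short, made-up HTML string in both modes. The variable names $by_pattern and $by_set are just illustrative.

```php
<?php
$input = '<a href="a.html">A</a><a href=b.html>B</a>';
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";

// Default order: one array per capture, indexed by match number.
preg_match_all("/$regexp/siU", $input, $by_pattern);

// Set order: one array per match, indexed by capture number.
preg_match_all("/$regexp/siU", $input, $by_set, PREG_SET_ORDER);

echo $by_pattern[2][1], "\n"; // address of the second link: b.html
echo $by_set[1][2], "\n";     // the same value, reached the other way round
```

The default order is convenient when you only want the list of addresses; set order is easier when each address has to stay paired with its link text.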
If you find any cases where this code falls down, let us know using the Feedback link below.
Before using this or similar scripts to fetch pages from other websites, we suggest you read through the related article on setting a user agent and parsing robots.txt.
6. First checking robots.txt
As mentioned above, before using a script to download files you should always check the relevant robots.txt file. Here we're making use of the robots_allowed function from the article linked above to determine whether we're allowed to access the file:
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

ini_set('user_agent', 'NameOfAgent (http://www.example.net)');

$url = "http://www.example.net/somepage.html";
if(robots_allowed($url, "NameOfAgent")) {
  # double quotes so that $url is interpolated into the error message
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
      # $match[2] = link address
      # $match[3] = link text
    }
  }
} else {
  die('Access denied by robots.txt');
}
Now you're well on the way to building a professional web spider. If you're going to use this in practice you might want to look at: caching the robots.txt file so that it's not downloaded every time (a la Slurp); checking the server headers and server response codes; and adding a pause between multiple requests - for starters.
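As a starting point for two of those suggestions, here is a minimal sketch that checks the response code and pauses between requests. The function names response_code and polite_fetch, the two-second default pause, and the rule of keeping only 200 responses are all assumptions made for this example; robots.txt checking and caching are left out.

```php
<?php
// Hypothetical helper: pull the final status code out of a header list such
// as the $http_response_header array that file_get_contents() fills in.
function response_code(array $headers): int {
  // After redirects there are several status lines; the last one counts.
  foreach (array_reverse($headers) as $line) {
    if (preg_match('/^HTTP\/\S+\s+(\d{3})/', $line, $m)) {
      return (int) $m[1];
    }
  }
  return 0; // no status line found
}

// Hypothetical helper: fetch each URL in turn, pausing between requests
// and keeping only the pages that came back with a 200 response.
function polite_fetch(array $urls, int $pause_seconds = 2): array {
  $pages = [];
  $first = true;
  foreach ($urls as $url) {
    if (!$first) {
      sleep($pause_seconds);
    }
    $first = false;
    $body = @file_get_contents($url);
    $code = response_code($http_response_header ?? []);
    if ($body !== false && $code === 200) {
      $pages[$url] = $body;
    }
  }
  return $pages;
}
```

A call such as polite_fetch(array('http://www.example.net/somepage.html')) would return an array of page contents keyed by URL, ready to be passed to preg_match_all.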