October 6, 2008

PHP: Parsing HTML to find Links

From blogging to log analysis and search engine optimisation (SEO), people are looking for scripts that can parse web pages and RSS feeds from other websites - to see where their traffic is coming from, among other things.

Parsing your own HTML is no problem - assuming you use consistent formatting - but once you set your sights on parsing other people's HTML the frustration really sets in. This page presents some regular expressions and a commentary that will hopefully point you in the right direction.

1. Simplest Case

Let's start with the simplest case - a well formatted link with no extra attributes:

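/<a href=\"([^\"]*)\">(.*)<\/a>/iU

This, believe it or not, is a very simple regular expression (or "regexp" for short). It can be broken down as follows:

  • starts with: <a href="
  • a series of characters up to, but not including, the next double-quote (") - 1st capture
  • the string: ">
  • a series of any characters - 2nd capture
  • ends with: </a>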

We're also using two 'pattern modifiers':

  • i - matches are 'caseless' (upper or lower case doesn't matter)
  • U - matches are 'ungreedy'

The first modifier means that we're matching <A as well as <a. The 'ungreedy' modifier is necessary because otherwise the second captured string could (by being 'greedy') extend from the contents of one link all the way to the end of another link.

One shortcoming of this regexp is that it won't match link tags that include a line break - fortunately there's a modifier for this as well:

/<a\shref=\"([^\"]*)\">(.*)<\/a>/siU

Now the '.' character will match any character including line breaks. We've also changed the first space to a 'whitespace' character type so that it can match a space, tab or line break. It's necessary to have some kind of whitespace in that position so we don't match other tags whose names merely start with the letter a.

For more information on pattern modifiers see the link at the bottom of this page.
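To see what the 'ungreedy' modifier buys us, here is a minimal sketch that runs the pattern with and without U against two links on one line. The sample HTML is made up for illustration, and the pattern is written in a single-quoted PHP string so the double-quotes need no escaping.

```php
<?php
// Hypothetical sample input: two simple links on one line.
$html = '<a href="one.html">One</a> and <a href="two.html">Two</a>';

// The pattern from this section, with the s, i and U modifiers.
$ungreedy = '/<a\shref="([^"]*)">(.*)<\/a>/siU';
// The same pattern without U, so (.*) is greedy.
$greedy = '/<a\shref="([^"]*)">(.*)<\/a>/si';

preg_match_all($ungreedy, $html, $u);
preg_match_all($greedy, $html, $g);

// With U each link is matched separately.
print_r($u[2]);
// Without U the second capture runs on to the last </a> in the input.
print_r($g[2]);
```

Without U there is only one match, and its link text swallows everything between the first opening tag and the final closing tag.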

2. Room for Extra Attributes

Most link tags contain a lot more than just an href attribute. Other common attributes include: rel, target and title. They can appear before or after the href attribute:

/<a\s[^>]*href=\"([^\"]*)\"[^>]*>(.*)<\/a>/siU

We've added extra patterns before and after the href attribute. They will match any series of characters NOT containing the > symbol. It's always better when writing regular expressions to specify exactly which characters are allowed and not allowed - rather than using the '.' character.
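As a quick check, the sketch below runs the simple pattern and the extended pattern against a link that has attributes on both sides of href. The sample tag is invented for the purpose; only the regular expressions come from this page.

```php
<?php
// Hypothetical sample: href sits between two other attributes.
$html = '<a title="Home" href="index.html" target="_blank">Home</a>';

// Pattern from section 1: expects href directly after "<a ".
$simple = '/<a\shref="([^"]*)">(.*)<\/a>/siU';
// Pattern from this section: [^>]* allows other attributes either side.
$extended = '/<a\s[^>]*href="([^"]*)"[^>]*>(.*)<\/a>/siU';

$simpleFound = preg_match($simple, $html);
$extendedFound = preg_match($extended, $html, $m);

var_dump($simpleFound);   // the simple pattern does not match this tag
if ($extendedFound) {
    echo $m[1], ' => ', $m[2], "\n"; // link address => link text
}
```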

3. Allow for Missing Quotes

Up to now we've assumed that the link address is going to be enclosed in double-quotes. Unfortunately there's nothing enforcing this so a lot of people simply leave them out. The problem is that we were relying on the quotes to be there to indicate where the address starts and ends. Without the quotes we have a problem.

It would be simple enough (even trivial) to write a second regexp, but where's the fun in that when we can do it all with one:

/<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

Note: There are many different ways of implementing this regular expression. Some may be better than the example presented here, but "If it ain't broke..."

What can I say? Regular expressions are a lot of fun to work with but when it takes a half-hour to work out where to put an extra ? you really know you're in deep.

Firstly, what's with those extra ?'s?

Because we used the U modifier, all patterns in the regexp default to 'ungreedy'. Adding an extra ? after a ? or * reverses that behaviour back to 'greedy' but just for the preceding pattern. Without this, for reasons that are difficult to explain, the expression fails. Basically anything following href= is lumped into the [^>]* expression.

We've added an extra capture to the regexp that matches a double-quote if it's there: (\"??). There is then a backreference \\1 that matches the closing double-quote - if there was an opening one.

To cater for links without quotes, the pattern to match the link address itself has been changed from [^\"]* to [^\" >]*?. That means that the link can be terminated by not just a double-quote (the previous behaviour) but also a space or > symbol.
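A small sketch of the backreference at work, assuming two made-up links, one quoted and one not. The first capture holds the opening quote (or an empty string), and \1 then demands the same thing after the address.

```php
<?php
// Hypothetical sample: one quoted href and one unquoted href.
$html = '<a href="quoted.html">Quoted</a> <a href=bare.html class=x>Bare</a>';

// Pattern from this section, in a single-quoted string so \1 needs no doubling.
$pattern = '/<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>/siU';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

foreach ($matches as $match) {
    // $match[1] = opening quote or '', $match[2] = address, $match[3] = text
    printf("[%s] %s => %s\n", $match[1], $match[2], $match[3]);
}
```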

4. Refining the Regexp

Given the nature of the WWW there are always going to be cases where the regular expression breaks down. Small changes to the patterns can fix these.

spaces around the = after href:

/<a\s[^>]*href\s*=\s*(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

matching only links starting with http:

/<a\s[^>]*href=(\"??)(http[^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

single quotes around the link address:

/<a\s[^>]*href=([\"\']??)([^\" >]*?)\\1[^>]*>(.*)<\/a>/siU

And yes, all of these modifications can be added to the version above to make one super-regexp, but the result is just too painful to look at so I'll leave that as an exercise.
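For the curious, here is one possible way of combining the three refinements, offered as a sketch rather than a tested production pattern. The sample links and the example.net addresses are made up; the relative link is there to show that the http restriction skips it.

```php
<?php
// Hypothetical sample input covering each refinement.
$html = <<<HTML
<a href = "http://www.example.net/a">A</a>
<a class="x" href='http://www.example.net/b'>B</a>
<a href="/relative.html">Skipped</a>
<a href=http://www.example.net/c>C</a>
HTML;

// Spaces around =, single or double quotes, and http-only addresses.
$pattern = '/<a\s[^>]*href\s*=\s*(["\']??)(http[^" >]*?)\1[^>]*>(.*)<\/a>/siU';

preg_match_all($pattern, $html, $matches, PREG_SET_ORDER);

$addresses = [];
foreach ($matches as $match) {
    $addresses[] = $match[2]; // link address
}
print_r($addresses);
```

The single-quoted address works because the address pattern first swallows the closing quote and then gives it back when the backreference needs it.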

Note: All of the expressions on this page have been tested to some extent, but mistakes can occur in transcribing so please report any errors you may have found when implementing these examples.

5. Using the Regular Expression to parse HTML

Using the default for preg_match_all the array returned contains an array of the first 'capture' then an array of the second capture and so forth. By capture we mean patterns contained in ():

<?php
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches)) {
  # $matches[2] = array of link addresses
  # $matches[3] = array of link text - including HTML code
}

Using PREG_SET_ORDER each link matched has its own array in the return value:

<?php
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

$url = "http://www.example.net/somepage.html";
$input = @file_get_contents($url) or die("Could not access file: $url");
$regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
  foreach($matches as $match) {
    # $match[2] = link address
    # $match[3] = link text
  }
}
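To make the difference between the two return shapes concrete, this sketch runs the same regexp both ways against a fixed string instead of a downloaded page. The input string and the variable names $byCapture and $byLink are assumptions for illustration.

```php
<?php
// Hypothetical input in place of a downloaded page.
$input = '<a href="a.html">A</a> <a href=b.html>B</a>';
$regexp = '<a\s[^>]*href=("??)([^" >]*?)\1[^>]*>(.*)<\/a>';

// Default order: one array per capture.
preg_match_all("/$regexp/siU", $input, $byCapture);
// PREG_SET_ORDER: one array per matched link.
preg_match_all("/$regexp/siU", $input, $byLink, PREG_SET_ORDER);

print_r($byCapture[2]); // all link addresses
print_r($byLink[1]);    // everything captured for the second link
```

The default order suits cases where you only want the list of addresses; PREG_SET_ORDER is easier when address and text are processed together.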

If you find any cases where this code falls down, let us know using the Feedback link below.

Before using this or similar scripts to fetch pages from other websites, we suggest you read through the related article on setting a user agent and parsing robots.txt.

6. First checking robots.txt

As mentioned above, before using a script to download files you should always check the relevant robots.txt file. Here we're making use of the robots_allowed function from the article linked above to determine whether we're allowed to access the file:

<?php
# Original PHP code by Chirp Internet: www.chirp.com.au
# Please acknowledge use of this code by including this header.

ini_set('user_agent', 'NameOfAgent (http://www.example.net)');

$url = "http://www.example.net/somepage.html";
if(robots_allowed($url, "NameOfAgent")) {
  $input = @file_get_contents($url) or die("Could not access file: $url");
  $regexp = "<a\s[^>]*href=(\"??)([^\" >]*?)\\1[^>]*>(.*)<\/a>";
  if(preg_match_all("/$regexp/siU", $input, $matches, PREG_SET_ORDER)) {
    foreach($matches as $match) {
      # $match[2] = link address
      # $match[3] = link text
    }
  }
} else {
  die('Access denied by robots.txt');
}

Now you're well on the way to building a professional web spider. If you're going to use this in practice you might want to look at: caching the robots.txt file so that it's not downloaded every time (a la Slurp); checking the server headers and server response codes; and adding a pause between multiple requests - for starters.

October 2, 2008

Using cURL with PHP

Basic cURL:
<?php
$ch = curl_init('http://www.target.com'); // the target
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); // return the page
$result = curl_exec ($ch); // executing the cURL
curl_close ($ch); // Closing connection

echo $result;
?>
That code fetches the source of target.com and echoes it.

Post via cURL:
<?php
$data = "field_name=field_value&submit_value=submit";

$ch = curl_init('http://www.target.com'); // the target
curl_setopt ($ch, CURLOPT_POST, 1); // telling cURL to POST
curl_setopt ($ch, CURLOPT_POSTFIELDS, $data);
curl_exec ($ch); // executing the cURL
curl_close ($ch); // Closing connection

?>


Simple posting via cURL.
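When field values contain spaces or characters such as &, building the POST string by hand gets error-prone. A small sketch using PHP's http_build_query to do the encoding, with made-up field values, assuming the default '&' argument separator:

```php
<?php
// Hypothetical form fields; the value contains a space and an ampersand.
$fields = [
    'field_name'   => 'Tom & Jerry',
    'submit_value' => 'submit',
];

// Produces a URL-encoded string suitable for CURLOPT_POSTFIELDS.
$data = http_build_query($fields);

echo $data, "\n";
```

The resulting $data can be passed to curl_setopt($ch, CURLOPT_POSTFIELDS, $data) exactly as in the example above.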

- Using cookies with cURL:

(You will find this useful when you are trying to do something that needs a login and a stored cookie.)

<?php
$ch = curl_init();
curl_setopt($ch, CURLOPT_URL, 'http://www.target.com');
curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
curl_setopt($ch, CURLOPT_COOKIEJAR, '/path/to/cookie.txt');
curl_setopt($ch, CURLOPT_COOKIEFILE, '/path/to/cookie.txt');
$result = curl_exec($ch);
curl_close($ch);

echo $result;

?>

Of course, you might need to post to the login form first to get the cookies stored; after that you can do other things while logged in.

- Extra info:

You can set `user-agent`, `referrer` and other `headers` using cURL:

<?php
// set user-agent to DarkMindZ
curl_setopt($ch, CURLOPT_USERAGENT, 'DarkMindZ');

// set referrer to darkmindz.com
curl_setopt($ch, CURLOPT_REFERER, 'http://www.darkmindz.com');
?>


- Making life easier:

This function will help you a lot in making things easier:

<?php
function curl_it($method, $target, $post_var = '')
{
    $ch = curl_init();
    curl_setopt($ch, CURLOPT_URL, $target);
    curl_setopt($ch, CURLOPT_USERAGENT, $_SERVER['HTTP_USER_AGENT']);
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, 1);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1);
    curl_setopt($ch, CURLOPT_COOKIEJAR, '/path/to/cookie.txt');
    curl_setopt($ch, CURLOPT_COOKIEFILE, '/path/to/cookie.txt');

    if ($method == 'POST') {
        curl_setopt($ch, CURLOPT_POST, 1);
        curl_setopt($ch, CURLOPT_POSTFIELDS, $post_var);
    }

    $result = curl_exec($ch);
    curl_close($ch);

    return $result; // hand the response body back to the caller
}

// usage:

$home = curl_it('GET', 'http://www.darkmindz.com'); // get darkmindz.com homepage

$login = curl_it('POST', 'http://www.darkmindz.com', 'user=dude&pass=dude2'); // login using dude:dude2

?>

October 1, 2008

6 tips to write less code in PHP

PHP is a good language, but there are always surprises. And today I've seen an interesting approach in Arnold Daniels's blog. He talks about temporary variables in PHP. This tip is useful to "lazy" developers who do not even want to think about variable names. They may prefer magic names like ${0}, and 0 is a good enough variable name, why not...

But I'm even lazier than Arnold and sure that when there is no variable, there is no problem. So here are a few tips that can make your code shorter and harder to read :-)

1. Use || (or) and && (and) operations instead of if.
// A lot of code
$status = fwrite($h, 'some text');
if (!$status) {
    error_log('Writing failed');
}

// Less code
${0} = fwrite($h, 'some text');
if (!${0}) error_log('Writing failed');

// Even less code
fwrite($h, 'some text') or error_log('Writing failed');

2. Use ternary operator.
// A lot of code
if ($age < 16) {
    $message = 'Welcome!';
} else {
    $message = 'You are too old!';
}

// Less code
$message = 'You are too old!';
if ($age < 16) {
    $message = 'Welcome!';
}

// Even less code
$message = ($age < 16) ? 'Welcome!' : 'You are too old!';

3. Use for instead of while.
// A lot of code
$i = 0;
while ($i < 100) {
    $source[] = $target[$i];
    $i += 2;
}

// Less code
for ($i = 0; $i < 100; $source[] = $target[$i], $i += 2);
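One thing to watch when folding the loop body into the for header is the order of the read and the increment: writing $target[$i += 2] increments first, so it would skip index 0 and read index 100. A minimal sketch, with made-up sample data and the hypothetical names $viaWhile and $viaFor, that checks the two forms copy the same elements:

```php
<?php
// Hypothetical sample data: the integers 0..199.
$target = range(0, 199);

// The long form: copy every second element below index 100.
$viaWhile = [];
$i = 0;
while ($i < 100) {
    $viaWhile[] = $target[$i];
    $i += 2;
}

// The short form: read first, then increment, both in the for header.
$viaFor = [];
for ($i = 0; $i < 100; $viaFor[] = $target[$i], $i += 2);

var_dump($viaWhile === $viaFor);
```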

4. In some cases PHP requires you to create a variable. Some examples you can find in my PHP fluent API tips article. Another example is getting an array element when the array is returned by a function.
$ext = pathinfo('file.png')['extension'];

// result (before PHP 5.4): Parse error: syntax error, unexpected '[' in ... on line ...
To handle all these situations you can create a set of small functions which shortcut frequently used operations:

// returns reference to the created object
function &r($v) { return $v; }

// returns array offset
function &a(&$a, $i) { return $a[$i]; }
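As a usage sketch, here is a by-value variant of the array-offset helper. The name at() is hypothetical and not from the original article; it takes its array by value, so a function result can be passed straight in without the temporary variable a by-reference parameter expects.

```php
<?php
// Hypothetical helper: returns one element of an array, by value.
function at(array $array, $key)
{
    return $array[$key];
}

// No temporary variable needed for the array returned by pathinfo().
$ext = at(pathinfo('file.png'), 'extension');

echo $ext, "\n";

// For this particular case pathinfo() can also return the part directly.
$same = pathinfo('file.png', PATHINFO_EXTENSION);
```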

5. Explore the language you use. PHP is very powerful and has a lot of functions and interesting language features which can make your code shorter and more efficient.

6. When it is better to write more so that the code is easier to read, do not be lazy.
Spend a few seconds writing a comment and a more readable construction. This is the only tip in this list that really can save hours, not minutes.

50+ PHP optimisation tips revisited

After reading an article some time ago entitled “40 Tips for optimizing your php Code” (and some others that are suspiciously similar), I decided to redo it, but properly this time with more accurate tips, providing references and citations for each and every one.

The result is this list of over 50 PHP optimisation tips…

Enjoy!

  1. echo is faster than print. [Citation]
  2. Wrapping your string in single quotes (') instead of double quotes (") is faster because PHP searches for variables inside "…" and not in '…'; use this when your string contains no variables that need evaluating. [Citation]
  3. Use sprintf instead of variables contained in double quotes, it’s about 10x faster. [Citation]
  4. Use echo’s multiple parameters (or stacked) instead of string concatenation. [Citation]
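  5. Use pre-calculations: set the maximum value for your for-loops before and not in the loop. ie: rather than for ($x = 0; $x < count($array); $x++), which calls count() on every iteration, set $max = count($array) before the loop starts and test against $max.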
  6. Unset or null your variables to free memory, especially large arrays. [Citation]
  7. Avoid magic like __get, __set, __autoload. [Citation]
  8. Use require() instead of require_once() where possible. [Citation]
  9. Use full paths in includes and requires, less time spent on resolving the OS paths. [Citation]
  10. require() and include() are identical in every way except require halts if the file is missing. Performance wise there is very little difference. [Citation]
  11. Since PHP5, the time at which the script started executing can be found in $_SERVER['REQUEST_TIME'], use this instead of time() or microtime(). [Citation]
  12. PCRE regex is quicker than EREG, but always see if you can use quicker native functions such as strncasecmp, strpbrk and stripos instead. [Citation]
  13. When parsing with XML in PHP try xml2array, which makes use of the PHP XML functions, for HTML you can try PHP’s DOM document or DOM XML in PHP4. [Citation]
  14. str_replace is faster than preg_replace, str_replace is best overall, however strtr is sometimes quicker with larger strings. Using array() inside str_replace is usually quicker than multiple str_replace. [Citation]
  15. “else if” statements are faster than select statements aka case/switch. [Citation]
  16. Error suppression with @ is very slow. [Citation]
  17. To reduce bandwidth usage turn on mod_deflate in Apache v2 [Citation] or for Apache v1 try mod_gzip. [Citation]
  18. Close your database connections when you’re done with them. [Citation]
  19. $row['id'] is 7 times faster than $row[id], because if you don’t supply quotes it has to guess which index you meant, assuming you didn’t mean a constant. [Citation]
  20. Use <?php ... ?> tags when declaring PHP as all other styles are deprecated, including short tags. [Citation]
  21. Use strict code, avoid suppressing errors, notices and warnings thus resulting in cleaner code and less overheads. Consider having error_reporting(E_ALL) always on. [Citation]
  22. PHP scripts are served 2-10 times slower by Apache httpd than a static page. Try to use static pages instead of server side scripts. [Citation]
  23. PHP scripts (unless cached) are compiled on the fly every time you call them. Install a PHP caching product (such as memcached or eAccelerator or Turck MMCache) to typically increase performance by 25-100% by removing compile times. You can even setup eAccelerator on cPanel using EasyApache3. [Citation]
  24. An alternative caching technique when you have pages that don’t change too frequently is to cache the HTML output of your PHP pages. Try Smarty or Cache Lite. [Citation]
  25. Use isset where possible in place of strlen, ie: rather than comparing strlen($foo) against a minimum length, test with isset whether the character at that offset is set.
  26. ++$i is faster than $i++, so use pre-increment where possible. [Citation]
  27. Make use of the countless predefined functions of PHP, don’t attempt to build your own as the native ones will be far quicker; if you have very time and resource consuming functions, consider writing them as C extensions or modules. [Citation]
  28. Profile your code. A profiler shows you which parts of your code consume how much time. The Xdebug debugger already contains a profiler. Profiling shows you the bottlenecks at a glance. [Citation]
  29. Document your code. [Citation]
  30. Learn the difference between good and bad code. [Citation]
  31. Stick to coding standards, it will make it easier for you to understand other people’s code and other people will be able to understand yours. [Citation]
  32. Separate code, content and presentation: keep your PHP code separate from your HTML. [Citation]
  33. Don’t bother using complex template systems such as Smarty, use the one that’s included in PHP already, see ob_get_contents and extract, and simply pull the data from your database. [Citation]
  34. Never trust variables coming from user land (such as from $_POST) use mysql_real_escape_string when using mysql, and htmlspecialchars when outputting as HTML. [Citation]
  35. For security reasons never have anything that could expose information about paths, extensions and configuration, such as display_errors or phpinfo() in your webroot. [Citation]
  36. Turn off register_globals (it’s disabled by default for a reason!). No script at production level should need this enabled as it is a security risk. Fix any scripts that require it on, and fix any scripts that require it off using unregister_globals(). Do this now, as it’s set to be removed in PHP6. [Citation]
  37. Avoid using plain text when storing and evaluating passwords to avoid exposure, instead use a hash, such as an md5 hash. [Citation]
  38. Use ip2long() and long2ip() to store IP addresses as integers instead of strings. [Citation]
  39. You can avoid reinventing the wheel by using the PEAR project, giving you existing code of a high standard. [Citation]
  40. When using header('Location: ' . $url); remember to follow it with a die(); as the script continues to run even though the location has changed, or avoid using it altogether where possible. [Citation]
  41. In OOP, if a method can be a static method, declare it static. Speed improvement is by a factor of 4. [Citation].
  42. Incrementing a local variable in an OOP method is the fastest, nearly the same as incrementing a local variable in a function; incrementing a global variable is 2 times slower than a local variable. [Citation]
  43. Incrementing an object property (eg. $this->prop++) is 3 times slower than a local variable. [Citation]
  44. Incrementing an undefined local variable is 9-10 times slower than a pre-initialized one. [Citation]
  45. Just declaring a global variable without using it in a function slows things down (by about the same amount as incrementing a local var). PHP probably does a check to see if the global exists. [Citation]
  46. Method invocation appears to be independent of the number of methods defined in the class because I added 10 more methods to the test class (before and after the test method) with no change in performance. [Citation]
  47. Methods in derived classes run faster than ones defined in the base class. [Citation]
  48. A function call with one parameter and an empty function body takes about the same time as doing 7-8 $localvar++ operations. A similar method call is of course about 15 $localvar++ operations. [Citation]
  49. Not everything has to be OOP, often it is just overhead, each method and object call consumes a lot of memory. [Citation]
  50. Never trust user data, escape your strings that you use in SQL queries using mysql_real_escape_string, instead of mysql_escape_string or addslashes. Also note that if magic_quotes_gpc is enabled you should use stripslashes first. [Citation]
  51. Unset your database variables (the password at a minimum), you shouldn’t need it after you make the database connection.
  52. RTFM! PHP offers a fantastic manual, possibly one of the best out there, which makes it a very hands on language, providing working examples and talking in plain English. Please USE IT! [Citation]
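A few of the tips above are easier to judge with code in front of you. The sketch below illustrates tips 5, 25 and 38 with made-up sample values; it shows what the constructs look like, not how much time they save, so profile before relying on any of them. The variable names are hypothetical.

```php
<?php
// Tip 5: call count() once, before the loop, rather than on every pass.
$array = ['a', 'b', 'c'];
$max = count($array);
$seen = [];
for ($x = 0; $x < $max; ++$x) {
    $seen[] = $array[$x];
}

// Tip 25: a string shorter than 5 characters has no character at offset 4,
// so isset() on that offset can stand in for a strlen() comparison.
$foo = 'abcd';
$tooShortByStrlen = strlen($foo) < 5;
$tooShortByIsset = !isset($foo[4]);

// Tip 38: store an IPv4 address as an integer and convert it back.
$asInt = ip2long('192.168.0.1');
$asString = long2ip($asInt);

var_dump($seen, $tooShortByStrlen, $tooShortByIsset, $asString);
```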

If you still need help, try #PHP on the EFnet IRC Network. (Read the !rules first).

Also see:

  • an Excellent Article about optimizing PHP by John Lim
  • PEAR coding standards
  • PHP best practices by ez.no (Use left and right keys to scroll through the pages)
  • Tuning Apache and PHP for Speed on Unix
  • Premature Optimisation
  • PHP and Performance
  • Performance Tuning PHP
  • Develop rock-solid code in PHP
  • 12 PHP optimization tips
  • 10 things you (probably) didn’t know about PHP