Web scraping – part 2
In the previous article we identified the footprints to use and now it’s time to start coding.
First we need to fetch content from a website. In PHP we can do this using cURL or file_get_contents.
Downloading the source code from a website
You’ll need a text editor. For this project I picked Sublime Text. Save the following code as scraper.php:
<?php
$content = file_get_contents('https://how-to-hack.net');
print_r($content);
?>
To test it we need the PHP command-line interpreter. Type:
php scraper.php
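cURL was mentioned as the other way to download a page. Here is a minimal sketch of what that could look like, assuming the PHP cURL extension is installed. The helper name fetch_page and the 10 second timeout are my own choices for illustration, not something the rest of this article depends on.

```php
<?php
// Hypothetical helper (not used elsewhere in this article): download a page with cURL.
function fetch_page(string $url): string
{
    $ch = curl_init($url);
    curl_setopt_array($ch, [
        CURLOPT_RETURNTRANSFER => true, // return the body instead of printing it
        CURLOPT_FOLLOWLOCATION => true, // follow HTTP redirects
        CURLOPT_TIMEOUT        => 10,   // give up after 10 seconds (arbitrary choice)
    ]);

    $body = curl_exec($ch);
    if ($body === false) {
        // curl_error() describes why the transfer failed
        throw new RuntimeException('Download failed: ' . curl_error($ch));
    }
    return $body;
}
```

Calling print_r(fetch_page('https://how-to-hack.net')); should then behave much like the file_get_contents version, with the added benefit that redirects and timeouts are under our control.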
Now we need to search the source code for our footprints and extract the version number. We can do this using a regular expression.
Regular expression
A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose.
https://regex101.com/ is a website which makes it easy to put together and test a regular expression.
To search for the text “WordPress x.x.x” and extract the version number from it we use the following regular expression:
WordPress[^"]+(\d+\.\d+\.\d+)
You can see the regular expression here: https://regex101.com/r/hW5oV0/1
preg_match_all is the PHP function we use to search and extract the version number.
<?php
$content = file_get_contents('https://how-to-hack.net');
// in PHP the pattern must start and end with a delimiter, here /
$footprint = '/WordPress[^"]+(\d+\.\d+\.\d+)/';
preg_match_all($footprint, $content, $m);
print_r($m);
?>
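If you want to try the pattern without hitting a live site, you can run it against a hard-coded string first. The sketch below assumes the page contains a generator meta tag like the sample one; the sample markup is made up for the test.

```php
<?php
// Sample markup (an assumption): the kind of generator tag we are looking for.
$content = '<meta name="generator" content="WordPress 4.1.1" />';

$footprint = '/WordPress[^"]+(\d+\.\d+\.\d+)/';
preg_match_all($footprint, $content, $m);

// $m[0] holds the full matches, $m[1] holds what the first capture group caught.
$version = $m[1][0] ?? null;

echo $version === null ? "No footprint found\n" : "Found version $version\n";
// prints "Found version 4.1.1"
```

The version number always ends up in $m[1][0], which is the element the rest of the tool relies on.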
Making our tool more generic
Our tool can already extract the version number, but let’s take it to the next level. We will instruct our tool to read its signatures, conditions and warning messages from a separate file, signatures.txt. This way we can add more signatures that are triggered by different conditions and that yield a user-friendly message.
{
  "vulnerabilities": [
    {
      "content": "/content=\"WooCommerce[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
      "condition": "version_compare($result, '2.0.14', '<')",
      "warning": "Outdated version of WooCommerce! ($result)"
    },
    {
      "content": "/content=\"WordPress[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
      "condition": "version_compare($result, '4.1.2', '<')",
      "warning": "\nOutdated version of WordPress! ($result)"
    }
  ]
}
Now we need to rewrite our web scraper so that it matches against the signature file and takes the URL from the command line, so we don’t have to edit the scraper for every site.
<?php
// get the URL from the command line
$content = file_get_contents($argv[1]);

// open signatures.txt (which is actually a JSON file) and convert it to an object
$fileobj = json_decode(file_get_contents("signatures.txt"));

// iterate over each footprint (WooCommerce & WordPress)
foreach ($fileobj->vulnerabilities as $record) {
    // extract the part to match against the source code
    preg_match_all($record->content, $content, $m);

    // if it's not found, continue with the next signature
    if (!isset($m[1][0])) continue;

    // save the extracted part (version number) into the $result variable
    $result = $m[1][0];

    // let PHP evaluate whether the condition matches;
    // if it does, print the warning message from the signature file.
    // eval() runs whatever the signature file contains, so only load files you trust.
    if (eval("return {$record->condition};"))
        eval("print \"{$record->warning}\";");
}
die("\n");
?>
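Because the patterns live inside JSON strings, every backslash and every inner double quote has to be escaped, and a single mistake makes json_decode return null. A quick way to catch that is a small check like the one below. It is only a sketch: the signature is inlined here so the example runs on its own, whereas in practice you would read it from your signature file.

```php
<?php
// A one-entry signature file, inlined as a nowdoc so PHP leaves the escaping alone.
$json = <<<'JSON'
{ "vulnerabilities": [ {
    "content": "/content=\"WordPress[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
    "condition": "version_compare($result, '4.1.2', '<')",
    "warning": "Outdated version of WordPress! ($result)"
} ] }
JSON;

$signatures = json_decode($json);
if ($signatures === null) {
    // json_last_error_msg() explains what the parser tripped over
    die('Invalid signature file: ' . json_last_error_msg() . "\n");
}

// After decoding, the JSON escapes are gone and a normal PHP pattern is left.
$pattern = $signatures->vulnerabilities[0]->content;
echo $pattern, "\n";
// prints /content="WordPress[^"]+(\d+\.\d+\.\d+)"/
```

If the printed pattern looks like something you could paste into regex101, the file is escaped correctly.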
php -f scraper.php https://how-to-hack.net
Note: The website is flagged as vulnerable because the signature flags any WordPress version older than 4.1.2, and the site reports version 4.1.1.
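To see why the condition fires, here is the same check run by hand with the detected version filled in. The second half is an extra illustration of my own showing why the signatures use version_compare rather than a plain string comparison.

```php
<?php
// The check from the WordPress signature, with the detected version filled in.
$result  = '4.1.1';
$flagged = version_compare($result, '4.1.2', '<');
var_dump($flagged); // bool(true): 4.1.1 is older than 4.1.2

// version_compare looks at each numeric component separately...
$byVersion = version_compare('4.10.0', '4.9.0', '<');
var_dump($byVersion); // bool(false): 10 is greater than 9

// ...while a plain string comparison goes character by character.
$byString = strcmp('4.10.0', '4.9.0') < 0;
var_dump($byString); // bool(true): "1" sorts before "9"
```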
Good job hacker!
Now test your tool in practice and, as always, remember to use a proxy and don’t do anything illegal!