Web scraping – part 2

In the previous article we identified the footprints to use and now it’s time to start coding.

First we need to fetch content from a website. In PHP we can do this with cURL or file_get_contents.

Downloading source code from a website

You’ll need a text editor. For this project I picked Sublime Text.

<?php
$content = file_get_contents('https://how-to-hack.net');
print_r( $content );
?>
Download source code in PHP

To test it we need the PHP command-line client. Save the code as scraper.php and type:

php scraper.php
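If you would rather use cURL, which was mentioned above as the alternative to file_get_contents, a rough sketch could look like this. The function name fetch_page and the 10 second timeout are my own assumptions, and it needs PHP's curl extension to be installed.

```php
<?php
// Hypothetical helper (not part of the original scraper):
// fetch a page with cURL and return its body, or null on failure.
function fetch_page(string $url): ?string
{
    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true); // return the body instead of printing it
    curl_setopt($ch, CURLOPT_FOLLOWLOCATION, true); // follow redirects
    curl_setopt($ch, CURLOPT_TIMEOUT, 10);          // arbitrary timeout in seconds
    $body = curl_exec($ch);
    return $body === false ? null : $body;
}

// Usage (performs a real request, so it is left commented out):
// print_r(fetch_page('https://how-to-hack.net'));
```

The main reason to pick cURL over file_get_contents is control: timeouts, redirects and headers are all set explicitly.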


Now we need to search for our footprints in the source code and extract the version number. We can do this with a regular expression.

Regular expression

A regular expression, often called a pattern, is an expression used to specify a set of strings required for a particular purpose.

Extracting the WordPress version from source code using regular expressions

https://regex101.com/ is a website which makes it easy to put together and test a regular expression.

To search for the text “WordPress x.x.x” and extract the version number from it we use the following regular expression:

WordPress[^"]+(\d+\.\d+\.\d+)

You can see the regular expression here: https://regex101.com/r/hW5oV0/1

preg_match_all is the PHP function we use to search and extract the version number.

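<?php
$content = file_get_contents('https://how-to-hack.net');
// PCRE patterns in PHP are wrapped in delimiters; here we start and end with /
$footprint = '/WordPress[^"]+(\d+\.\d+\.\d+)/';
// $m[0] receives the full matches, $m[1] the captured version numbers
preg_match_all($footprint, $content, $m);
print_r($m);
?>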
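To see what preg_match_all puts into $m without touching the network, here is a small sketch that runs the same pattern against a hard-coded string. The sample HTML and the helper name extract_versions are made up for illustration.

```php
<?php
// Hypothetical helper: return every version number captured by the footprint.
function extract_versions(string $html): array
{
    $footprint = '/WordPress[^"]+(\d+\.\d+\.\d+)/';
    // $m[0] holds the full matches, $m[1] holds what the (...) group captured.
    preg_match_all($footprint, $html, $m);
    return $m[1];
}

// Made-up input that resembles a WordPress generator tag.
$sample = '<meta name="generator" content="WordPress 4.1.1" />';
$versions = extract_versions($sample);
print_r($versions);
```

The version number ends up in the first capture group, which is why the scraper later reads it from $m[1][0].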

Making our tool more generic

Our tool can already extract the version number, but let’s take it to the next level. We will instruct our tool to read the signature, the condition and the warning message from a separate file (signatures.txt). This way we can add more signatures that are triggered by different conditions and that each yield a user-friendly message.


{ "vulnerabilities": [
  {
    "content": "/content=\"WooCommerce[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
    "condition": "version_compare($result, '2.0.14','<')",
    "warning": "Outdated version of WooCommerce! ($result)"
  },
  {
    "content": "/content=\"WordPress[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
    "condition": "version_compare($result, '4.1.2','<')",
    "warning": "\nOutdated version of WordPress! ($result)"
  }
]}
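Quotes and backslashes inside the patterns have to be escaped for JSON, and a single missing backslash makes json_decode return null. A minimal sketch of a sanity check, assuming the file layout above; load_signatures is a hypothetical helper that takes the JSON text as a string.

```php
<?php
// Hypothetical helper: decode the signature JSON and fail loudly if it is malformed.
function load_signatures(string $json): array
{
    $data = json_decode($json);
    if ($data === null || !isset($data->vulnerabilities) || !is_array($data->vulnerabilities)) {
        throw new RuntimeException('Invalid signature file: ' . json_last_error_msg());
    }
    foreach ($data->vulnerabilities as $record) {
        // Every record needs the three fields the scraper relies on.
        foreach (['content', 'condition', 'warning'] as $field) {
            if (!isset($record->$field)) {
                throw new RuntimeException("Signature is missing \"$field\"");
            }
        }
    }
    return $data->vulnerabilities;
}

// A nowdoc keeps the backslashes exactly as they would appear in the file.
$json = <<<'JSON'
{ "vulnerabilities": [
  {
    "content": "/content=\"WordPress[^\"]+(\\d+\\.\\d+\\.\\d+)\"/",
    "condition": "version_compare($result, '4.1.2','<')",
    "warning": "\nOutdated version of WordPress! ($result)"
  }
]}
JSON;

$signatures = load_signatures($json);
echo $signatures[0]->content, "\n";
```

After decoding, the pattern is back to the form PHP expects, with plain quotes and single backslashes.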

Now we need to rewrite our web scraper so that it matches against the signature file and so that we can pass the URL to the PHP code without editing our scraper.


<?php
// get the URL from the command line
if (!isset($argv[1])) {
    die("Usage: php scraper.php <url>\n");
}
$content = file_get_contents($argv[1]);

// open signatures.txt (which is actually a JSON file) and convert it to an object
$fileobj = json_decode(file_get_contents("signatures.txt"));

// iterate over each footprint (WooCommerce & WordPress)
foreach ($fileobj->vulnerabilities as $record) {

    // extract the part that matches the source code
    preg_match_all($record->content, $content, $m);

    // if it's not found, continue with the next signature
    if (!isset($m[1][0]))
        continue;

    // save the extracted part (version number) into the $result variable
    $result = $m[1][0];

    // let PHP evaluate whether the condition matches.
    // If it does, print the warning message from the signature file.
    // Note: eval() runs whatever is in signatures.txt, so only use a file you trust.
    if (eval("return {$record->condition};"))
        eval("print \"{$record->warning}\";");
}
die("\n");
?>
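Since eval will run any PHP that ends up in the signature file, one possible variation is to store plain data in the signature and let the scraper do the comparison itself. This is only a sketch of that idea: the fields "name" and "fixed_in" and the function check_signature are hypothetical and are not part of the signature file above.

```php
<?php
// Hypothetical eval-free check: the signature carries data, not PHP code.
function check_signature(array $signature, string $html): ?string
{
    if (!preg_match($signature['content'], $html, $m)) {
        return null; // footprint not present in the page
    }
    $version = $m[1];
    if (version_compare($version, $signature['fixed_in'], '<')) {
        return "Outdated version of {$signature['name']}! ($version)";
    }
    return null; // version is at or above the threshold
}

// Made-up signature and page content for illustration.
$signature = [
    'name'     => 'WordPress',
    'content'  => '/content="WordPress[^"]+(\d+\.\d+\.\d+)"/',
    'fixed_in' => '4.1.2',
];
$html = '<meta name="generator" content="WordPress 4.1.1" />';

$warning = check_signature($signature, $html);
echo $warning, "\n";
```

The trade-off is flexibility: the eval version can express any condition, while this one only supports "lower than a given version".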

The final result of our web scraper in PHP

php -f scraper.php https://how-to-hack.net

Note: The website is flagged as vulnerable because the signature flags any WordPress version lower than 4.1.2, and the version detected on the site is 4.1.1.
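The comparison behind that note can be tried on its own. A quick sketch using version_compare with the same '<' operator as the signature file; the variable names are only for illustration.

```php
<?php
// version_compare() compares dotted version strings one segment at a time.
$outdated = version_compare('4.1.1', '4.1.2', '<');  // detected version vs. signature threshold
$current  = version_compare('4.1.2', '4.1.2', '<');  // an equal version is not "lower than"
$numeric  = version_compare('4.9.0', '4.10.0', '<'); // segments compare as numbers: 9 < 10

var_dump($outdated, $current, $numeric);
```

Only the first and third comparisons are true, so a site already on 4.1.2 would not trigger the warning.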

Good job, hacker!

Now test your tool in practice and, as always, remember to use a proxy and don’t do anything illegal!
