Wednesday, February 15, 2012

Parallel Programming in Perl Scripts

-I don't promote piracy; this post should be understood from a programming point of view-

So, recently I read an article about a certain person who copied every magnet link on The Pirate Bay into a single compressed file. Obviously it's just the magnet links, not the files themselves, and apparently he did it in case The Pirate Bay was shut down or something of that nature (we probably all know what happened to Megaupload; if not, you can read this paper).

I had never heard of magnet links before, but it turns out that the whole collection of Pirate Bay magnet links fits on a simple USB drive: 164MB uncompressed, or 90MB compressed, which is really small. An example of one entry of the dump is the following:

7015954|Ubuntu 11.10 Alternate 64-bit|707047424|2|5|5316391aed813d4283178dce2b95c8ad56c5be72


  • 7015954 is the ID that The Pirate Bay uses for the torrent
  • Ubuntu 11.10 Alternate 64-bit is the name of the torrent
  • 707047424 is the size of the file in bytes
  • 2 is the number of seeders at the time of the snapshot
  • 5 is the number of leechers
  • 5316391aed813d4283178dce2b95c8ad56c5be72 is the info hash that goes into the magnet link (see the sketch right after this list)
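
Since the dump only stores the hash and not a full link, here is a minimal sketch of my own (not part of allisfine's script) that splits one of these lines and rebuilds a usable magnet URI from it; magnet:?xt=urn:btih: is the standard BitTorrent prefix, and dn is just the display name:

    use strict;
    use warnings;
    use 5.010;

    # One line of the dump (the example shown above).
    my $entry = '7015954|Ubuntu 11.10 Alternate 64-bit|707047424|2|5|5316391aed813d4283178dce2b95c8ad56c5be72';

    # Split the pipe-separated fields.
    my ($id, $title, $size, $seeders, $leechers, $hash) = split /\|/, $entry;

    # Percent-encode the title so it can be used as the dn= parameter.
    (my $dn = $title) =~ s/([^A-Za-z0-9._~-])/sprintf("%%%02X", ord($1))/ge;

    # Rebuild a magnet URI from the info hash and the display name.
    my $magnet = "magnet:?xt=urn:btih:$hash&dn=$dn";

    say "$title ($size bytes, $seeders seeders / $leechers leechers)";
    say $magnet;

A torrent client can open that URI directly and fetch the rest of the metadata from the swarm, which is why a 90MB dump of hashes is enough to recover the whole index.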
Now, that wasn't exactly what I wanted to talk about, but it was necessary in order to understand what comes next. A user named "allisfine" made a Perl script to copy the information of (almost) all the magnet links on The Pirate Bay into one single file, which contained around 1,643,194 torrents in just 90MB. Later it was updated to about 7 million links, weighing 560MB.

That is what I want to talk about: the Perl script, which got my attention after I saw some lines in it regarding parallel programming. And this makes sense: if "allisfine" had written a script that copied the links one by one, it would take a huge amount of time, but using parallel programming several links can be fetched almost simultaneously.

The script comes from http://pastebin.com/8RXXthXB, and I'll go through it piece by piece.

It contains some lines related to parallel programming. I'm not going to try to explain all of it, because I don't have any experience with Perl, but I will go over it superficially, hoping to understand it a little more deeply so that I can someday write my own Perl scripts.

    use warnings;
    use strict;
    use Parallel::ForkManager;
    use Fcntl qw(:flock :seek);   # not visible in this excerpt, but needed for the LOCK_EX/LOCK_UN/SEEK_END constants used further down
    use 5.010;

These lines load pragmas and modules; I would think of them as imports. The use 5.010 line asks for at least Perl 5.10, which is what enables the say function used later, and Parallel::ForkManager is the module that does the parallel work.
my $pm=new Parallel::ForkManager(50);
This line creates a fork manager, which is what lets a single Perl script run tasks in parallel. It is especially well suited to performing a number of repetitive operations, especially when working on a multiprocessor machine. The parameter, 50, is the maximum number of child processes allowed to run at the same time.
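
Just to get a feel for how Parallel::ForkManager is normally used before digging into the script, here is a minimal sketch of my own (not part of allisfine's code) showing the usual start / finish / wait_all_children pattern:

    use strict;
    use warnings;
    use 5.010;
    use Parallel::ForkManager;

    # Allow up to 5 children at a time (the script uses 50).
    my $pm = Parallel::ForkManager->new(5);

    for my $job (1 .. 20) {
        # start() forks; in the parent it returns the child's PID (true),
        # so "and next" makes the parent move on to the next job
        # while the newly forked child keeps executing below.
        $pm->start and next;

        # --- child process ---
        say "child $$ handling job $job";
        sleep 1;    # pretend this is slow work, like downloading a page

        # finish() exits the child; without it the child would keep looping too.
        $pm->finish(0);
    }

    # The parent blocks here until every child has exited.
    $pm->wait_all_children;

The whole point is that up to 50 downloads can be in flight at once instead of one after the other.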

    $pm->run_on_finish(sub{
        # The sixth argument is the data structure the child passed to finish().
        my (undef, undef, undef, undef, undef, $res_ref) = @_;
        my ($res, $line) = @$res_ref;
        if ($res == 1) {
            # Open "outf" in append mode, take an exclusive lock,
            # write the scraped line at the end, and release the lock.
            open my $outf, ">>", "outf";
            flock($outf, LOCK_EX) or next;
            seek($outf, 0, SEEK_END);
            say $outf $line;
            say $line;
            flock($outf, LOCK_UN);
        }
    });
I found that the run_on_finish method makes this block of code get called (in the parent) every time one of the child processes finishes, and that its last argument is the data structure the child passed to finish(), which is how $res and $line travel back from the children. I don't particularly understand much of the rest of the lines yet.
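
To see that handoff in isolation, here is another tiny sketch of my own (not from the script) where each child ships a result back through finish() and the parent collects it in run_on_finish:

    use strict;
    use warnings;
    use 5.010;
    use Parallel::ForkManager;

    my $pm = Parallel::ForkManager->new(3);

    # Runs in the parent each time a child exits; the sixth argument
    # is whatever reference the child handed to finish().
    $pm->run_on_finish(sub {
        my ($pid, $exit_code, $ident, $signal, $core_dump, $data_ref) = @_;
        say "child $pid sent back: $$data_ref";
    });

    for my $n (1 .. 5) {
        $pm->start and next;              # parent skips ahead, child continues
        my $result = "the square of $n is " . $n * $n;
        $pm->finish(0, \$result);         # exit the child, shipping $result to the parent
    }

    $pm->wait_all_children;

In the real script the reference is an array reference, [$res, $line], instead of a scalar reference, but the mechanism is the same.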

    my $i = 1;
    while (1) {
        $i++;
        # Fork: the parent jumps to the next ID, the child handles page $i.
        $pm->start and next;
        my $res;
        my $page = "";
        # Fetch the torrent page with curl, retrying until the response contains real HTML.
        $page = `curl -s http://thepiratebay.se/torrent/$i -m 120`
            while ($page !~ /<!DOCTYPE html/);
        my $line = "";
        if ($page =~ m{<title>Not Found}) {
            $res = 0;    # no torrent with this ID
        } else {
            $res = 1;
            # Scrape the fields out of the HTML with regular expressions.
            my ($title)    = $page =~ /<div id="title">\s*(.*?)\s*<\/div>/s;
            my ($size)     = $page =~ /<dt>Size:<\/dt>\s*<dd>.*?\((\d*)&nbsp;Bytes\)<\/dd>/s;
            my ($seeders)  = $page =~ /<dt>Seeders:<\/dt>\s*<dd>(\d*)<\/dd>/;
            my ($leechers) = $page =~ /<dt>Leechers:<\/dt>\s*<dd>(\d*)<\/dd>/;
            my ($magnet)   = $page =~ /magnet:\?xt=urn:btih:(.*?)(&|")/;
            $line = $i."|".$title."|".$size."|".$seeders."|".$leechers."|".$magnet;
        }
        # Exit the child, sending $res and $line back to run_on_finish.
        $pm->finish(0,[$res, $line]);
    }
This is the main part of the code. It defines a simple counter, $i, used to move from torrent page to torrent page starting from the lowest IDs, and uses curl to download the HTML of each page. Then it checks whether the page exists: if it does, it extracts the title, size, seeders, leechers, and magnet hash from the HTML with regular expressions and joins them into a single variable, $line. Finally it finishes the child process, returning two values: $res, which tells whether the page was found, and $line, the line with the information, which the run_on_finish callback then appends to the output file.
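
To convince myself that those regular expressions capture what I think they capture, here is a small standalone test; the $page fragment is made up by me to imitate the markup the script expects, it is not a real Pirate Bay page:

    use strict;
    use warnings;
    use 5.010;

    # A fake fragment shaped like the markup the script's regexes look for.
    my $page =
        '<div id="title"> Ubuntu 11.10 Alternate 64-bit </div>' . "\n"
      . '<dt>Size:</dt> <dd>674.29&nbsp;MiB&nbsp;(707047424&nbsp;Bytes)</dd>' . "\n"
      . '<dt>Seeders:</dt> <dd>2</dd>' . "\n"
      . '<dt>Leechers:</dt> <dd>5</dd>' . "\n"
      . '<a href="magnet:?xt=urn:btih:5316391aed813d4283178dce2b95c8ad56c5be72&dn=ubuntu">get</a>';

    # The exact same patterns used in the script.
    my ($title)    = $page =~ /<div id="title">\s*(.*?)\s*<\/div>/s;
    my ($size)     = $page =~ /<dt>Size:<\/dt>\s*<dd>.*?\((\d*)&nbsp;Bytes\)<\/dd>/s;
    my ($seeders)  = $page =~ /<dt>Seeders:<\/dt>\s*<dd>(\d*)<\/dd>/;
    my ($leechers) = $page =~ /<dt>Leechers:<\/dt>\s*<dd>(\d*)<\/dd>/;
    my ($magnet)   = $page =~ /magnet:\?xt=urn:btih:(.*?)(&|")/;

    say join "|", 7015954, $title, $size, $seeders, $leechers, $magnet;
    # prints: 7015954|Ubuntu 11.10 Alternate 64-bit|707047424|2|5|5316391aed813d4283178dce2b95c8ad56c5be72

The printed line has the same format as the dump entry shown at the beginning of the post.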


References:


https://thepiratebay.se/torrent/7016365
http://www.urgente24.com/195225-la-ciberguerra-en-la-web-hace-mutar-a-pirate-bay?pagination=1
http://www.perlmonks.org/?node_id=291446
