m8ta
{568} | ||
My friend wanted to grab a whole bunch of data from a historical database site, http://www.a2a.org.uk/ . He was paying a person to manually copy all the records into Excel (for a rather low fee, too, $150), and the guy was having a problem entering the rupee amounts, since in this historical data currency is denominated by Rs xx-yy-z, where xx is rupees, yy is sixteenths of a rupee, and z is 1/64ths of a rupee (cool base 2 system for currency, no?). Currency detail aside, I told him that I could easily write a script to screen-scrape this data and export it to a CSV file. For reference, here it is:

#!/usr/bin/perl
$narg = $#ARGV + 1;
if( $narg != 1 ){
    print "please specify the file to read\n";
}else{
    $source = $ARGV[0];
    local( $/, *FH );
    open(FH, $source) or die "cannot open $source: $!\n";
    $/ = "FILE";
    @j = <FH>; # slurp entire file, split on 'FILE'
    close FH;
    #print "num\tcase\tplaintiff\tdefendant\tior\tdatestart\tdateend\tclaim\tr1\tr2\tr3\tcountry1\tcountry2\n";
    foreach $l (@j){
        # try to match the line..
        # match must be robust, as some of the records are incomplete.
        my $case = "";
        my $plaintiff = "";
        my $defendant = "";
        my $num = "";
        my $ior = "";
        my $datestart = "";
        my $dateend = "";
        my $claim = "";
        my $r1 = "0";
        my $r2 = "0";
        my $r3 = "0";
        my $cont1 = "";
        my $cont2 = "";
        $l =~ s/[\n\t]/ /g; # remove newlines & tabs.
        if($l =~ /Case\s([\d\/]+)([^<]+)/){
            $num = $1;
            $case = $2;
            $case =~ s/^: //; # remove leading colon space.
            if($case =~ /\(([^\)]+)\)[^\(]+\(([^\)]+)\)/ ){
                $cont1 = $1;
                $cont2 = $2;
            }
            # remove the countries (in parentheses)
            $case =~ s/\([\w ]+\)//g;
            # split "plaintiff v defendant" or "plaintiff v. defendant"
            if($case =~ /(.+(?= v )) v (.+)/){
                $plaintiff = $1;
                $defendant = $2;
            }
            if($case =~ /(.+(?= v\. )) v\. (.+)/){
                $plaintiff = $1;
                $defendant = $2;
            }
        }
        if($l =~ /IOR([^<]+)/){
            $ior = $1;
        }
        # a date is either a range (start-end) or a single value
        if($l =~ /date: <\/b>([^-]+)-([^<]+)<\/font>/){
            $datestart = $1;
            $dateend = $2;
        }elsif($l =~ /date: <\/b>([^<]+)<\/font>/){
            $datestart = $1;
        }
        if($l =~ /Claim<\/span>([^<]+)/ ){
            $claim = $1;
            $claim =~ s/(\d),(\d)/$1$2/g; # remove commas from numbers.
            # try the full Rs xx-yy-z form first, then the shorter ones
            if($claim =~ /Rs\.* (\d+)[-\.](\d+)[-\.](\d+)/i){
                $r1 = $1;
                $r2 = $2;
                $r3 = $3;
            }elsif($claim =~ /Rs\.* (\d+)[-\.](\d+)/i){
                $r1 = $1;
                $r2 = $2;
            }elsif($claim =~ /Rs\.* (\d+)/i){
                $r1 = $1;
            }
        }
        if($num ne ""){
            print "$num\t$case\t$plaintiff\t$defendant\t$ior\t$datestart\t$dateend\t$claim\t$r1\t$r2\t$r3\t$cont1\t$cont2\n";
        }
    }
}

Run it with the output redirected via > , e.g. ./a2a_extract.pl document.html > out1.csv , where document.html is the page saved from the web browser. Note that the fields are separated by tabs, despite the .csv extension.
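The script leaves the three currency fields as separate columns. If a single number is more convenient in the spreadsheet, something like the following could combine them. This is only a sketch: it assumes the denominations described above (yy in 1/16 rupee, z in 1/64 rupee), and the helper names parse_claim and rupees_to_decimal are made up for illustration, not part of the original script.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: pull the up-to-three numeric fields out of a claim
# string, using the same "Rs xx-yy-z" pattern the extraction script matches.
sub parse_claim {
    my ($claim) = @_;
    $claim =~ s/(\d),(\d)/$1$2/g;    # remove commas from numbers
    my ($r1, $r2, $r3) = (0, 0, 0);
    if ($claim =~ /Rs\.* (\d+)(?:[-.](\d+))?(?:[-.](\d+))?/i) {
        # missing fields are undef after the match; default them to 0
        ($r1, $r2, $r3) = ($1, $2 // 0, $3 // 0);
    }
    return ($r1, $r2, $r3);
}

# Hypothetical helper: combine the fields into decimal rupees, ASSUMING
# the second field counts 1/16 rupee and the third counts 1/64 rupee.
sub rupees_to_decimal {
    my ($r1, $r2, $r3) = @_;
    return $r1 + $r2 / 16 + $r3 / 64;
}

my ($r1, $r2, $r3) = parse_claim("Rs 1,250-8-2");
printf "%.5f\n", rupees_to_decimal($r1, $r2, $r3);    # prints 1250.53125
```

Since both divisors are powers of two, the result is exact in floating point, so the conversion does not introduce rounding error of its own.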