Wednesday, December 3, 2008

Extract all email addresses from your GMail IMAP

I wanted to extract all the email addresses in my GMail. Sadly, GMail does not have this facility; it only lets you export the addresses you have written emails to. So I tried http://vallery.net/gmail/ about a week ago, but I am yet to get the results! So I decided to get my hands dirty over the weekend.

I tried some Perl modules, but the installation was not clean (it needs 'make', some tests failed, and what not!). So I moved to PHP. I installed php5-cli and php5-imap on my Ubuntu box, and added the line
extension=imap.so
to /etc/php5/cli/php.ini

And then it was just a matter of playing around with the API described at http://in2.php.net/imap

This script scans all the emails in the 'All Mail' label of GMail (which includes all the emails in your account, even archived ones, but not the Spam and Trash labels). For each mail, it extracts the TO, CC, BCC, etc. fields (all those fields which may contain an email address) and prints the output on the screen in the following format:


SENDER& <timestamp>& <email_address>& <name>
TO& <timestamp>& <email_address>& <name>
REPLY_TO& <timestamp>& <email_address>& <name>


The & is the separator between the fields. The first field shows which part of the email the address was extracted from. The second field shows the timestamp of the email message. The third field shows the email address, and the last field shows the name associated with that address.

For example, if the email was sent like:


From: Bob Lee <bob.lee@ggmail.com>
To: Jack Lee <lee.jack@ggmail.com>


then, the extracted emails will look like:


TO& Sun, January 23, 2008& lee.jack@ggmail.com& Jack Lee
FROM& Sun, January 23, 2008& bob.lee@ggmail.com& Bob Lee


Note that the standards are quite flexible, so only the first two fields are guaranteed to be present; the rest of the fields can be empty.

You can take the output of this script and process it any way you wish. You can use the output to determine who you have talked to most, or how frequently you conversed with a person, etc. For example, I run the output file through a series of tools to get all unique addresses like so:


cat extracted_emails.csv | cut -d '&' -f 3 | grep @ | sort | uniq > all_emails_sort_uniq.txt
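
As another example of post-processing, here is a minimal sketch that ranks addresses by how often they appear, as a rough answer to "who have I talked to most". The function name top_addresses is my own invention for illustration, and it assumes the '& '-separated format shown above. Lower-casing the addresses before counting is a design choice, so that Bob@x.com and bob@x.com count as one.

```shell
#!/bin/sh
# top_addresses FILE [N]: print the N most frequent addresses in FILE.
# Hypothetical helper; assumes lines look like "TO& <date>& <addr>& <name>".
top_addresses() {
    file=$1
    count=${2:-10}
    # Field 3 is the address; strip the blanks around it, drop empty
    # fields (no '@'), fold case, then count and sort by frequency.
    cut -d '&' -f 3 "$file" \
        | sed -e 's/^ *//' -e 's/ *$//' \
        | grep @ \
        | tr 'A-Z' 'a-z' \
        | sort | uniq -c | sort -rn \
        | head -n "$count"
}
```

Called as top_addresses extracted_emails.csv 20, it would print the twenty most frequent addresses, each preceded by its count.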


The script can be customized for other IMAP providers/accounts/folders by changing the first 6 lines of code in the script (after the license header).

Caveats:
.) You should have IMAP enabled in your GMail > Settings > Forwarding and POP/IMAP section.
.) If the last two lines of the output seem like:


Warning: imap_headerinfo(): Bad message number in mine_emails_addrs_from_imap.php on line 31
empty header found at 18109


that means the extraction is complete.

But if you see only the 'empty header...' line, that means the connection was broken, or something else happened, and the extraction was not completed. You need to pick the number in the last line (18109 in this case) and provide that to the script as its only argument, so the script will start from that message and not redo the whole thing (which may cause it to fail again somewhere). You need to repeat this until you see the Warning message in the last-but-one line.
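
If you save the script's output to a file, picking out the restart number can be automated. The following is only a sketch: resume_point is a hypothetical helper name, and it assumes the PHP warning ends up in the same saved file as the rest of the output (as it does in the listing above).

```shell
#!/bin/sh
# resume_point LOGFILE: print the message number to restart from, or
# print nothing if the run finished (the 'Bad message number' warning
# is on the last-but-one line). Hypothetical helper.
resume_point() {
    log=$1
    last=$(tail -n 1 "$log")
    prev=$(tail -n 2 "$log" | head -n 1)
    case $prev in
        *"Bad message number"*) return 0 ;;   # extraction completed
    esac
    # Print the number only if the last line is the 'empty header' notice.
    printf '%s\n' "$last" \
        | sed -n 's/^empty header found at \([0-9][0-9]*\)$/\1/p'
}
```

An empty result means there is nothing left to do; otherwise the printed number is what you pass to the script on the next run.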

Using this script, I am thinking of providing a service similar to vallery.net's, but with more transparency and better response times. Let me know if you really need it.

Finally, here's the script:


#!/usr/bin/php
<?php
# Distributed under GPLv3 License, as published on
# http://www.opensource.org/licenses/gpl-3.0.html
# with the following substitutions:
#
# <AUTHOR> = Gurjeet Singh
# <YEAR> = 2008

$options = '/imap/ssl/novalidate-cert';

$user = 'singh.gurjeet@ggmail.com';
$password = 'xxxxxxxxxx';

$mailbox_string = '{imap.gmail.com:993/imap/ssl/novalidate-cert}[Gmail]/All Mail';

echo "Connecting...\n";

$mbox = imap_open ( $mailbox_string, $user, $password )
or die("can't connect: " . imap_last_error());

echo "Fetching headers...\n";

if( $_SERVER["argc"] > 1 )
{
$i = $_SERVER["argv"][1];
}
else
{
$i = 0;
}

for( ; ; ++$i )
{
#if( $i % 100 == 0 ){ sleep( 1 ); }

$h = imap_headerinfo( $mbox, $i + 1 );

if( empty( $h ) )
{
echo "empty header found at $i\n";
break;
}

for( $j = 0; isset( $h->to ) && $j < count( $h->to ); ++$j )
{
echo 'TO& ' . $h->date . '& ' . $h->to[$j]->mailbox . '@' . $h->to[$j]->host . '& ' . $h->to[$j]->personal . "\n";
}

for( $j = 0; isset( $h->from ) && $j < count( $h->from ); ++$j )
{
echo 'FROM& ' . $h->date . '& ' . $h->from[$j]->mailbox . '@' . $h->from[$j]->host . '& ' . $h->from[$j]->personal . "\n";
}

for( $j = 0; isset( $h->cc ) && $j < count( $h->cc ); ++$j )
{
echo 'CC& ' . $h->date . '& ' . $h->cc[$j]->mailbox . '@' . $h->cc[$j]->host . '& ' . $h->cc[$j]->personal . "\n";
}

for( $j = 0; isset( $h->bcc ) && $j < count( $h->bcc ); ++$j )
{
echo 'BCC& ' . $h->date . '& ' . $h->bcc[$j]->mailbox . '@' . $h->bcc[$j]->host . '& ' . $h->bcc[$j]->personal . "\n";
}

for( $j = 0; isset( $h->reply_to ) && $j < count( $h->reply_to ); ++$j )
{
echo 'REPLY_TO& ' . $h->date . '& ' . $h->reply_to[$j]->mailbox . '@' . $h->reply_to[$j]->host . '& ' . $h->reply_to[$j]->personal . "\n";
}

for( $j = 0; isset( $h->sender ) && $j < count( $h->sender ); ++$j )
{
echo 'SENDER& ' . $h->date . '& ' . $h->sender[$j]->mailbox . '@' . $h->sender[$j]->host . '& ' . $h->sender[$j]->personal . "\n";
}

for( $j = 0; isset( $h->return_path ) && $j < count( $h->return_path ); ++$j )
{
echo 'RETURN_PATH& ' . $h->date . '& ' . $h->return_path[$j]->mailbox . '@' . $h->return_path[$j]->host . '& ' . $h->return_path[$j]->personal . "\n";
}

}

if( !imap_close( $mbox ) )
{
echo "imap_close() failed: " . imap_last_error() . "\n";
}
?>

Sunday, August 10, 2008

Difference between Standard and Scientific views of Microsoft Windows Calculator

I was trying to convert an IP address to an integer, so formulated the following equation to do it:

66 * 16777216 + 226 * 65536 + 18 * 256 + 71
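
The same formula can be wrapped in a small shell function. This is a minimal sketch: ip_to_int is a name I made up, there is no input validation, and it assumes the shell does arithmetic in at least 64 bits (octets with leading zeros would be read as octal).

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the dotted-quad address as a single integer,
# i.e. A*2^24 + B*2^16 + C*2^8 + D. Hypothetical helper.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1          # split the argument on dots into $1..$4
    echo $(( $1 * 16777216 + $2 * 65536 + $3 * 256 + $4 ))
)

ip_to_int 66.226.18.71    # prints 1122112071
```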


I had relied on this formula in the past to suggest solutions to a customer, which means I trusted it to do its job!

I used the standard 'calc' utility from friendly Vista to do the calculation, and what answer did it return?!

18577352254558791


At first sight I could tell that this was not in the 32-bit integer range! If that was the case, and this formula was wrong, the customer's app was screwed! And, implicitly, so was I!

I started checking, and cross-checking the calculation. First I turned to the Postgres database I had implemented this formula in:

select 66 * 16777216 + 226 * 65536 + 18 * 256 + 71;


The answer was 1122112071, which matched the customer's old (slow) function's result.

Then, after a few minutes of research, I figured out that calc's Scientific mode was producing the correct result and the Standard mode was not.

The difference, apparently, is because of the way the two modes handle operators. The Standard mode applies each operator as soon as it is entered, so it interprets the equation as:

((((((66 * 16777216 ) + 226 ) * 65536 ) + 18 ) * 256 ) + 71 )


and the Scientific mode, which honours operator precedence, interprets it as:

((66 * 16777216) + (226 * 65536) + (18 * 256) + 71 )
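
Both groupings can be checked with plain shell arithmetic. A small sketch, assuming the shell's $(( )) works with 64-bit integers (the first result does not fit in 32 bits):

```shell
#!/bin/sh
# Left to right, applying each operator immediately (Standard view).
standard=$(( ((((((66 * 16777216) + 226) * 65536) + 18) * 256) + 71) ))

# Usual precedence: multiplication before addition (Scientific view).
scientific=$(( 66 * 16777216 + 226 * 65536 + 18 * 256 + 71 ))

echo "standard:   $standard"      # 18577352254558791
echo "scientific: $scientific"    # 1122112071
```

The first value is exactly the number calc showed me, and the second matches the Postgres result.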


I tried the same expression with MinGW's 'expr' utility too and got the correct answer.

Must I say I was relieved to find that calc was crazy in "Standard" mode!!!

Wednesday, May 28, 2008

Run shell commands in parallel

As I said in my previous post, I am getting rid of my Ubuntu in a VM (Gutsy Gibbon running inside VirtualBox), so I am posting another script that I think will be useful in some situations.

Here's a little background. At the place where I am consulting (Hi5.com), we need to perform rsync on a huge directory tree. And since we want this operation to be as fast as possible, the first measure the guys there took was to use the rsync protocol, and not rsync-over-ssh; that's a great speed booster.

Next, they (actually, Kenny Gorman) devised three scripts that we need to run one after another: one to generate a list of all files in the directory we want to copy, a second to split that list into 4 equal pieces, and a third to actually run these 4 pieces (batches) in the background, in parallel.

The problem with this approach is that some batches finish quickly, because the files those batches are rsyncing are smaller than the files that other batches are working on. The result: we start with 4 parallel rsync commands, but somewhere down the line only one or two of them are running. We lose parallelism quite quickly, and end up waiting for the batch(es) containing large files, which process their files in sequential order.

So, I got to work trying to parallelize a bunch of commands that are placed in a file. This script reads lines from its standard input stream (stdin) and executes those lines using the shell. At any time, it runs at most a specified number of commands. As soon as one of the running commands finishes, the script reads the next line from stdin and executes that.

I have also added the ability to change the degree of parallelism while this script is running. Just create a file named 'degree' in /tmp/parallel.$PID/ and put a number in there, denoting the new degree of parallelism. This is quite useful for tweaking the degree of parallelism depending on your system load.
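
Changing the degree by hand is just an echo into that file, but here is a small sketch of a helper that does it a bit more carefully. The name set_degree and the optional third argument are my own additions for illustration; the PID is the one the script prints at startup.

```shell
#!/bin/sh
# set_degree PID N [BASE]: ask a running parallel.sh (process id PID) to
# switch to N concurrent commands. BASE defaults to /tmp and exists only
# so the helper can be tried against a scratch directory.
set_degree() {
    dir=${3:-/tmp}/parallel.$1
    [ -d "$dir" ] || { echo "no such run directory: $dir" >&2; return 1; }
    # Write to a temporary name first, then rename, so the script never
    # picks up a half-written file.
    echo "$2" > "$dir/degree.tmp" && mv "$dir/degree.tmp" "$dir/degree"
}
```

For example, set_degree 12345 8 would raise a run started as PID 12345 to 8 parallel commands the next time it looks for a free slot.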

I have made no special efforts in redirecting the stdin/stdout/stderr of the commands that are read and executed by this script. So, if you wish to record the progress of this script, or wish to store away your commands' output, just redirect this script's streams and save them.

An example usage of this script is to remove all the files under a directory, in parallel (although it is a very bad example for such a simple task):


find /home/gurjeet/dev/postgres -type f | sed -e 's/.*/rm "&"/' > tmp.txt
cat tmp.txt | parallel.sh


Here's the script:

#!/bin/bash
# This script is licensed under GPL 2.0 license.

# This script uses some special features (look for 'wait' command)
# provided by Bash shell.

# get my pid
mypid=$$;

# determine a dir/ where I will keep my running info
MYDIR=/tmp/parallel.$mypid;

# echo my pid for the logs
echo PARALLEL: pid: $mypid;

# remove the directory/file if it is left over from a previous run
if [ -e $MYDIR ] ; then
rm -r $MYDIR
fi

# make my dir/
mkdir $MYDIR

# determine the degree of parallelization
degree=$1;

# default degree of parallelism, if not specified on command line
if [ "X$degree" = "X" ] ; then
degree=2;
fi

# echo for logs
echo PARALLEL: Degree of parallelism: $degree;

# read each line from stdin and process it

while read line ;
do

while [ true ]; do

# re-adjust degree of parallelization communicated through this file
new_degree=0
if [ -f $MYDIR/degree ] ; then
new_degree=`cat $MYDIR/degree`
rm $MYDIR/degree
fi

# apply the new degree only if it is a positive integer
if [ "$new_degree" -gt 0 ] 2>/dev/null ; then
degree=$new_degree;
fi

# Look for a free slot
for (( i = 0 ; i < $degree ; ++i )) ; do
if [ ! -e $MYDIR/parallel.$i ]; then
break
fi
done

if [ $i -lt $degree ]; then
break
fi

# if can't find any free slot, repeat after a sleep of 1 sec
sleep 1;

done

# occupy this slot
( # echo PARALLEL: touching $MYDIR/parallel.$i;
touch $MYDIR/parallel.$i )

# perform the task in background, and free the slot when done
( echo PARALLEL: $degree $mypid;
sh -c "$line";
# echo PARALLEL: removing $MYDIR/parallel.$i;
rm $MYDIR/parallel.$i ) &
done

# Wait for all child processes to finish
wait;

# echo PARALLEL: removing base dir;
rm -r $MYDIR;

Restart Ubuntu's Wireless Network Driver (script)

I have admitted on more than one occasion that I am a Windows fan; yes, even after using Vista! But when I got my new laptop, on which I installed Vista Business on my own, I tried to push myself into using Ubuntu; I'll leave blogging about that experience for some other post (on my RNFs). That was a long time ago (2 months to be precise) and this post is about something else.

I encountered too many network disconnections on Ubuntu. I noticed that the wireless indicator on my laptop would just go off after using Ubuntu for a while. The only work-around I had to bring the connection back was to restart the OS! As I was very committed to using Ubuntu at any cost, I dug around the internet and found some clues. A little while later I developed this script.

This script uses the utility that is installed with the Intel (restricted) wireless driver to check whether the driver is still running; if it is not, the script starts it, and if it is already running, the script kills and restarts it. It worked like magic for me for the week that I used Ubuntu after this.


$ cat restart_network_driver.sh

#!/bin/sh
# This code is in public domain, under GPL 2.0 license
if ipw3945d-2.6.22-14-generic --isrunning; then
echo killing \
&& ipw3945d-2.6.22-14-generic --kill \
&& echo restarting \
&& ipw3945d-2.6.22-14-generic --quiet \
&& ipw3945d-2.6.22-14-generic --isrunning; \
else
echo starting \
&& ipw3945d-2.6.22-14-generic --quiet \
&& ipw3945d-2.6.22-14-generic --isrunning;
fi


Here's what I was using:
OS: Ubuntu 7.10 (Gutsy Gibbon)
Laptop: Thinkpad R61i
Wireless: Intel ipw3945d (restricted) driver

and here's how to use it:

sudo ./restart_network_driver.sh


PS: I am posting it now because I am going to give Linux another shot, this time with Hardy Heron, and I wanted some place to save this script before I wipe out that partition.

Monday, May 26, 2008

ts: the timestamping script

So I finally got around to implementing one of my ideas (which I don't get to do very often!). The idea was posted here: http://gurjeet-rnf.blogspot.com/2008/05/ts.html

I first thought of implementing it in C, using the time-tested code from the Postgres sources. I wanted C for performance reasons, but then it looked a bit complex to extract PG's code and make it work independently.

So I cooked up a simple shell script that uses the standard 'date' command to get us what we want. Here it is:


$ cat ts.sh

#!/bin/sh
while IFS= read -r line; do
echo "`date`: $line"
done


And here's a sample run, but first the script I used to test:


$ cat del.sh

#!/bin/sh
while true ; do echo gurjeet singh; sleep 1; done



And the sample run:

$ ./del.sh | ./ts.sh

Mon May 26 19:16:56 IST 2008: gurjeet singh
Mon May 26 19:16:58 IST 2008: gurjeet singh
Mon May 26 19:16:59 IST 2008: gurjeet singh
Mon May 26 19:17:00 IST 2008: gurjeet singh
Mon May 26 19:17:01 IST 2008: gurjeet singh
Mon May 26 19:17:02 IST 2008: gurjeet singh
Mon May 26 19:17:03 IST 2008: gurjeet singh
Mon May 26 19:17:04 IST 2008: gurjeet singh
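
If you want a different timestamp layout, the same loop can take a date(1) format string. This is just a sketch of a variation, not a replacement for ts.sh: the function form, the optional FORMAT argument, and the default format are all my own choices.

```shell
#!/bin/sh
# ts [FORMAT]: prefix each line read from stdin with the current time.
# FORMAT is a date(1) format string; the default is an arbitrary choice.
ts() {
    fmt=${1:-"%Y-%m-%d %H:%M:%S"}
    # IFS= and -r keep leading blanks and backslashes in the line intact.
    while IFS= read -r line; do
        printf '%s: %s\n' "$(date +"$fmt")" "$line"
    done
}
```

For example, ./del.sh | ts '%H:%M:%S' would print only the time of day in front of each line.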


Since I have a soft spot for Windows, and since this shell script cannot be easily used on the Windows platform, I am working on a new binary that will be based on the 'date' command and work natively on Windows.

Sunday, May 18, 2008

My RNFs branched

This is a new branch from my RNF blog. I think the RNF blog is not an appropriate place for technical writing; things were getting too mixed up there!

Techie-mee is a pun on the Mini-Me character from the Austin Powers movies. This is a mini version of my main blog, dedicated to everything technical inside my head.

Update: The techie-mee blog has been renamed to gurjeet-tech, on the lines of gurjeet-rnf, the first blog. I think the new name makes more sense than the old one.