Wednesday, December 3, 2008

Extract all email addresses from your GMail IMAP

I wanted to extract all the email addresses in my GMail account. Sadly, GMail has no such facility; it lets you export only the addresses you have written emails to. So I tried http://vallery.net/gmail/ about a week ago, but I am yet to get the results! So I decided to get my hands dirty over the weekend.

I tried some Perl modules, but the installation was not clean (it needs 'make', some tests failed, and what not!). So I moved to PHP. I installed php5-cli and php5-imap on my Ubuntu machine, and added the line
extension=imap.so
to /etc/php5/cli/php.ini
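
Before running anything, it may help to confirm that the CLI actually picked up the extension. The following is only a sketch: has_imap is a hypothetical helper name, and it relies on `php -m`, which lists the loaded modules one per line.

```shell
#!/bin/sh
# Hypothetical helper (not part of the original script): succeeds if the
# PHP CLI reports the imap extension among its loaded modules.
has_imap() {
    php -m 2>/dev/null | grep -qi '^imap$'
}

if has_imap; then
    echo "imap extension is loaded"
else
    echo "imap extension is missing; check php.ini"
fi
```

If the check fails, the extension line most likely went into a php.ini that the CLI does not read.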

And then it was just a matter of playing around with the API described at http://in2.php.net/imap

This script scans all the emails in the 'All Mail' label of GMail (which includes all the emails in your account, even archived ones, but not those under the Spam and Trash labels). For each mail, it extracts the TO, CC, BCC, etc. fields (all the fields that may contain an email address) and prints the output on the screen in the following format:


SENDER& <timestamp>& <email_address>& <name>
TO& <timestamp>& <email_address>& <name>
REPLY_TO& <timestamp>& <email_address>& <name>


The & (followed by a space) separates the fields. The first field shows which part of the email the address was extracted from. The second field shows the timestamp of the email message. The third field shows the email address, and the last field shows the name associated with that address.

For example, if the email was sent like this:


From: Bob Lee <bob.lee@ggmail.com>
To: Jack Lee <lee.jack@ggmail.com>


then the extracted addresses will look like:


TO& Sun, January 23, 2008& lee.jack@ggmail.com& Jack Lee
FROM& Sun, January 23, 2008& bob.lee@ggmail.com& Bob Lee


Note that the standards are quite flexible, so only the first two fields are guaranteed to be present; the remaining fields can be empty.
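
To make the format concrete, here is a rough sketch of splitting one record into its four fields in shell. split_record is a hypothetical helper, not something the script provides. It assumes the fields are joined by "& " exactly as in the examples above, and it treats everything after the third separator as the name, so a name that itself contains "& " stays in one piece.

```shell
#!/bin/sh
# Hypothetical helper: print the four fields of one record, one per line.
# Only the first three "& " separators are split on; the remainder of the
# line (possibly empty) is printed as the name.
split_record() {
    printf '%s\n' "$1" | awk '{
        rest = $0
        for (n = 1; n <= 3; n++) {
            p = index(rest, "& ")
            if (p == 0) break
            print substr(rest, 1, p - 1)
            rest = substr(rest, p + 2)
        }
        print rest
    }'
}

split_record 'TO& Sun, January 23, 2008& lee.jack@ggmail.com& Jack Lee'
```

A plain `cut -d '&'` is enough when only the third field is needed; the helper is useful mainly when the name field matters.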

You can take the output of this script and process it any way you wish. You can use it to determine who you have talked to most, or how frequently you conversed with a person, etc. For example, I run the output file through a series of tools to get all the unique addresses, like so:


cut -d '&' -f 3 extracted_emails.csv | sed 's/^ //' | grep @ | sort -u > all_emails_sort_uniq.txt
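
The same output can also answer the "who have I talked to most" question. This is only a sketch under a few assumptions: the output was saved to a file such as extracted_emails.csv, and top_contacts is a hypothetical helper name.

```shell
#!/bin/sh
# Hypothetical helper: rank addresses by how often they appear in the
# extracted output. $1 is the file holding the script's output.
top_contacts() {
    cut -d '&' -f 3 "$1" |      # third field: " <email_address>"
        sed 's/^ //' |          # drop the space that follows the separator
        grep '@' |              # skip lines without an address
        sort | uniq -c |        # count identical addresses
        sort -rn |              # most frequent first
        awk '{ print $1, $2 }'  # normalize to "<count> <address>"
}

# Usage:
# top_contacts extracted_emails.csv | head -n 10
```

This counts every appearance of an address, whichever header field it came from. To count only the people you wrote to, filter the file on the first field (for example, lines starting with TO) before passing it in.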


The script can be customized for other IMAP providers, accounts, or folders by changing the first 6 lines of code in the script (after the license header).

Caveats:
.) You should have IMAP enabled in your GMail > Settings > Forwarding and POP/IMAP section.
.) If the last two lines of the output look like:


Warning: imap_headerinfo(): Bad message number in mine_emails_addrs_from_imap.php on line 31
empty header found at 18109


that means the extraction is complete.

But if you see only the 'empty header...' line, the connection was broken or something else went wrong, and the extraction did not complete. Pick the number in the last line (18109 in this case) and pass it to the script as its only argument; the script will then resume from where it stopped instead of redoing the whole thing (which may cause it to fail again somewhere). Repeat this until you see the Warning message in the last-but-one line.
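
The check-and-resume routine described above can be scripted. What follows is a sketch, not a tested wrapper: it assumes the script's output (including the PHP warning) is saved to a file, and extraction_complete and resume_point are hypothetical helper names.

```shell
#!/bin/sh
# Hypothetical helpers; both take the file holding the script's output.

# Succeeds if the last-but-one line is the imap_headerinfo() warning,
# which is the sign that the extraction ran to the end of the mailbox.
extraction_complete() {
    tail -n 2 "$1" | head -n 1 | grep -q 'imap_headerinfo(): Bad message number'
}

# Prints N from a final "empty header found at N" line, or nothing
# if the last line does not have that form.
resume_point() {
    tail -n 1 "$1" | sed -n 's/^empty header found at \([0-9][0-9]*\)$/\1/p'
}

# Usage sketch (script name taken from the warning message above):
# out=extracted_emails.csv
# php mine_emails_addrs_from_imap.php > "$out"
# until extraction_complete "$out"; do
#     n=$(resume_point "$out")
#     [ -n "$n" ] || break    # no resume point found; inspect by hand
#     php mine_emails_addrs_from_imap.php "$n" >> "$out"
# done
```

Appending leaves the intermediate 'empty header...' lines in the file, but they contain no @ and so drop out of any pipeline that greps for addresses.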

Using this script, I am thinking of providing a service similar to vallery.net's, but with more transparency and better response times. Let me know if you really need it.

Finally, here's the script:


#!/usr/bin/php
<?php
# Distributed under GPLv3 License, as published on
# http://www.opensource.org/licenses/gpl-3.0.html
# with the following substitutions:
#
# <AUTHOR> = Gurjeet Singh
# <YEAR> = 2008

$options = '/imap/ssl/novalidate-cert';

$user = 'singh.gurjeet@ggmail.com';
$password = 'xxxxxxxxxx';

$mailbox_string = '{imap.gmail.com:993' . $options . '}[Gmail]/All Mail';

echo "Connecting...\n";

$mbox = imap_open ( $mailbox_string, $user, $password )
or die("can't connect: " . imap_last_error());

echo "Fetching headers...\n";

# Optional argument: the number printed by a previous, interrupted run.
$i = ( $_SERVER["argc"] > 1 ) ? (int) $_SERVER["argv"][1] : 0;

# Loop until imap_headerinfo() fails; $i counts the messages already processed.
for( ; ; ++$i )
{
    #if( $i % 100 == 0 ){ sleep( 1 ); }

    $h = imap_headerinfo( $mbox, $i + 1 );

    if( empty( $h ) )
    {
        echo "empty header found at $i\n";
        break;
    }

    # Some messages carry no Date header.
    $date = isset( $h->date ) ? $h->date : '';

    # Every header field that may contain addresses; any of them can be absent.
    foreach( array( 'to', 'from', 'cc', 'bcc', 'reply_to', 'sender', 'return_path' ) as $field )
    {
        if( empty( $h->$field ) ) { continue; }

        foreach( $h->$field as $addr )
        {
            $mailbox = isset( $addr->mailbox ) ? $addr->mailbox : '';
            $host = isset( $addr->host ) ? $addr->host : '';
            $personal = isset( $addr->personal ) ? $addr->personal : '';

            echo strtoupper( $field ) . '& ' . $date . '& ' . $mailbox . '@' . $host . '& ' . $personal . "\n";
        }
    }
}

if( !imap_close( $mbox ) )
{
    echo "close failed: " . imap_last_error() . "\n";
}
?>
