h1s_l0rdsh1p
06-09-2007, 09:44 AM
But if anyone is willing to help, I say contact these guys.
I posted a thread about these guys last week, but I don't think anybody cared. Still, maybe someone here will want to help.
This is the email I got. You can sign up as a volunteer for free.
The project has received a CD of several dozen gigabytes of
multi-language email from an Asian government, made up of Unix mboxes
and Maildirs from various sources, many with scanned documents attached as
images. We also have received other large email collections from
other sources and expect email and email attachments to be our
primary source type.
We are looking for people to work on how to present this material so
it can be analyzed by the world community.
There seem to be several basic approaches:
1. static files rendered by hypermail / MHonArc / pipermail / etc
with code / CSS for a wikileaks frame and pointers back to the wiki
+ easy for mirrors, distributability, offline reading, fast
(reading)
+ fairly easy to index with Lucene search
+ easy for web-search engines to index
+ hypermail, at least, has good threading
- display is quite inflexible
- in testing several months ago, hypermail was not able to
correctly pick up the MIME type for some of the Asian languages
- extra disk space needed
- possibly too slow / memory hungry though this needs further
research
2. store the material on an IMAP/POP server modified to accept any
password for access, and modify a webmail package such as
SquirrelMail to read it and provide links back to the wiki; set up the
webmail reader so that it is spiderable by search engines
+ very flexible
+ backend store / POP is mostly future safe
+ proven approach for large volumes of email
- difficult to distribute
- maintenance
3. custom database / mail scanning code
+ great flexibility
+ possible to hook up with mediawiki / semantic mediawiki in very
interesting ways
+ a basic WL prototype for scanning / input / attachment extraction &
duplicate detection already done (in Python, but the author has had
to break for their PhD)
+ over 50% space savings realizable through duplicate
elimination / compression / demimeing. However space is increasingly
affordable.
+ useful in other projects
+ many search types possible with sql
- time consuming (programmer time)
- maintenance
- some degree of reinvention
- distributability, but a basic static snapshot would also be
easy to render
- not the simplest thing that will definitely work
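For anyone wondering what approach 3 might look like in practice, here is a minimal sketch of duplicate elimination plus SQL search. This is not the WL prototype mentioned in the email (I have not seen it). The table layout, the function names `body_digest` and `load_mbox`, and the choice to treat two messages as duplicates when their decoded bodies and attachments hash the same are all my own assumptions. It uses only the Python standard library (`mailbox`, `hashlib`, `sqlite3`).

```python
import hashlib
import mailbox
import sqlite3
from email.message import Message


def body_digest(msg: Message) -> str:
    """SHA-256 over the decoded payloads of all non-multipart parts.

    Headers are ignored on purpose (an assumption of this sketch), so the
    same body or attachment sent to several people hashes identically.
    """
    h = hashlib.sha256()
    for part in msg.walk():
        if part.is_multipart():
            continue  # containers carry no payload of their own
        h.update(part.get_payload(decode=True) or b"")
    return h.hexdigest()


def load_mbox(path: str, db: sqlite3.Connection) -> tuple[int, int]:
    """Store each distinct message once; return (stored, duplicates_skipped)."""
    db.execute(
        "CREATE TABLE IF NOT EXISTS messages ("
        "digest TEXT PRIMARY KEY, sender TEXT, subject TEXT)"
    )
    stored = skipped = 0
    for msg in mailbox.mbox(path):
        cur = db.execute(
            "INSERT OR IGNORE INTO messages VALUES (?, ?, ?)",
            (
                body_digest(msg),
                str(msg.get("From", "")),
                str(msg.get("Subject", "")),
            ),
        )
        # rowcount is 1 when the row was inserted, 0 when the digest existed
        if cur.rowcount == 1:
            stored += 1
        else:
            skipped += 1
    db.commit()
    return stored, skipped
```

Once the messages are in the table, a search is just a query, e.g. `SELECT subject FROM messages WHERE sender LIKE '%example.org%'`. A real version would probably also store the raw message and split attachments into their own table.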
In all these approaches we are also looking for an initial filtering
solution (possibly just scripts and configuration) to remove spam,
bulk emails and viruses, which make up a large part of the content. The
problem space is somewhat easier than spam filtering for regular
email: we are interested in removing all bulk email, even
legitimate mailing-list traffic, since widely sent email is already
public, and even a 1% false positive rate is tolerable. This means we
should be able to use distributed checksums (e.g. DCC) without a threshold.
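To give a rough idea of the checksum approach, here is a small local sketch. It is not DCC and does not talk to any DCC server; it only imitates the idea of counting how many mailboxes contain a message with the same checksum. The names `fuzzy_digest` and `find_bulk`, the whitespace/lowercase normalization, and the `max_mailboxes` parameter are all assumptions made for illustration.

```python
import hashlib
import re
from collections import defaultdict


def fuzzy_digest(body: str) -> str:
    """Hash the body after lowercasing and collapsing whitespace,
    so trivially reformatted copies get the same checksum."""
    normalized = re.sub(r"\s+", " ", body.lower()).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def find_bulk(messages, max_mailboxes=1):
    """messages: iterable of (mailbox_name, body) pairs.

    Return the set of digests seen in more than max_mailboxes distinct
    mailboxes. With the default of 1, anything that reached two or more
    mailboxes is treated as bulk, which is the aggressive setting the
    email describes as tolerable here.
    """
    seen = defaultdict(set)
    for mailbox_name, body in messages:
        seen[fuzzy_digest(body)].add(mailbox_name)
    return {d for d, boxes in seen.items() if len(boxes) > max_mailboxes}
```

For example, feeding it `("alice", "Buy  NOW")` and `("bob", "buy now\n")` puts both under one digest and flags it, while a note that only appears in one mailbox is left alone.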
If you find this interesting let us know.
J.
Now, it's just too advanced for me, and I'm busy with my own projects as well. But if anybody is interested, please let them know, and help these guys out.
-T(R)E
PS
Forgot to say, the website is wikileaks.org
-T(R)E