Fwd: [tahoe-dev] [p2p-hackers] convergent encryption reconsidered
zooko
zooko at zooko.com
Fri Mar 21 02:12:36 EDT 2008
Dear Perry Metzger:
Jim McCoy asked me to forward this, as he is not subscribed to
cryptography at metzdowd.com, so his posting bounced.
Regards,
Zooko
Begin forwarded message:
> From: Jim McCoy <mccoy at mad-scientist.com>
> Date: March 20, 2008 10:56:58 PM MDT
> To: theory and practice of decentralized computer networks <p2p-
> hackers at lists.zooko.com>
> Cc: tahoe-dev at allmydata.org, Cryptography <cryptography at metzdowd.com>
> Subject: Re: [tahoe-dev] [p2p-hackers] convergent encryption
> reconsidered
> Reply-To: tahoe-dev at allmydata.org
>
>
> On Mar 20, 2008, at 12:42 PM, zooko wrote:
>
>> Security engineers have always appreciated that convergent
>> encryption allows an attacker to perform a
>> confirmation-of-a-file attack -- if the attacker already knows
>> the full plaintext of a file, then they can check whether a
>> given user has a copy of that file.
>
> The truth of this depends on implementation details, and the
> assertion cannot be said to cover all, or even most, of the
> potential use-cases for this technique. The property only holds if
> it is possible for the attacker to link a selected ciphertext/file to
> a user. Systems which use convergent encryption to populate a shared
> storage pool _might_ have this property, but that is by no means a
> certainty; if a system is implemented correctly it is not necessary
> for users to expose their list of files in order to maintain this
> shared storage space.
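
The confirmation-of-a-file attack being discussed falls directly out of the determinism of convergent encryption. A minimal sketch in Python, with SHA-256 standing in for the real key and storage-index derivations (actual systems such as Tahoe derive these differently):

```python
# Minimal sketch of convergent encryption's determinism and the resulting
# confirmation-of-a-file attack. SHA-256 stands in for the real key and
# index derivations used by systems like Tahoe.
import hashlib

def convergent_key(plaintext: bytes) -> bytes:
    # key = H(plaintext): identical files always produce identical keys
    return hashlib.sha256(plaintext).digest()

def storage_index(plaintext: bytes) -> bytes:
    # Index under which the ciphertext is stored; derived from the key,
    # so it too is identical for identical plaintexts.
    return hashlib.sha256(b"index:" + convergent_key(plaintext)).digest()

# A user stores a file; the pool only ever sees indexes and ciphertexts.
pool = {storage_index(b"well-known leaked memo")}

# An attacker who already holds the full plaintext can confirm storage:
assert storage_index(b"well-known leaked memo") in pool
# ...but learns nothing about files whose plaintext they lack:
assert storage_index(b"my private diary") not in pool
```

Note that, per the objection above, the attack only identifies *that the pool holds the file*; linking the index to a particular user is a separate step the system may or may not make possible.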
>
>> basically people can tell which files you are storing if they
>> are "publically known" files, but they can't learn anything
>> about your own personal files.
>
> It sounds like you have a design problem. If nodes that participate
> in the system can distinguish between publication and
> _re_-publication/replication (or whatever you want to call the
> random sharing of arbitrary data blocks for the purposes of
> increasing file availability) then you have a problem. If these two
> activities are indistinguishable, then an observer knows you hold
> some blocks of a file but should not be able to distinguish between
> you publishing the blocks and the act of re-distribution to increase
> block availability.
>
>> The Learn-Partial-Information Attack [...]
>
> A better title for this would be "Chosen-Plaintext Attack on
> Convergent Encryption", since what you are describing is really a
> chosen-plaintext attack -- more specifically, just a version of a
> standard dictionary attack. The solution is to look at similar
> systems that suffered from dictionary attacks and see what
> countermeasures were created to solve the problem.
>
> The most widely known and studied version of this is the old crypt()/
> passwd problem.
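
The crypt()/passwd comparison can be made concrete: just as a password cracker hashes every word in a dictionary and compares against the password file, an attacker here convergently keys every candidate file and compares the resulting indexes against what the pool holds. A sketch, with SHA-256 standing in for the convergent key/index derivation:

```python
# Sketch of the dictionary attack analogous to the old crypt()/passwd
# problem: derive the convergent index for every candidate plaintext and
# compare against observed indexes. SHA-256 stands in for the real
# derivation.
import hashlib

def index_of(plaintext: bytes) -> bytes:
    # key/index = H(plaintext), so the attacker can compute it too
    return hashlib.sha256(plaintext).digest()

# Indexes the attacker has observed in the storage pool:
observed_indexes = {index_of(b"standard OS file contents")}

# The attacker's "dictionary" of guessable files:
dictionary = [b"some other file",
              b"standard OS file contents",
              b"a third file"]

hits = [p for p in dictionary if index_of(p) in observed_indexes]
assert hits == [b"standard OS file contents"]
```

As with passwd cracking, the attack's cost scales with the size of the dictionary, which is why the countermeasures below (added secrets, expensive key derivation) parallel salting and slow hashes.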
>
>> For another example, if you use Tahoe to backup your entire
>> home directory, or your entire filesystem, then the attacker
>> gains the opportunity to try to learn partial information about
>> various files which are of predictable format but have
>> sensitive fields in them, such as .my.cnf (MySQL configuration
>> files), .htpasswd, .cvspass, .netrc, web browser cookie files,
>> etc..
>
> The problems with this imagined attack are twofold. I will use your
> Tahoe example for my explanation because I have a passing familiarity
> with the architecture. The first problem is isolating the original
> ciphertext in the pool of storage. If a file is encrypted using
> convergent encryption and then run through an error-correction
> mechanism to generate a number of shares, an attacker first needs to
> be able to isolate those shares in order to regenerate the original
> ciphertext. FEC decoding speeds may be reasonably fast, but they are
> not without some cost. If the storage pool is sufficiently large and
> you are doing your job to limit the ability of an attacker to see
> which blocks are linked to the same FEC operation, then the
> computational complexity of this attack is significantly higher than
> you suggest.
>
> Assuming an all-seeing oracle who can watch every bit sent into the
> storage pool will get us around this first problem, but it does raise
> the bar for potential attackers.
>
> The second problem an attacker now faces is deciding what sort of
> format a file might have, what the low-entropy content might be, and
> then filling in values for these unknowns. If your block size is
> small (and I mean "really small" in the context of the sort of
> systems we are talking about), there might be only a few kilobits of
> entropy in the first couple of blocks of a file, so either a
> rainbow-table attack on known file formats or a dedicated effort to
> grab a specific file might be possible -- but this is by no means
> certain. Increase your block size and this problem becomes much
> harder for the attacker.
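
The learn-partial-information attack under discussion amounts to brute-forcing the low-entropy field of a known file format. A sketch, using a hypothetical one-field template and SHA-256 as the convergent index derivation; the `TEMPLATE` format and the 4-digit field are illustrative assumptions:

```python
# Sketch of the learn-partial-information attack: the attacker knows the
# file format (a template) and brute-forces its one low-entropy field,
# comparing each candidate's convergent index against the observed one.
# TEMPLATE and the 4-digit field are hypothetical; SHA-256 stands in for
# the real index derivation.
import hashlib

def index_of(plaintext: bytes) -> bytes:
    return hashlib.sha256(plaintext).digest()

TEMPLATE = "password = {pin}\n"  # hypothetical predictable file format

victim_file = TEMPLATE.format(pin="4871").encode()
observed = index_of(victim_file)  # what the attacker sees in the pool

# 10**4 candidates is trivial; a high-entropy field would not be.
recovered = next(
    pin for pin in (f"{i:04d}" for i in range(10000))
    if index_of(TEMPLATE.format(pin=pin).encode()) == observed
)
assert recovered == "4871"
```

This also illustrates the block-size point above: the attack is only cheap when the unknown portion of the candidate block carries little entropy.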
>
>> Defense Against Both Attacks
>>
>> [...]
>> However, we can do better than that by creating a secret value
>> and mixing that value into the per-file encryption key (so
>> instead of symmetric_key = H(plaintext), you have symmetric_key
>> = H(added_secret, plaintext), where "," denotes an unambiguous
>> encoding of both operands). This idea is due to Brian Warner
>> and Drew Perttula.
>
> Bad idea, for a couple of reasons. The first problem is that you are
> assuming the added secret adds a significant amount of entropy, and
> the second is that you are throwing out the advantage of convergent
> encryption. If the secret is shared across multiple files, then I
> can run a known-plaintext attack on your system using a file that I
> assume I have in common with my target (e.g. a standard, small OS
> file) to get their added_secret, and once I know my target's secret
> I move on to the file I really want to go after. If your userbase
> consists of paranoid cypherpunks then the added secret just becomes
> a lookup in a rainbow table, and if they are the sort of userbase
> that some people like to call "the consumer market" then the entropy
> being added here is negligible.
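
The objection can be sketched directly: under the proposed `symmetric_key = H(added_secret, plaintext)` scheme, a low-entropy secret is recoverable by brute force from a single file the attacker shares with the target. The 4-digit secret and the length-prefixed encoding below are illustrative assumptions:

```python
# Sketch of the added-secret scheme, key = H(added_secret, plaintext),
# and of the objection to it: a low-entropy secret can be brute-forced
# from one known shared file, after which the earlier attacks resume.
# The 4-digit secret space and length-prefix encoding are illustrative.
import hashlib

def keyed_index(secret: bytes, plaintext: bytes) -> bytes:
    # "," as an unambiguous encoding: length-prefix the secret
    encoded = len(secret).to_bytes(4, "big") + secret + plaintext
    return hashlib.sha256(encoded).digest()

weak_secret = b"1234"                   # low-entropy user secret
known_file = b"standard small OS file"  # file attacker shares with target
observed = keyed_index(weak_secret, known_file)

# Attacker enumerates the (small) secret space using the known plaintext:
recovered = next(
    s for s in (str(i).encode() for i in range(10000))
    if keyed_index(s, known_file) == observed
)
assert recovered == weak_secret
```

With a genuinely high-entropy per-user secret the enumeration becomes infeasible, which is the crux of the disagreement: the scheme's strength rests entirely on the entropy of `added_secret`.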
>
> A better solution would be to look at something like bcrypt() (see "A
> Future-Adaptable Password Scheme") and use this mechanism for files
> below a certain threshold. I do not think that the expensive key
> scheduling operation discounts the use of CTR mode, so you are safe on
> that side of things, but it would mean that you step away from AES...
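
The bcrypt() suggestion amounts to deriving the convergent key through a deliberately expensive function so that each dictionary guess carries a real cost. A sketch using scrypt (available in Python's standard library) as a stand-in for bcrypt; the salt string and cost parameters are illustrative:

```python
# Sketch of McCoy's suggestion: derive the convergent key through a
# deliberately expensive KDF for small files. scrypt (Python stdlib)
# stands in for bcrypt; the salt and cost parameters are illustrative.
import hashlib

def slow_convergent_key(plaintext: bytes) -> bytes:
    digest = hashlib.sha256(plaintext).digest()
    # A fixed, public salt keeps the key convergent (same file -> same
    # key), while the scrypt cost parameters slow each dictionary guess.
    return hashlib.scrypt(digest, salt=b"convergent-v1",
                          n=2**14, r=8, p=1, dklen=32)

# Still convergent: identical plaintexts yield identical keys...
assert slow_convergent_key(b"same file") == slow_convergent_key(b"same file")
# ...and distinct plaintexts yield distinct keys, but every guess in a
# dictionary attack now pays the full scrypt cost.
assert slow_convergent_key(b"same file") != slow_convergent_key(b"other file")
```

The trade-off is that legitimate users pay the same per-file cost on upload, which is why the suggestion limits it to files below a size threshold.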
>
>> A Comment About Space Savings
>>
>> One of the original motivations for convergent encryption, as
>> expressed in Douceur's technical report, is to conserve storage
>> space by coalescing identical files. [...] My suspicion is
>> that the gains available for modern uses are nowhere near that
>> good -- I wouldn't be surprised if it were less than 5% [...]
>
> I would be very, very surprised if you were correct.
>
>> My reasoning is:
>> 1. The Douceur et al. test set was probably the ideal test set
>> to highlight the advantages of coalescing: it was hundreds
>> of workstations, which probably all followed the same
>> Information Technology Department policies, which were
>> probably homogeneous, and which were probably packed with
>> many of the same software products (especially Microsoft
>> products).
>
> Are you suggesting the rest of the world uses software products that
> are radically different than those used in a corporate environment :)
> We all use a small set of operating systems, that share the same basic
> set of files. We all use the same small set of application software
> and even if there is a "long tail" in application software the amount
> of storage this occupies compared to the operating system and most
> common applications is likely to be a very small fraction of the total
> being stored.
>
>> 2. Movies and music. In 2002, on Windows workstations in an
>> office at Microsoft, there probably weren't any movies. For
>> some of the probable use cases for Tahoe, movies and
>> collections of music are the largest component. The
>> presence of movie files, which are large, could increase or
>> decrease the proportion of savings from coalescing files,
>> depending on the degree to which those files are shared
>> among multiple users, which brings us to the next point:
>
> This was actually a concern of mine when originally designing the
> system, but for a different reason. At that time the large content
> files we were concerned with were music. We knew that the amount of
> available music was a bounded total, but at that point in time the
> world had not converged on a standard set of codecs and standard
> bitrates for these codecs. We worried about the fact that the same
> song would end up with multiple copies in the pool because some people
> would use LAME, some would use WinAMP, others a pirated Fraunhofer
> codec, and none of them would use the same bitrate. This problem
> still exists somewhat, but the fact that most people use only a few
> packages for CD ripping, codecs have standardized, and most encoding
> uses only a few standard bitrates makes the job much easier and the
> storage savings to be realized significantly larger.
>
> When it comes to movies, the fact that a lot of them are *cough*
> "distributed" from a few master copies and seldom re-encoded means
> that you are likely to see many copies of the same bits when it comes
> to this category. DVD ripping is also a fairly standardized process,
> so you win again.
>
> Given the fact that people are not buying terabyte drives to store
> their scans of Proust folios I think we can make an educated guess as
> to what is taking up a lot of space on these drives and how much
> duplication there is among these bits.
>
>> 3. The Long Tail; Today there seems to be a more diverse set
>> of things taking up a bigger portion of the total.
>
> For all the talk of "the long tail" it seems that we all tend to visit
> the same set of YouTube clips and archive a lot of the same
> information. More info is being generated by diverse groups, but the
> amount being generated still seems to be growing a lot slower than
> available storage capacity. One side-effect of the long-tail is that
> as groups become more diverse in their taste in information they still
> tend to share the information that is specific to their interests.
> You might see the storage advantages decline somewhat until you have a
> significant userbase, but when considered in relation to point #2 this
> is really a minor issue.
>
>> 4. Personally-produced photos and movies. It seems like some
>> of the files that people are most eager to back up and to
>> share are the product of their own cameras. Such files will
>> typically be shared by only one or a few users.
>
> This is the only point I can agree upon. Personally-generated media
> (which is distinct from "user-generated content") is what makes
> individual systems unique. On a generic user's computer this breaks
> down into three basic categories:
>
> -Personal documents
> -Photos
> -"Home" movies
>
> The size of the first is negligible. The second is large and getting
> larger, and the third is probably a lot smaller than most of us
> think. Photos are really the only category that will skew this
> result.
>
> In conclusion I would have to say that the sky is not really falling.
> The attack outlined is a known problem that can be solved in ways that
> still preserve the advantages of convergent encryption.
>
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com