Fwd: [tahoe-dev] [p2p-hackers] convergent encryption reconsidered
zooko
zooko at zooko.com
Fri Mar 21 02:12:36 EDT 2008
Dear Perry Metzger:
Jim McCoy asked me to forward this, as he is not subscribed to
cryptography at metzdowd.com, so his posting bounced.
Regards,
Zooko
Begin forwarded message:
> From: Jim McCoy <mccoy at mad-scientist.com>
> Date: March 20, 2008 10:56:58 PM MDT
> To: theory and practice of decentralized computer networks <p2p-
> hackers at lists.zooko.com>
> Cc: tahoe-dev at allmydata.org, Cryptography <cryptography at metzdowd.com>
> Subject: Re: [tahoe-dev] [p2p-hackers] convergent encryption
> reconsidered
> Reply-To: tahoe-dev at allmydata.org
>
>
> On Mar 20, 2008, at 12:42 PM, zooko wrote:
>
>> Security engineers have always appreciated that convergent
>> encryption allows an attacker to perform a
>> confirmation-of-a-file attack -- if the attacker already knows
>> the full plaintext of a file, then they can check whether a
>> given user has a copy of that file.
>
> The truth of this depends on implementation details, and the
> assertion cannot be said to cover all, or even most, of the
> potential use-cases for this technique. The property only holds if
> it is possible for the attacker to link a selected ciphertext/file to
> a user. Systems which use convergent encryption to populate a shared
> storage pool _might_ have this property, but that is by no means a
> certainty; if a system is implemented correctly it is not necessary
> for users to expose their list of files in order to maintain this
> shared storage space.
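
The confirmation-of-a-file attack being discussed falls directly out of the determinism of convergent encryption. A minimal sketch in Python, with SHA-256 standing in for the real key and storage-index derivations (actual systems such as Tahoe derive these differently):

```python
# Minimal sketch of convergent encryption's determinism and the resulting
# confirmation-of-a-file attack. SHA-256 stands in for the real key and
# index derivations used by systems like Tahoe.
import hashlib

def convergent_key(plaintext: bytes) -> bytes:
    # key = H(plaintext): identical files always produce identical keys
    return hashlib.sha256(plaintext).digest()

def storage_index(plaintext: bytes) -> bytes:
    # Index under which the ciphertext is stored; derived from the key,
    # so it too is identical for identical plaintexts.
    return hashlib.sha256(b"index:" + convergent_key(plaintext)).digest()

# A user stores a file; the pool only ever sees indexes and ciphertexts.
pool = {storage_index(b"well-known leaked memo")}

# An attacker who already holds the full plaintext can confirm storage:
assert storage_index(b"well-known leaked memo") in pool
# ...but learns nothing about files whose plaintext they lack:
assert storage_index(b"my private diary") not in pool
```

Note that, per the objection above, the attack only identifies *that the pool holds the file*; linking the index to a particular user is a separate step the system may or may not make possible.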
>
>> basically people can tell which files you are storing if they
>> are "publically known" files, but they can't learn anything
>> about your own personal files.
>
> It sounds like you have a design problem. If nodes that participate
> in the system can distinguish between publication and
> _re_-publication/replication (or whatever you want to call the
> random sharing of arbitrary data blocks for the purposes of
> increasing file availability) then you have a problem. If these two
> activities are indistinguishable, then an observer knows you hold
> some blocks of a file but should not be able to distinguish between
> you publishing the blocks and the act of re-distribution to increase
> block availability.
>
>> The Learn-Partial-Information Attack [...]
>
> A better title for this would be "Chosen-Plaintext Attack on
> Convergent Encryption", since what you are describing is really a
> chosen-plaintext attack -- more specifically, just a version of a
> standard dictionary attack. The solution is to look at similar
> systems that suffered from dictionary attacks and see what
> countermeasures were created to solve the problem.
>
> The most widely known and studied version of this is the old crypt()/
> passwd problem.
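
The crypt()/passwd comparison can be made concrete: just as a password cracker hashes every word in a dictionary and compares against the password file, an attacker here convergently keys every candidate file and compares the resulting indexes against what the pool holds. A sketch, with SHA-256 standing in for the convergent key/index derivation:

```python
# Sketch of the dictionary attack analogous to the old crypt()/passwd
# problem: derive the convergent index for every candidate plaintext and
# compare against observed indexes. SHA-256 stands in for the real
# derivation.
import hashlib

def index_of(plaintext: bytes) -> bytes:
    # key/index = H(plaintext), so the attacker can compute it too
    return hashlib.sha256(plaintext).digest()

# Indexes the attacker has observed in the storage pool:
observed_indexes = {index_of(b"standard OS file contents")}

# The attacker's "dictionary" of guessable files:
dictionary = [b"some other file",
              b"standard OS file contents",
              b"a third file"]

hits = [p for p in dictionary if index_of(p) in observed_indexes]
assert hits == [b"standard OS file contents"]
```

As with passwd cracking, the attack's cost scales with the size of the dictionary, which is why the countermeasures below (added secrets, expensive key derivation) parallel salting and slow hashes.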
>
>> For another example, if you use Tahoe to backup your entire
>> home directory, or your entire filesystem, then the attacker
>> gains the opportunity to try to learn partial information about
>> various files which are of predictable format but have
>> sensitive fields in them, such as .my.cnf (MySQL configuration
>> files), .htpasswd, .cvspass, .netrc, web browser cookie files,
>> etc..
>
> The problems with this imagined attack are twofold. I will use your
> Tahoe example for my explanation because I have a passing familiarity
> with the architecture. The first problem is isolating the original
> ciphertext in the pool of storage. If a file is encrypted using
> convergent encryption and then run through an error-correction
> mechanism to generate a number of shares, an attacker first needs to
> be able to isolate those shares in order to regenerate the original
> ciphertext. FEC decoding speeds may be reasonably fast, but they are
> not without some cost. If the storage pool is sufficiently large and
> you are doing your job to limit the ability of an attacker to see
> which blocks are linked to the same FEC operation, then the
> computational complexity of this attack is significantly higher than
> you suggest.
>
> Assuming an all-seeing oracle who can watch every bit sent into the
> storage pool will get us around this first problem, but it does raise
> the bar for potential attackers.
>
> The second problem an attacker now faces is deciding what sort of
> format a file might have, what the low-entropy content might be, and
> then filling in values for these unknowns. If your block size is
> small (and I mean "really small" in the context of the sort of
> systems we are talking about), there might be only a few kilobits of
> entropy in the first couple of blocks of a file, so either a
> rainbow-table attack on known file formats or a dedicated effort to
> grab a specific file might be possible -- but this is by no means
> certain. Increase your block size and this problem becomes much
> harder for the attacker.
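
The learn-partial-information attack under discussion amounts to brute-forcing the low-entropy field of a known file format. A sketch, using a hypothetical one-field template and SHA-256 as the convergent index derivation; the `TEMPLATE` format and the 4-digit field are illustrative assumptions:

```python
# Sketch of the learn-partial-information attack: the attacker knows the
# file format (a template) and brute-forces its one low-entropy field,
# comparing each candidate's convergent index against the observed one.
# TEMPLATE and the 4-digit field are hypothetical; SHA-256 stands in for
# the real index derivation.
import hashlib

def index_of(plaintext: bytes) -> bytes:
    return hashlib.sha256(plaintext).digest()

TEMPLATE = "password = {pin}\n"  # hypothetical predictable file format

victim_file = TEMPLATE.format(pin="4871").encode()
observed = index_of(victim_file)  # what the attacker sees in the pool

# 10**4 candidates is trivial; a high-entropy field would not be.
recovered = next(
    pin for pin in (f"{i:04d}" for i in range(10000))
    if index_of(TEMPLATE.format(pin=pin).encode()) == observed
)
assert recovered == "4871"
```

This also illustrates the block-size point above: the attack is only cheap when the unknown portion of the candidate block carries little entropy.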
>
>> Defense Against Both Attacks
>>
>> [...]
>> However, we can do better than that by creating a secret value
>> and mixing that value into the per-file encryption key (so
>> instead of symmetric_key = H(plaintext), you have symmetric_key
>> = H(added_secret, plaintext), where "," denotes an unambiguous
>> encoding of both operands). This idea is due to Brian Warner
>> and Drew Perttula.
>
> Bad idea, for a couple of reasons. The first problem is that you are
> assuming the added secret adds a significant amount of entropy, and
> the second is that you are throwing out the advantage of convergent
> encryption. If the secret is shared across multiple files, then I
> can run a known-plaintext attack on your system using a file that I
> assume I have in common with my target (e.g. a standard, small OS
> file) to get their added_secret, and once I know my target's secret
> I move on to the file I really want to go after. If your userbase
> consists of paranoid cypherpunks then the added secret just becomes
> a lookup in a rainbow table, and if they are the sort of userbase
> that some people like to call "the consumer market" then the entropy
> being added here is negligible.
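
The objection can be sketched directly: under the proposed `symmetric_key = H(added_secret, plaintext)` scheme, a low-entropy secret is recoverable by brute force from a single file the attacker shares with the target. The 4-digit secret and the length-prefixed encoding below are illustrative assumptions:

```python
# Sketch of the added-secret scheme, key = H(added_secret, plaintext),
# and of the objection to it: a low-entropy secret can be brute-forced
# from one known shared file, after which the earlier attacks resume.
# The 4-digit secret space and length-prefix encoding are illustrative.
import hashlib

def keyed_index(secret: bytes, plaintext: bytes) -> bytes:
    # "," as an unambiguous encoding: length-prefix the secret
    encoded = len(secret).to_bytes(4, "big") + secret + plaintext
    return hashlib.sha256(encoded).digest()

weak_secret = b"1234"                   # low-entropy user secret
known_file = b"standard small OS file"  # file attacker shares with target
observed = keyed_index(weak_secret, known_file)

# Attacker enumerates the (small) secret space using the known plaintext:
recovered = next(
    s for s in (str(i).encode() for i in range(10000))
    if keyed_index(s, known_file) == observed
)
assert recovered == weak_secret
```

With a genuinely high-entropy per-user secret the enumeration becomes infeasible, which is the crux of the disagreement: the scheme's strength rests entirely on the entropy of `added_secret`.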
>
> A better solution would be to look at something like bcrypt() (see "A
> Future-Adaptable Password Scheme") and use this mechanism for files
> below a certain threshold. I do not think that the expensive key
> scheduling operation discounts the use of CTR mode, so you are safe on
> that side of things, but it would mean that you step away from AES...
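
The bcrypt() suggestion amounts to deriving the convergent key through a deliberately expensive function so that each dictionary guess carries a real cost. A sketch using scrypt (available in Python's standard library) as a stand-in for bcrypt; the salt string and cost parameters are illustrative:

```python
# Sketch of McCoy's suggestion: derive the convergent key through a
# deliberately expensive KDF for small files. scrypt (Python stdlib)
# stands in for bcrypt; the salt and cost parameters are illustrative.
import hashlib

def slow_convergent_key(plaintext: bytes) -> bytes:
    digest = hashlib.sha256(plaintext).digest()
    # A fixed, public salt keeps the key convergent (same file -> same
    # key), while the scrypt cost parameters slow each dictionary guess.
    return hashlib.scrypt(digest, salt=b"convergent-v1",
                          n=2**14, r=8, p=1, dklen=32)

# Still convergent: identical plaintexts yield identical keys...
assert slow_convergent_key(b"same file") == slow_convergent_key(b"same file")
# ...and distinct plaintexts yield distinct keys, but every guess in a
# dictionary attack now pays the full scrypt cost.
assert slow_convergent_key(b"same file") != slow_convergent_key(b"other file")
```

The trade-off is that legitimate users pay the same per-file cost on upload, which is why the suggestion limits it to files below a size threshold.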
>
>> A Comment About Space Savings
>>
>> One of the original motivations for convergent encryption, as
>> expressed in Douceur's technical report, is to conserve storage
>> space by coalescing identical files. [...] My suspicion is
>> that the gains available for modern uses are nowhere near that
>> good -- I wouldn't be surprised if it were less than 5% [...]
>
> I would be very, very surprised if you were correct.
>
>> My reasoning is:
>> 1. The Douceur et al. test set was probably the ideal test set
>> to highlight the advantages of coalescing: it was hundreds
>> of workstations, which probably all followed the same
>> Information Technology Department policies, which were
>> probably homogeneous, and which were probably packed with
>> many of the same software products (especially Microsoft
>> products).
>
> Are you suggesting the rest of the world uses software products that
> are radically different than those used in a corporate environment :)
> We all use a small set of operating systems, that share the same basic
> set of files. We all use the same small set of application software
> and even if there is a "long tail" in application software the amount
> of storage this occupies compared to the operating system and most
> common applications is likely to be a very small fraction of the total
> being stored.
>
>> 2. Movies and music. In 2002, on Windows workstations in an
>> office at Microsoft, there probably weren't any movies. For
>> some of the probable use cases for Tahoe, movies and
>> collections of music are the largest component. The
>> presence of movie files, which are large, could increase or
>> decrease the proportion of savings from coalescing files,
>> depending on the degree to which those files are shared
>> among multiple users, which brings us to the next point:
>
> This was actually a concern of mine when originally designing the
> system, but for a different reason. At that time the large content
> files we were concerned with were music. We knew that the amount of
> available music was a bounded total, but at that point in time the
> world had not converged on a standard set of codecs and standard
> bitrates for these codecs. We worried about the fact that the same
> song would end up with multiple copies in the pool because some people
> would use LAME, some would use WinAMP, others a pirated Fraunhofer
> codec, and none of them would use the same bitrate. This problem
> still exists somewhat, but the fact that most people use only a few
> packages for CD ripping, codecs have standardized, and most encoding
> uses only a few standard bitrates makes the job much easier and the
> storage savings to be realized significantly larger.
>
> When it comes to movies, the fact that a lot of them are *cough*
> "distributed" from a few master copies and seldom re-encoded means
> that you are likely to see many copies of the same bits when it comes
> to this category. DVD ripping is also a fairly standardized process,
> so you win again.
>
> Given the fact that people are not buying terabyte drives to store
> their scans of Proust folios I think we can make an educated guess as
> to what is taking up a lot of space on these drives and how much
> duplication there is among these bits.
>
>> 3. The Long Tail; Today there seems to be a more diverse set
>> of things taking up a bigger portion of the total.
>
> For all the talk of "the long tail" it seems that we all tend to visit
> the same set of YouTube clips and archive a lot of the same
> information. More info is being generated by diverse groups, but the
> amount being generated still seems to be growing a lot slower than
> available storage capacity. One side-effect of the long-tail is that
> as groups become more diverse in their taste in information they still
> tend to share the information that is specific to their interests.
> You might see the storage advantages decline somewhat until you have a
> significant userbase, but when considered in relation to point #2 this
> is really a minor issue.
>
>> 4. Personally-produced photos and movies. It seems like some
>> of the files that people are most eager to back up and to
>> share are the product of their own cameras. Such files will
>> typically be shared by only one or a few users.
>
> This is the only point I can agree upon. Personally-generated media
> (which is distinct from "user-generated content") is what makes
> individual systems unique. On a generic user's computer this breaks
> down into three basic categories:
>
> -Personal documents
> -Photos
> -"Home" movies
>
> The size of the first is negligible. The second is large and getting
> larger, and the third is probably a lot smaller than most of us
> think. Photos are really the only category that will skew this
> result.
>
> In conclusion I would have to say that the sky is not really falling.
> The attack outlined is a known problem that can be solved in ways that
> still preserve the advantages of convergent encryption.
>
> _______________________________________________
> tahoe-dev mailing list
> tahoe-dev at allmydata.org
> http://allmydata.org/cgi-bin/mailman/listinfo/tahoe-dev
---------------------------------------------------------------------
The Cryptography Mailing List
Unsubscribe by sending "unsubscribe cryptography" to majordomo at metzdowd.com