Ways to find a torrent file using hash of desired file?

Rinzwind42
Posts: 1
Joined: Thu Apr 01, 2010 4:02 pm

Ways to find a torrent file using hash of desired file?

Post by Rinzwind42 »

Are there any ways to, given the MD5, SHA1 or similar hash of a file, find a torrent file which will download a file with that hash? I'm specifically asking because I'm developing a Mac application to compute, store and verify file hashes. I'm going to add a feature to look up hashes through sites such as Bitzi and Dynmirror. I would like to also include a site, or some other way, for finding a torrent file based on a file hash. (The hash of a file downloaded through that torrent file, not the bittorrent info hash of the torrent file itself.)

I'm not sure I understand all the details of the BitTorrent protocol and the use of BitTorrent info hashes. But as far as I understand from this informal specification, BitTorrent applications are allowed to record MD5 sums of files in the 'md5sum' field of the info dictionary. So what I'm looking for should be technically possible, even if no site or other means of looking up torrent files by MD5 hash exists yet. If nothing like that exists, are there any plans to implement it?

BTW, the application is called Galton and is available for download if you click that link.

( Sorry, this is more of a general BitTorrent 'development' question than a request for Transmission 'support'; but I didn't know what forum would be more appropriate to ask this in. )
decipher
Posts: 1
Joined: Sat Apr 18, 2015 9:41 pm

Re: Ways to find a torrent file using hash of desired file?

Post by decipher »

Without deviating from the BitTorrent protocol specification, there is no reliable way of finding files by hash.

When a torrent is created, the associated files are not hashed individually. They are hashed as though all the files had been concatenated one after another without padding.

If the piece size is 16 KB (the minimum size of a file from which you can create a single-file torrent, I think) and you have 4 files of 4 KB each, the resulting bencoded torrent would contain only one SHA-1 piece hash (20 bytes). What happened? The torrent creation utility kept reading across the files until it had at least 16 KB of data to hash.
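
A minimal sketch of that behaviour, assuming the files are already in memory as byte strings. The helper name piece_hashes and the dummy file contents are made up for illustration; a real client would stream the data from disk.

```python
import hashlib

def piece_hashes(files, piece_length):
    """SHA-1 each piece of the files, treated as one concatenated byte stream."""
    stream = b"".join(files)  # files follow one another with no padding
    return [
        hashlib.sha1(stream[i:i + piece_length]).digest()
        for i in range(0, len(stream), piece_length)
    ]

# Four hypothetical 4 KB files and a 16 KB piece size
files = [bytes([n]) * 4096 for n in range(4)]
hashes = piece_hashes(files, 16 * 1024)

print(len(hashes))      # number of piece hashes
print(len(hashes[0]))   # size of one SHA-1 digest in bytes
print(hashes[0] == hashlib.sha1(files[0]).digest())  # piece hash vs. hash of the first file alone
```

The single piece hash covers all four files at once, so it does not equal the SHA-1 of any one of them.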

Another example:

piece size: 1 MB
files: 4
file_a size: 3 MB
file_b size: 2 MB
file_c size: 1023 KB
file_d size: 1025 KB

There would be 3 piece hashes from file_a and an additional 2 from file_b. When the program got to file_c it would have 1 KB less than the piece size, so it would append the first 1 KB of file_d to that piece's data and hash it, leaving the remaining 1024 KB of file_d to be hashed as the last piece. In the end there would be 7 piece hashes concatenated in the torrent file.

If you were to hash file_c manually, you would get a different hash than what's in the torrent file because of the additional 1 KB from file_d that you're unaware of. And as far as I know there is no way to match up those hashes, even though they were at least partially generated from the same data.
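
To make the boundaries in this example concrete, here is a rough sketch that works out which files contribute bytes to each piece. The function name piece_layout is hypothetical, and only the sizes matter, so no actual data is hashed.

```python
KB = 1024
MB = 1024 * KB

def piece_layout(files, piece_length):
    """For each piece, list the (file name, byte count) contributions."""
    layout = []
    current = []          # contributions to the piece being filled
    room = piece_length   # bytes still missing from the current piece
    for name, size in files:
        remaining = size
        while remaining > 0:
            take = min(room, remaining)
            current.append((name, take))
            remaining -= take
            room -= take
            if room == 0:  # piece is full, start the next one
                layout.append(current)
                current, room = [], piece_length
    if current:            # a short final piece, if any
        layout.append(current)
    return layout

files = [
    ("file_a", 3 * MB),
    ("file_b", 2 * MB),
    ("file_c", 1023 * KB),
    ("file_d", 1025 * KB),
]
layout = piece_layout(files, 1 * MB)
for index, parts in enumerate(layout):
    print(index, parts)
```

Piece index 5 comes out as 1023 KB of file_c plus 1 KB of file_d, which is why hashing file_c on its own never reproduces that piece hash.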

That leaves you with three options:
1. Accept that searching by file hash isn't practical.
2. Pad the files before or while hashing. (Don't do this: it isn't backwards compatible, and other torrent clients wouldn't be aware of it.)
3. Simply add the individual file hashes to the torrent meta info dictionary in the torrent file and have your implementation check for that entry.
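
As a sketch of what checking for such an entry (option 3) could look like: the code below assumes the optional 'md5sum' key mentioned in the first post, stored in the info dictionary for single-file torrents or in each entry of 'files' for multi-file ones. Many torrents may not carry it at all. The bdecode and file_md5sums helpers and the sample torrent are written for this example only and skip all error handling.

```python
import hashlib

def bdecode(data, pos=0):
    """Decode one bencoded value starting at pos; return (value, next_pos)."""
    lead = data[pos:pos + 1]
    if lead == b"i":                      # integer: i<digits>e
        end = data.index(b"e", pos)
        return int(data[pos + 1:end]), end + 1
    if lead == b"l":                      # list: l<items>e
        pos += 1
        items = []
        while data[pos:pos + 1] != b"e":
            item, pos = bdecode(data, pos)
            items.append(item)
        return items, pos + 1
    if lead == b"d":                      # dictionary: d<key><value>...e
        pos += 1
        result = {}
        while data[pos:pos + 1] != b"e":
            key, pos = bdecode(data, pos)
            value, pos = bdecode(data, pos)
            result[key] = value
        return result, pos + 1
    colon = data.index(b":", pos)         # byte string: <length>:<bytes>
    length = int(data[pos:colon])
    start = colon + 1
    return data[start:start + length], start + length

def file_md5sums(torrent_bytes):
    """Map file path -> md5sum for every file that records one."""
    meta, _ = bdecode(torrent_bytes)
    info = meta[b"info"]
    found = {}
    if b"files" in info:                  # multi-file torrent
        for entry in info[b"files"]:
            if b"md5sum" in entry:
                path = b"/".join(entry[b"path"]).decode("utf-8", "replace")
                found[path] = entry[b"md5sum"].decode("ascii")
    elif b"md5sum" in info:               # single-file torrent
        name = info[b"name"].decode("utf-8", "replace")
        found[name] = info[b"md5sum"].decode("ascii")
    return found

# Hypothetical, stripped-down torrent: only a.txt records an md5sum
digest = hashlib.md5(b"abc").hexdigest().encode("ascii")
sample = (
    b"d4:infod5:filesl"
    b"d6:lengthi3e6:md5sum32:" + digest + b"4:pathl5:a.txtee"
    b"d6:lengthi5e4:pathl5:b.binee"
    b"e4:name4:demoee"
)
sums = file_md5sums(sample)
print(sums)
```

A lookup by file hash then reduces to comparing your own MD5 against the values in that mapping, for whatever torrent files you can get hold of.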

Of course, none of this does you any good unless you have access to the actual torrent files, but I will speculate:
Search by name, then download the 10 most relevant torrents and check them for your extension.
Fall back to hash checking in case the file boundaries happen to line up with piece boundaries, although you'd actually be matching pieces of the files rather than entire files.
Other than that, you could just conclude that if a certain percentage of the torrent's pieces match, then it contains your file, and you could match that up to a file name.
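
Here is a rough sketch of that fallback, assuming you already have the candidate torrent's piece hashes and know the byte offset of the suspected file within the concatenated stream (the sum of the lengths of the files listed before it). The function name interior_piece_match is made up. It only checks full-length pieces that lie entirely inside the file, since those are the only ones a local copy can reproduce.

```python
import hashlib

def interior_piece_match(local_data, file_offset, piece_hashes, piece_length):
    """Return (matched, checked) over the pieces lying wholly inside the file."""
    file_end = file_offset + len(local_data)
    first = -(-file_offset // piece_length)  # first piece starting at or after the file start
    last = file_end // piece_length          # pieces in [first, last) end at or before the file end
    matched = checked = 0
    for index in range(first, last):
        start = index * piece_length - file_offset  # piece start relative to the file
        chunk = local_data[start:start + piece_length]
        checked += 1
        if hashlib.sha1(chunk).digest() == piece_hashes[index]:
            matched += 1
    return matched, checked

# Toy torrent: two files, 4-byte pieces (far smaller than any real piece size)
file_a = b"abcdef"
file_b = b"0123456789AB"
stream = file_a + file_b
piece_length = 4
torrent_hashes = [
    hashlib.sha1(stream[i:i + piece_length]).digest()
    for i in range(0, len(stream), piece_length)
]

good = interior_piece_match(file_b, len(file_a), torrent_hashes, piece_length)
bad = interior_piece_match(b"0123X56789AB", len(file_a), torrent_hashes, piece_length)
print(good, bad)
```

What ratio of matched to checked pieces counts as "contains your file" is a judgment call, and files smaller than about two pieces may have no interior pieces to check at all.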

I just happened to be thinking about this myself and I've written a bittorrent client so I have the details. ;)