{ Snipperize } /Python
Python snippets
Hamming Distance
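In information theory, the Hamming distance between two strings of equal length is the number of positions at which the corresponding characters differ, that is, the number of substitutions needed to turn one string into the other. For example: "toned" and "roses" is 3. 1011101 and 1001001 is 2. 2173896 and 2233796 is 3. For binary values, the Hamming distance equals the number of 1 bits in a XOR b.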
Python / hamming distance, similarity, distance / by ThePeppersStudio (58 days, 13.23 hours ago)
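For the Hamming Distance entry above, a minimal sketch of both forms in the description: the position-by-position count for equal-length strings, and the XOR bit count for integers. The function names are illustrative and are not taken from the original snippet.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def hamming_distance_bits(a, b):
    """Hamming distance of two non-negative integers: 1 bits in a XOR b."""
    return bin(a ^ b).count("1")


print(hamming_distance("toned", "roses"))            # 3
print(hamming_distance("2173896", "2233796"))        # 3
print(hamming_distance_bits(0b1011101, 0b1001001))   # 2
```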
Text deduplication with simhash
A traditional hash function maps identical texts to identical hash values. With simhash, documents that are nearly identical also get hash values that are close to each other.
Charikar's hash
With Charikar's hash, similar documents produce similar fingerprints. The algorithm works as follows:
* Document is split into tokens (words for example) or super-tokens (word tuples)
* Each token is represented by its hash value; a traditional hash function is used
* Weights are associated with tokens
* A vector V of integers is initialized to 0; the length of the vector corresponds to the desired hash size in bits
* In a cycle over all tokens' hash values (h), vector V is updated:
  o the ith element is decreased by the token's weight if the ith bit of the hash h is 0, otherwise
  o the ith element is increased by the token's weight if the ith bit of the hash h is 1
* Finally, the signs of the elements of V correspond to the bits of the final fingerprint
In other words, the hash is not computed over the document as a whole. Each token is hashed separately and the token hashes are accumulated bit by bit: with unit weights, if a token's hash has a 0 at a given bit position, 1 is subtracted at that position, and if it has a 1, 1 is added. Once all tokens have been accumulated this way, the signs of the totals give the fingerprint.
Python / simhash, hash, Charikar, similarity, duplicate / by ThePeppersStudio (58 days, 13.28 hours ago)
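For the simhash entry above, the listed steps can be sketched as follows. This is an illustration under assumptions, not the original snippet: SHA-256 truncated to `bits` stands in for the "traditional hash function", `weights` is a hypothetical dict from token to weight (default 1), and a zero total is mapped to a 0 bit.

```python
import hashlib


def simhash(tokens, bits=64, weights=None):
    """Charikar-style fingerprint of an iterable of string tokens."""
    if not 1 <= bits <= 256:
        raise ValueError("bits must be between 1 and 256")
    mask = (1 << bits) - 1
    v = [0] * bits
    for token in tokens:
        w = 1 if weights is None else weights.get(token, 1)
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        h = int.from_bytes(digest, "big") & mask
        for i in range(bits):
            # Add the weight where the token hash has a 1, subtract it otherwise.
            if (h >> i) & 1:
                v[i] += w
            else:
                v[i] -= w
    fingerprint = 0
    for i in range(bits):
        if v[i] > 0:  # positive total -> bit 1; zero or negative -> bit 0
            fingerprint |= 1 << i
    return fingerprint


def fingerprint_distance(f1, f2):
    """Hamming distance between two fingerprints."""
    return bin(f1 ^ f2).count("1")


a = simhash("the quick brown fox jumps over the lazy dog".split())
b = simhash("the quick brown fox jumped over the lazy dog".split())
print(fingerprint_distance(a, b))
```

Near-duplicate detection then reduces to comparing fingerprints by Hamming distance against a threshold; what threshold is suitable depends on the data and is not specified in the entry.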
Rename Unicode filename to pretty ASCII
This script converts accented characters in filenames to their ASCII equivalents. e.g.: â > a ä > a à > a á > a é > e í > i ó > o ú > u ñ > n ü > u ...
Python / unicode, unicodedata, glob, ascii, filename / by ThePeppersStudio (85 days, 16.28 hours ago)
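For the filename entry above, one common way to get the mapping it describes is Unicode decomposition followed by dropping the non-ASCII parts. The sketch below only converts a name; the helper name `to_ascii` is hypothetical, and the original script's renaming logic is not reproduced.

```python
import unicodedata


def to_ascii(name):
    """Strip accents: decompose characters, then drop non-ASCII code points."""
    decomposed = unicodedata.normalize("NFKD", name)
    return decomposed.encode("ascii", "ignore").decode("ascii")


print(to_ascii("señor_über.txt"))  # senor_uber.txt
```

Characters that have no ASCII decomposition (for example "ø" or "ß") are simply dropped by this approach, so two different names can collapse to the same result; check for collisions before passing the result to `os.rename`.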
Make unique file name
Sometimes data must be saved to a file, but a file with the specified name already exists. This function creates a file name that is similar to the original by adding a unique numeric suffix, so existing files do not have to be renamed.
Python / file, name, unique / by ThePeppersStudio (132 days, 17.78 hours ago)
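For the unique file name entry above, a minimal sketch of the idea. The `_N` suffix format and the function name are assumptions made for illustration; the original function may format the suffix differently.

```python
import os


def unique_filename(path):
    """Return path if it is free, else path with a numeric suffix before the extension."""
    if not os.path.exists(path):
        return path
    root, ext = os.path.splitext(path)
    n = 1
    while True:
        candidate = "%s_%d%s" % (root, n, ext)
        if not os.path.exists(candidate):
            return candidate
        n += 1
```

The check and the later file creation are separate steps, so another process could claim the name in between. Opening the result with mode `"x"` makes the creation itself fail if that happens.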
IP and MAC addresses
This module collects all IP and MAC addresses from several available sources on the underlying system. See the module documentation for more details, supported Python releases and platforms.
Python / ip, mac, address, os, socket, struct, sys / by ThePeppersStudio (132 days, 17.84 hours ago)
Console Make Text Color and Bold
no comment
Python / console, color, bold / by ThePeppersStudio (155 days, 9.85 hours ago)
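The console color entry above has no description. One usual approach on ANSI-capable terminals is to wrap text in SGR escape sequences, sketched below. The `COLORS` table and `style` function are hypothetical names, and this is not necessarily how the original snippet works.

```python
# Standard ANSI SGR foreground color codes.
COLORS = {
    "black": 30, "red": 31, "green": 32, "yellow": 33,
    "blue": 34, "magenta": 35, "cyan": 36, "white": 37,
}


def style(text, color=None, bold=False):
    """Wrap text in ANSI escape codes for the given color and/or bold."""
    codes = []
    if bold:
        codes.append("1")
    if color is not None:
        codes.append(str(COLORS[color]))
    if not codes:
        return text
    return "\033[%sm%s\033[0m" % (";".join(codes), text)


print(style("warning", "yellow", bold=True))
```

Terminals that do not interpret ANSI sequences will print the escape codes literally.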
Rsync Algorithm In Python
An implementation of the rsync algorithm in Python. As my rolling checksum, I just summed all of the ASCII byte values in a given window. Even with this simple weak rolling checksum, computing and comparing the rolling checksums is still terribly slow. I'm fairly certain the running time could be reduced a fair amount, but I haven't decided what the most efficient way of doing this in Python alone would be.
Python / algorithm, delta, diff, rsync / by ThePeppersStudio (177 days, 10.12 hours ago)
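For the rsync entry above, the weak checksum it describes (the sum of the byte values in a window) can be updated in constant time as the window slides, by adding the incoming byte and subtracting the outgoing one. The sketch below shows only that step, assuming `bytes` input; `rolling_sums` is a hypothetical name and this is not the author's implementation.

```python
def rolling_sums(data, window):
    """Yield (offset, sum of data[offset:offset + window]) for every offset."""
    if window <= 0 or len(data) < window:
        return
    total = sum(data[:window])
    yield 0, total
    for i in range(window, len(data)):
        # Slide by one byte: add the new byte, drop the oldest one.
        total += data[i] - data[i - window]
        yield i - window + 1, total


print(list(rolling_sums(b"abcd", 2)))  # [(0, 195), (1, 197), (2, 199)]
```

A plain byte sum collides easily (any permutation of a window has the same sum), so a match on this checksum would still need confirming with a stronger hash.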
Efficient Algorithm for computing a Running Median
Maintains sorted data as new elements are added and old ones are removed while a sliding window advances over a stream of data. Running time per median calculation is proportional to the square root of the window size.
Python / algorithm, indexable, median, running, skiplist, statistics / by ThePeppersStudio (177 days, 10.23 hours ago)
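For the running median entry above, the sliding-window bookkeeping can be illustrated with a plain sorted list and `bisect`. This is a simpler substitute for the entry's data structure, not the same technique: list insertion and deletion here cost time linear in the window size in the worst case, so it does not reach the running time the entry states.

```python
from bisect import bisect_left, insort
from collections import deque


def running_median(iterable, window):
    """Yield the median of each full window of the given size."""
    if window < 1:
        raise ValueError("window must be at least 1")
    queue = deque()   # values in arrival order
    ordered = []      # the same values, kept sorted
    for x in iterable:
        queue.append(x)
        insort(ordered, x)
        if len(queue) > window:
            old = queue.popleft()
            del ordered[bisect_left(ordered, old)]
        if len(queue) == window:
            mid = window // 2
            if window % 2:
                yield ordered[mid]
            else:
                yield (ordered[mid - 1] + ordered[mid]) / 2


print(list(running_median([1, 5, 2, 8, 3], 3)))  # [2, 5, 3]
```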
Browser history data structure
The BrowserHistory class encapsulates the history of moving from location to location, as in a Web browsing context, although the recipe is not restricted to Web browsing. See the docstrings for more details and usage. The current implementation requires Python 2.6.
Python / browser, history, track, web / by ThePeppersStudio (200 days, 8.53 hours ago)
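For the browser history entry above, a minimal sketch of the usual back/forward model: a list of locations plus a cursor, where visiting a new location discards anything ahead of the cursor. The class name `History` and its methods are hypothetical and do not reproduce the actual `BrowserHistory` API.

```python
class History:
    """Linear history with a cursor, in the style of a browser's back/forward."""

    def __init__(self, start):
        self._items = [start]
        self._pos = 0

    @property
    def current(self):
        return self._items[self._pos]

    def visit(self, location):
        """Go to a new location, dropping any forward history."""
        del self._items[self._pos + 1:]
        self._items.append(location)
        self._pos += 1

    def back(self):
        """Step back if possible and return the current location."""
        if self._pos > 0:
            self._pos -= 1
        return self.current

    def forward(self):
        """Step forward if possible and return the current location."""
        if self._pos < len(self._items) - 1:
            self._pos += 1
        return self.current


h = History("home")
h.visit("a")
h.visit("b")
print(h.back())     # a
h.visit("c")        # "b" is no longer reachable
print(h.forward())  # c
```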
Advanced Directory Synchronization module
This program is an advanced directory synchronization and update tool. It can be used to update content between two directories, synchronize them, or just report the difference in content between them. It uses the syntax of the 'diff' program in printing the difference.
Python / robocopier, file, synchronization, diff, filecmp / by ThePeppersStudio (213 days, 12.06 hours ago)
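For the directory synchronization entry above, the "report the difference" mode can be approximated with the standard library's `filecmp.dircmp`, which the entry's tags mention. The sketch below only collects differences recursively; `report_differences` is a hypothetical name, and the tool's copy and update logic and its diff-style output are not reproduced.

```python
import filecmp
import os


def report_differences(left, right):
    """Return sorted relative paths: (only in left, only in right, differing files)."""
    only_left, only_right, differing = [], [], []

    def walk(cmp, prefix):
        only_left.extend(os.path.join(prefix, n) for n in cmp.left_only)
        only_right.extend(os.path.join(prefix, n) for n in cmp.right_only)
        differing.extend(os.path.join(prefix, n) for n in cmp.diff_files)
        # Recurse into directories that exist on both sides.
        for name, sub in cmp.subdirs.items():
            walk(sub, os.path.join(prefix, name))

    walk(filecmp.dircmp(left, right), "")
    return sorted(only_left), sorted(only_right), sorted(differing)
```

`dircmp` compares files shallowly: files whose type, size, and modification time all match are treated as equal without reading their contents.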

