Effective Searching on OpenFT

There are thousands of services on the internet that offer some kind of search facility, from web search engines such as Google to the various P2P filesharing networks. Confusingly, they all have different rules as to how they treat your search queries. Here are OpenFT's.

Tokens

OpenFT handles searches very simply. It splits your query into individual words, or "tokens", and returns all files that match all of the tokens.

Exclusion searches and phrase searches have recently been implemented. They are still unreleased, but will eventually be in 0.2.1.5.

There are no wildcards, no boolean searches, and no way to match only parts of words.

What's searched

OpenFT searches filenames, including the pathname from the top directory you shared (the "sharing root"). It also searches the following metadata tags: title, artist, album, genre and tracknumber. These are currently extracted from MP3 and Ogg Vorbis audio files (other filetypes are recognised but don't usually contain any of the above tags).

More about tokens

The following characters are classed as "punctuation" and removed:

, (comma)
` (backtick)
' (apostrophe)
! (exclamation mark)
? (question mark)
* (asterisk)

This means that, for example, "its" and "it's" are treated identically.

The following characters are classed as "delimiters", or token separators:

\ (backslash)
/ (slash)
(space)
_ (underline)
- (hyphen)
. (period)
[, ], ( and ) (brackets)

(For completeness, the tab character, ASCII 9, is also a delimiter, although it's unlikely ever to appear in a search query.)

Finally, tokens are case-insensitive.
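The rules above can be sketched in a few lines of Python. This is an illustration only, not OpenFT's actual code (which is written in C): the names tokenize and matches_all are made up here, str.lower() stands in for the locale-dependent lowercasing OpenFT really uses, and the file is represented by a single string rather than a filename plus tags.

```python
import re
from collections import Counter

# Characters the text lists as "punctuation": removed outright.
_PUNCTUATION = ",`'!?*"
# Characters the text lists as delimiters, including tab.
_DELIMITERS = re.compile(r"[\\/ _\-.\[\]()\t]+")


def tokenize(text):
    """Split text into lowercase tokens using the rules listed above."""
    # Drop punctuation first, so "it's" becomes "its".
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    # Split on runs of delimiters and discard empty pieces.
    return [token.lower() for token in _DELIMITERS.split(text) if token]


def matches_all(query, candidate):
    """True if the candidate contains every query token often enough."""
    needed = Counter(tokenize(query))
    available = Counter(tokenize(candidate))
    # A token repeated in the query must occur at least that many times.
    return all(available[token] >= count for token, count in needed.items())
```

With these definitions, tokenize("it's") and tokenize("its") both give ["its"], and matches_all("bang bang", "bang.mp3") is false because "bang" occurs only once.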

Quoting

Enclosing a group of tokens in quotes (") will match only results that contain that exact same sequence of tokens ("phrase"). Phrases won't match across tags; file paths are broken at each directory separator, so phrases won't match across directory boundaries either.

Queries may contain any combination of phrases and unquoted tokens. The final closing quote may be omitted.
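As a rough sketch of how phrase matching might work (again in Python, and not taken from the OpenFT source), the query can be split on the quote character, and each phrase checked against every path component and tag separately. The function names are hypothetical, and "/" is assumed to be the only directory separator. Unquoted tokens are parsed but not checked here.

```python
import re

_PUNCTUATION = ",`'!?*"
_DELIMITERS = re.compile(r"[\\/ _\-.\[\]()\t]+")


def tokenize(text):
    """Lowercase tokens, per the punctuation and delimiter rules."""
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    return [token.lower() for token in _DELIMITERS.split(text) if token]


def parse_query(query):
    """Return (phrases, loose_tokens); each phrase is a list of tokens."""
    phrases, loose = [], []
    # Odd-numbered segments sit between quotes. A missing final closing
    # quote simply leaves the last segment odd-numbered, so it still counts.
    for index, segment in enumerate(query.split('"')):
        tokens = tokenize(segment)
        if index % 2 == 1:
            if tokens:
                phrases.append(tokens)
        else:
            loose.extend(tokens)
    return phrases, loose


def contains_phrase(field_tokens, phrase):
    """True if phrase occurs as a consecutive run inside field_tokens."""
    n = len(phrase)
    return any(field_tokens[i:i + n] == phrase
               for i in range(len(field_tokens) - n + 1))


def matches_phrases(query, path, tags=()):
    """True if every quoted phrase occurs within one path part or one tag."""
    phrases, _ = parse_query(query)
    # Each directory component and each tag is a separate field, so a
    # phrase can never span a directory boundary or two tags.
    fields = [tokenize(part) for part in path.split("/")]
    fields += [tokenize(tag) for tag in tags]
    return all(any(contains_phrase(field, phrase) for field in fields)
               for phrase in phrases)
```

Under these assumptions, matches_phrases('"/etc/passwd"', "/etc/passwd") is false, while the same phrase does match the filename "etc_passwd.txt".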

Exclusions

Prefixing a token with - (minus sign) will match only results that do not contain that token.

(Note that the - convention is implemented by the client, so it's not guaranteed to work everywhere, and some clients may have an entirely different convention such as providing a separate input box for exclusions. Check your client's documentation.)

Excluding quoted phrases is not implemented, and may cause unpredictable results.
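Since the convention lives in the client, here is one way a client might handle it, sketched in Python. Everything in it is an assumption made for illustration: the function names, the choice to treat only a leading minus on a space-separated word as an exclusion, and the decision to ignore quoted phrases and repeated tokens altogether.

```python
import re

_PUNCTUATION = ",`'!?*"
_DELIMITERS = re.compile(r"[\\/ _\-.\[\]()\t]+")


def tokenize(text):
    """Lowercase tokens, per the punctuation and delimiter rules."""
    text = "".join(ch for ch in text if ch not in _PUNCTUATION)
    return [token.lower() for token in _DELIMITERS.split(text) if token]


def split_exclusions(query):
    """Separate '-word' exclusions from the rest of a query."""
    wanted, unwanted = [], []
    for word in query.split():
        # Only a leading minus marks an exclusion. A hyphen inside a
        # word is left alone and acts as an ordinary delimiter.
        if word.startswith("-") and len(word) > 1:
            unwanted.extend(tokenize(word[1:]))
        else:
            wanted.extend(tokenize(word))
    return wanted, unwanted


def matches(query, candidate):
    """True if candidate has every wanted token and no unwanted one."""
    wanted, unwanted = split_exclusions(query)
    present = set(tokenize(candidate))
    return (all(token in present for token in wanted)
            and not any(token in present for token in unwanted))
```

For example, split_exclusions("python -monty") separates "python" from the excluded "monty", whereas the hyphen in "hip-hop" just produces the two wanted tokens "hip" and "hop".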

Examples

linux: matches all files that contain the word "linux".
"linux": same as above.
linux avi: matches all files that contain both the words "linux" and "avi", not necessarily consecutively (and not necessarily in that order either).
bang bang: matches all files that contain "bang" at least twice, anywhere. This will be true even if the occurrences are in separate tags (e.g. in title and in the filename).
"bang bang": matches all files that contain the phrase "bang bang". (The usual punctuation and delimiter rules apply, so "bang!_bang" will also match.)
"pajama crisis" "hope to find": matches all files that contain both of the quoted phrases.
*.mpeg: matches all files that contain "mpeg" (the wildcard specifier is treated as punctuation and ignored).
python -monty: matches all files that contain "python" but not "monty".
/etc/passwd: matches all files that contain "etc" and "passwd" ("/" is a delimiter).
"/etc/passwd": will not match files called /etc/passwd, because phrases stop at directory boundaries (it will match files called etc_passwd.txt however, or mp3s or oggs with "/etc/passwd" in the tags).

Realm searches

You can limit your searches to a certain media type, one of audio, video, image, text or application. (Note that clients differ in what they call the realms; I'm using OpenFT's terminology here, which is the same as that used by MIME. Also note that giFTcurs offers "hash" and "user" searches in the realm menu; these are not realm searches; see below.)

Unfortunately, this is implemented using MIME types, which means it doesn't always do what you want. MIME types are determined solely by extension; the file contents are not read.

Searching in the audio realm will match files ending in ".mp3" (audio/mpeg), and also ".ogg" (audio/x-vorbis, although strictly speaking it should be application/ogg). Windows Media files (".wma") will not be matched, as giFT doesn't recognise the extension (possibly a good thing); neither will FLAC files (".flac") (not such a good thing).

Searching in the video realm will match ".avi" (video/x-msvideo), and ".mpeg" (video/mpeg), but not Ogg movies (as ".ogg" is already categorized under audio). (Some people use ".ogm" for Ogg movies for this reason; giFT doesn't understand those either.)

Similarly, text will match genuine plain text files, but not PDF documents (application/pdf) or M$ Word documents. And anything in a zip archive is classed as an application too. In fact, the only realm that does anything like what you might expect is image.

Basically it's a minefield, full of traps for the unwary. I'd highly recommend avoiding realm searches (except possibly image). If you really want only a specific filetype, adding the extension to your search terms is usually much more reliable.

For a complete list of extensions and their associated realms, see the data/mime.types file in the giFT distribution (or share/giFT/mime.types wherever you installed giFT).
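To see why realms behave this way, it helps to picture the lookup as a plain extension table. The Python sketch below contains only the extension and MIME type pairs named in this section; the real table is the mime.types file, and both the function name and the case-insensitive comparison of extensions are assumptions.

```python
# Only the pairs mentioned in the text; the full list is in mime.types.
MIME_BY_EXTENSION = {
    "mp3": "audio/mpeg",
    "ogg": "audio/x-vorbis",
    "avi": "video/x-msvideo",
    "mpeg": "video/mpeg",
    "pdf": "application/pdf",
}


def realm_of(filename):
    """Return the realm (the part of the MIME type before "/"), or None."""
    _, dot, extension = filename.rpartition(".")
    if not dot:
        return None  # no extension at all
    # Lowercasing the extension is an assumption made for this sketch.
    mime_type = MIME_BY_EXTENSION.get(extension.lower())
    if mime_type is None:
        return None  # unrecognised extension: matches no realm
    return mime_type.split("/")[0]
```

The file contents never enter into it: an Ogg movie named "movie.ogg" lands in the audio realm, and "track.flac" lands nowhere.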

Internationalization

OpenFT (and indeed giFT) is character set- and encoding-agnostic. Filenames are left in whatever format the filesystem stores them in (except on NT-based systems, where I've no idea what happens but they're certainly not kept in native UCS-2). Metadata from MP3s is typically Latin-1; metadata from Vorbis is UTF-8.

Case-insensitivity is achieved using ANSI C's tolower() function, which means it's entirely locale-dependent.

Hence, searching for anything outside simple ASCII is unreliable.
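A small Python illustration of the problem follows. It does not call the C library: bytes.lower() is used as a stand-in for tolower() in the plain "C" locale, since both change only the ASCII letters A to Z. The example name is arbitrary.

```python
def ascii_lower(data):
    """Lowercase only ASCII letters, like tolower() in the "C" locale."""
    return data.lower()  # bytes.lower() leaves non-ASCII bytes untouched


upper = "\u00c9DITH"  # E-acute followed by "DITH"
lower = "\u00e9dith"  # e-acute followed by "dith"

# Same encoding, different case: the accented letter is not folded,
# so the two spellings end up as different tokens.
latin1_upper = ascii_lower(upper.encode("latin-1"))
latin1_lower = ascii_lower(lower.encode("latin-1"))
print(latin1_upper == latin1_lower)

# Same case, different encoding: Latin-1 (typical for MP3 tags) and
# UTF-8 (Vorbis tags) give different bytes for the same name.
utf8_lower = ascii_lower(lower.encode("utf-8"))
print(latin1_lower == utf8_lower)
```

Both comparisons print False, while the plain ASCII part of the name ("dith") is folded correctly in every case.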

Other search types

In addition to token searches, there are two other types of searches available: source searches and user searches.

Source searches take a hexadecimal MD5 hash (32 characters long) prefixed by "MD5:", and return all users sharing the file with that hash. This is used internally for "find more sources", but you can specify any hash.
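If you want to build such a query yourself, the following Python sketch computes the hash of some data and checks that a string has the expected shape. The function names are made up, and whether OpenFT accepts uppercase hex digits or a lowercase prefix is not something this document states, so the pattern below is only a guess at what is acceptable.

```python
import hashlib
import re

# "MD5:" followed by exactly 32 hexadecimal digits. Accepting both
# upper- and lowercase digits is an assumption.
_SOURCE_QUERY = re.compile(r"MD5:[0-9a-fA-F]{32}")


def source_query_for(data):
    """Build a source-search query string for the given bytes."""
    return "MD5:" + hashlib.md5(data).hexdigest()


def is_source_query(query):
    """True if the string looks like a well-formed source search."""
    return _SOURCE_QUERY.fullmatch(query) is not None
```

For a real file you would pass its full contents, for example the result of open(path, "rb").read().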

User searches take an IP address in dotted quad notation, with an optional "username@" prefix, and return all shares from the user with that IP. The username is ignored if present. (OpenFT currently only allows one connection per IP, so this is unambiguous.)
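A client might reduce a user search to the bare address along these lines. This is a Python sketch with a hypothetical function name, not code from any giFT client; the addresses used with it should be replaced by real ones.

```python
import ipaddress


def parse_user_query(query):
    """Return the IPv4 address from 'a.b.c.d' or 'username@a.b.c.d'."""
    # Everything up to and including the last "@" is the optional
    # username, which is ignored.
    _, _, host = query.rpartition("@")
    # Raises ValueError if host is not a valid dotted quad.
    return ipaddress.IPv4Address(host)
```

Here parse_user_query("alice@192.0.2.10") and parse_user_query("192.0.2.10") return the same address, reflecting the fact that the username plays no part in the search.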

Final advice

If you're behind a firewall or NAT, be sure to open/forward your OpenFT ports so you can receive incoming connections. This will approximately triple the number of search results you get.

It may seem obvious, but check your spelling before you search! There's no attempt to correct misspelled words; you'll either get no results or, worse, a small number of results from people who've also misspelled the word in question.

Don't search for the same thing repeatedly. If you didn't get any results the first time, they won't magically appear when you search again immediately afterwards. Similarly, trying to avoid the 800-result limit this way is doomed to failure: you'll simply get the same results each time. Try searching for something more specific, or consider browsing individual users instead (which has no result limit).

Don't search for things like "preteen" or "gay piss" or "16 yo-cheerleader with wet white panties". Regardless of whether people are actually sharing such stuff, just don't, OK? Ugh.

Version history

0.2, 2004-08-08.
Updated for upcoming 0.2.1.5 release. Added hash prefixes, removed Super Cow Powers™.
0.1, 2003-09-22.
Initial version.

Tom Hargreaves (mailto:hex@freezone.co.uk).