Yatsko's Linguistic Informatics
TF*IDF Ranker
TF*IDF Ranker computes a score for each term of an input text using either the classic TF*IDF formula or a modified version of it.
We offer two versions of this software: the English one, and TF*IDF Ranker_2L, which also supports Russian. Both versions run on Windows machines. Go to the Downloads section of this site to get the application.
This is the classic formula that is widely used in term weighting techniques:
w(ij) = tf(ij) * log(N / n)
where
w(ij) = weight of Term Ti in Document Dj
tf(ij) = frequency of Term Ti in Document Dj
N = number of documents in a corpus
n = number of documents where term Ti occurs at least once
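The definitions above can be turned into a small function. The following is a minimal sketch, not the code of TF*IDF Ranker itself: the function and parameter names are hypothetical, and the natural logarithm is an assumption, since the base of the logarithm is not stated here. A term that is absent from the corpus (n = 0) is given a zero score.

```python
import math

def classic_weight(tf: int, n_docs: int, doc_freq: int) -> float:
    """Classic TF*IDF weight: tf * log(N / n).

    tf       -- frequency of the term in the analyzed document
    n_docs   -- N, the number of documents in the corpus
    doc_freq -- n, the number of corpus documents containing the term
    """
    if doc_freq == 0:
        # Assumption: a term unseen in the corpus scores zero.
        return 0.0
    # Assumption: natural logarithm; another base only rescales the weights.
    return tf * math.log(n_docs / doc_freq)

# A term occurring 3 times in the input text and in 2 of 10 corpus documents.
weight = classic_weight(tf=3, n_docs=10, doc_freq=2)
print(round(weight, 3))  # 4.828
```

A term that occurs in every corpus document gets log(N / N) = 0, which is why very common words sink to the bottom of the ranked list.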
Once you have the list of terms with their weights arranged in descending order, you can copy it to an external editor and use it in various ways: for example, to filter out stop-words, i.e. words with zero or low scores, or, on the contrary, to use the most salient words, those with the highest weights, to represent the content of the input text.
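As an illustration of both uses, the sketch below splits a ranked list into the most salient terms and the stop-word candidates. The function name, the threshold, the cut-off `top_k`, and the sample weights are all hypothetical values chosen for the example; they are not output of TF*IDF Ranker.

```python
def split_by_weight(ranked, low_threshold=0.0, top_k=3):
    """Split a ranked list of (term, weight) pairs.

    ranked is assumed to be sorted by descending weight.
    Returns (salient terms, stop-word candidates).
    """
    # The first top_k entries carry the highest weights.
    salient = [term for term, _ in ranked[:top_k]]
    # Terms at or below the threshold are treated as stop-word candidates.
    stop_candidates = [term for term, w in ranked if w <= low_threshold]
    return salient, stop_candidates

# Illustrative weights only.
ranked = [("corpus", 4.2), ("ranker", 3.1), ("term", 1.7),
          ("weight", 0.9), ("the", 0.0), ("of", 0.0)]
salient, stop_candidates = split_by_weight(ranked)
print(salient)          # ['corpus', 'ranker', 'term']
print(stop_candidates)  # ['the', 'of']
```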
A drawback of this formula is that terms that occur in the input text but cannot be found in the corpus get zero scores. In many cases such terms may be important for understanding the text. For example, a scientist may describe an invention using newly coined terms, or a writer may invent neologisms that are not registered in existing corpora. That is why we modified the classic formula: if a word occurs in the input text but does not occur in the corpus, n in the formula is assigned the value “1” rather than “0”. This is the formula modified by V. A. Yatsko:
w(ij) = tf(ij) * log(N / n), where n = 1 if term Ti does not occur in any document of the corpus
The main problem with the TF*IDF technique is the number of texts in the corpus, i.e. the value of N. How many texts must the corpus contain to be representative enough? No formal criteria have been developed so far; we are working on this problem and hope to suggest a solution in the near future.
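The modification described above amounts to a one-line change to the document-frequency term. The sketch below is an illustration under stated assumptions, not the application's own code: the function name is hypothetical and the natural logarithm is assumed.

```python
import math

def modified_weight(tf: int, n_docs: int, doc_freq: int) -> float:
    """Modified TF*IDF weight: n is treated as 1 when the term is absent from the corpus."""
    # Rule from the text: use n = 1 instead of n = 0 for unseen terms.
    n = doc_freq if doc_freq > 0 else 1
    # Assumption: natural logarithm.
    return tf * math.log(n_docs / n)

# A newly coined term: 2 occurrences in the input text, none in a 10-text corpus.
unseen = modified_weight(tf=2, n_docs=10, doc_freq=0)
# An ordinary term: 2 occurrences in the input text, found in 5 of 10 corpus texts.
seen = modified_weight(tf=2, n_docs=10, doc_freq=5)
print(round(unseen, 3), round(seen, 3))  # 4.605 1.386
```

Under this rule an unseen term receives the same weight as a term found in exactly one corpus document, so neologisms rise to the top of the list instead of dropping to zero.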
How to use
1) Use the add button to add texts and create a corpus. You can select and remove texts to create a new corpus.
2) Upload a text to analyze. TF*IDF scores will be computed for this text.
3) Select the classic formula or the modified formula. The classic version is the default option.
4) Click analyze.
5) Get a list of terms arranged in descending order of their weights.
6) Copy the list to an external editor for further processing. You can get more detailed information about the frequency distribution of terms.
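For readers who want to reproduce these steps outside the application, the workflow can be sketched as follows. This is only an approximation under stated assumptions: the names `tokenize` and `rank_terms` are hypothetical, tokens are taken to be lowercase runs of Latin letters, the natural logarithm is used, and the input text is not counted as part of the corpus. The application's own tokenization may differ.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Assumption: a token is a lowercase run of Latin letters.
    return re.findall(r"[a-z]+", text.lower())

def rank_terms(input_text, corpus_texts, modified=False):
    """Return (term, weight) pairs sorted by descending weight."""
    n_docs = len(corpus_texts)                              # N
    corpus_vocab = [set(tokenize(doc)) for doc in corpus_texts]
    term_freq = Counter(tokenize(input_text))               # tf per term
    weights = {}
    for term, tf in term_freq.items():
        n = sum(term in vocab for vocab in corpus_vocab)    # document frequency
        if n == 0:
            if not modified:
                weights[term] = 0.0                         # classic: unseen term scores zero
                continue
            n = 1                                           # modified: treat n as 1
        weights[term] = tf * math.log(n_docs / n)
    # Sort by descending weight; ties are broken alphabetically.
    return sorted(weights.items(), key=lambda item: (-item[1], item[0]))

corpus = ["the cat sat on the mat", "the dog chased the cat", "the bird sang"]
text = "the cat chased the zyx"
classic = rank_terms(text, corpus)
modified = rank_terms(text, corpus, modified=True)
print([term for term, _ in classic])   # ['chased', 'cat', 'the', 'zyx']
print([term for term, _ in modified])  # ['chased', 'zyx', 'cat', 'the']
```

In this toy run the invented word "zyx" is ranked last by the classic option and near the top by the modified one, while "the", which occurs in every corpus text, scores zero under both.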
TF*IDF Ranker can filter stopwords effectively, though much depends on the corpus against which the input text is matched. Sometimes additional filtering with a stoplist is required. The most representative English stopword list was compiled by Christopher Fox: https://dl.acm.org/doi/10.1145/378881.378888 . Here is the extended list (we added 5 items), which comprises 426 stopwords:
a, about, above, across, after, again, against, all, almost, alone, along, already, also, although, always, among, an, and, another, any, anybody, anyone, anything, anywhere, are, area, areas, around, as, ask, asked, asking, asks, at, away, b, back, backed, backing, backs, be, because, become, becomes, became, been, before, began, behind, being, beings, best, better, between, big, both, but, by, c, came, can, cannot, case, cases, certain, certainly, clear, clearly, come, could, d, did, didn, differ, different, differently, do, does, don, done, down, downed, downing, downs, during, e, each, early, either, end, ended, ending, ends, enough, even, evenly, ever, every, everybody, everyone, everything, everywhere, f, face, faces, fact, facts, far, felt, few, find, finds, first, for, four, from, full, fully, further, furthered, furthering, furthers, g, gave, general, generally, get, gets, give, given, gives, go, going, good, goods, got, great, greater, greatest, group, grouped, grouping, groups, h, had, has, have, having, he, her, herself, here, high, higher, highest, him, himself, his, how, however, i, if, important, in, interest, interested, interesting, interests, into, is, it, its, itself, j, just, k, keep, keeps, kind, knew, know, known, knows, l, large, largely, last, later, latest, least, less, let, lets, like, likely, long, longer, longest, m, made, make, making, man, many, may, me, member, members, men, might, more, most, mostly, me, mr, mrs, much, must, my, myself, n, necessary, need, needed, needing, needs, never, new, newer, newest, next, no, non, not, nobody, noone, nothing, now, nowhere, number, numbers, o, of, off, often, old, older, oldest, on, once, one, only, open, opened, opening, opens, or, order, ordered, ordering, orders, other, others, our, out, over, p, part, parted, parting, parts, per, perhaps, place, places, point, pointed, pointing, points, possible, present, presented, presenting, presents, problem, problems, put, puts, q, quite, r, re,
rather, really, right, room, rooms, s, said, same, saw, say, says, second, seconds, see, sees, seem, seemed, seeming, seems, several, shall, she, should, show, showed, showing, shows, side, sides, since, small, smaller, smallest, so, some, somebody, someone, something, somewhere, state, states, still, such, sure, t, take, taken, than, that, the, their, them, then, there, therefore, these, they, thing, things, think, thinks, this, those, though, thought, thoughts, three, through, thus, to, today, together, too, took, toward, turn, turned, turning, turns, two, u, under, until, up, upon, us, use, uses, used, v, ve, very, w, want, wanted, wanting, wants, was, way, ways, we, well, wells, went, were, what, when, where, whether, which, while, who, whole, whose, why, will, with, within, without, work, worked, working, works, would, y, year, years, yet, you, young, younger, youngest, your, yours.
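Additional filtering with a stoplist can be sketched as follows. This is a hypothetical illustration: `STOPLIST_EXCERPT` holds only a few entries from the list above, the function name is invented for the example, and the sample weights are made up. A real filter would load all entries of the list.

```python
# A few entries taken from the stoplist above (excerpt only).
STOPLIST_EXCERPT = {"a", "about", "the", "of", "and", "is", "in", "to", "with"}

def remove_stopwords(ranked, stoplist):
    """Drop (term, weight) pairs whose term is in the stoplist, keeping order."""
    return [(term, w) for term, w in ranked if term.lower() not in stoplist]

# Illustrative ranked list in which some stopwords kept a non-zero weight.
ranked = [("corpus", 4.2), ("The", 1.1), ("ranker", 0.8), ("of", 0.4)]
filtered = remove_stopwords(ranked, STOPLIST_EXCERPT)
print(filtered)  # [('corpus', 4.2), ('ranker', 0.8)]
```

Terms are lowercased before the lookup because the list above is given in lowercase.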