Ever wonder if we can quantify gender bias in society?
Using machine learning, we can generate word associations present in a given media source.
By looking at those associations we can tell how closely words are related to women or men.
The Gender Graph project allows users to plot where words lie on a scale of "he" to "she" based on a selected media source.
Enter your words and observe the differences that exist in the way we perceive gender.
Manifesto
This chart reveals that the media commonly associates toxic
words with women. We consume this media every day and therefore subliminally
inherit these biases. Much of our community believes that feminism isn’t
relevant anymore as women and men have “equal rights”. Hopefully this scientific
evidence will be concrete proof of the disparities that exist in the way we perceive
gender, and that we still have a long way to go.
Training sources
The Gender Graph Project currently has three trained models (a training sketch follows the list):
Wiki is a dump of English Wikipedia that includes 70,000 unique words. This dataset is also known as text8.
Reddit includes ~1.7 billion publicly available Reddit comments and contains 2 million unique words.
Google News includes 94,829 news articles posted on the Google News website.
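As an illustration, here is a minimal sketch of how a model such as the Wiki one could be trained on the text8 corpus. It assumes the gensim library (version 4.x) is used; the project's actual training pipeline, parameters, and file names are not specified here, so everything below is illustrative.

    # Hypothetical training run for a "Wiki"-style model on the text8 corpus (gensim 4.x).
    from gensim.models import Word2Vec
    from gensim.models.word2vec import Text8Corpus

    sentences = Text8Corpus("text8")  # path to the downloaded text8 file
    model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
    model.wv.save("wiki.kv")          # keep only the word vectors for later scoring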
How does it work?
In order for computers to understand English words, they need to be
converted to numbers. In particular, each word can be represented as a
point in a multidimensional space, which can be roughly visualized in two dimensions.
We use the word2vec tool to generate these word vectors based on the semantic
relationships between words in a given text source. This collection of word
vectors is called a model. We wrote a custom tool that uses this model to
score user words in relation to a given pair of words (in our case, "he" and "she").
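As a hedged sketch of this step, the following assumes the model is stored as gensim KeyedVectors; the file name "wiki.kv" and the example word "doctor" are illustrative, not part of the project.

    # Load a trained model and look up the vectors needed for scoring.
    from gensim.models import KeyedVectors

    wv = KeyedVectors.load("wiki.kv")
    he, she = wv["he"], wv["she"]  # anchor vectors defining the gender axis
    word = wv["doctor"]            # any user-supplied word present in the vocabulary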
In order to quantify whether a word is more commonly associated with women
or men, we measure how far the word lies from "she" and from "he".
Mathematically, this is accomplished by finding the vector direction
between "she" and "he" and projecting user words onto this vector using
simple vector operations such as the dot product.
The length of the projection onto this axis gives us an association score,
where values closer to 0.0 are related to “he”, and values closer to 1.0 are related to “she”.
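One possible implementation of this projection is sketched below. The exact normalization used by the project (for example, whether scores are clipped to the [0, 1] range) is an assumption here, not something the text specifies.

    import numpy as np

    def association_score(word, he, she):
        # Project `word` onto the direction from "he" to "she" and
        # normalize so that "he" maps to 0.0 and "she" maps to 1.0.
        axis = she - he
        t = np.dot(word - he, axis) / np.dot(axis, axis)
        return float(np.clip(t, 0.0, 1.0))

    score = association_score(word, he, she)  # vectors loaded in the previous sketch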
This approach gives us a good picture of semantic biases in the media. However, it is important to understand that in reality
these models are not perfect. Factors such as data quantity, data quality, and algorithmic imperfections may introduce
noise into the model.