Classifying Twitter Text By Gender

I have a couple hundred tweets at my disposal, and I am looking to classify each Twitter user as male or female by getting their real name and looking at at least two of their tweets. I have already written the code that gets each person's real name from their profile, and I'm now looking to classify their tweet text to make a stronger determination of whether a user is M or F. I've searched online for examples of text classification but am not quite sure where to begin. I also found some VERY useful data at this link: Twitter Text With Gender Download. Any suggestions on how to classify tweet text as written by a male or female would be greatly appreciated! I have sort of hit a brick wall.


#1

I don't have any other text datasets that are for SURE written by males or females to aid in training the classifier.

This is a hurdle for you then. Either you need to perform supervised learning with such a data set, for instance using a perceptron learner; or you need to perform unsupervised learning, for instance k-means clustering, and try to find clusters that you can (somewhat arbitrarily) declare to be male or female signals. Distinguishing gender in an unsupervised approach is going to be next to impossible in practice, at least without some other existing information, priors, or feature maps that you can build upon.
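To make the supervised option concrete, here is a minimal sketch of a perceptron learner over bag-of-words features. Everything in it is an assumption for illustration: the whitespace tokenizer, the function names, and especially the toy data, whose placeholder words carry no real gender signal. The labels +1 and -1 simply stand for the two classes. With real labelled tweets you would swap in your own examples and a better tokenizer.

```python
from collections import defaultdict

def tokenize(text):
    """Lowercase and split on whitespace (a deliberately naive tokenizer)."""
    return text.lower().split()

def train_perceptron(examples, epochs=10):
    """Train a binary perceptron.

    examples: list of (text, label) pairs, label is +1 or -1.
    Returns a dict mapping each word to its learned weight.
    """
    weights = defaultdict(float)
    for _ in range(epochs):
        for text, label in examples:
            tokens = tokenize(text)
            score = sum(weights[tok] for tok in tokens)
            predicted = 1 if score > 0 else -1
            if predicted != label:
                # Mistake-driven update: push word weights toward the true label.
                for tok in tokens:
                    weights[tok] += label
    return weights

def predict(weights, text):
    """Return +1 or -1 for a new piece of text."""
    score = sum(weights.get(tok, 0.0) for tok in tokenize(text))
    return 1 if score > 0 else -1

# Hypothetical toy data: placeholder words only, not real tweets.
TOY_DATA = [
    ("alpha beta gamma", 1),
    ("alpha delta", 1),
    ("omega sigma gamma", -1),
    ("omega delta sigma", -1),
]

weights = train_perceptron(TOY_DATA)
print(predict(weights, "alpha beta"))
print(predict(weights, "omega sigma"))
```

Note that text made up entirely of unseen words scores 0 and falls into the -1 class by default, which is one reason a real system would want far more training data than a couple hundred tweets.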


#2

You need a training set; that much is obvious, and there is no other way. As was already stated in your previous question, Using Naive Bayes Classification to Identity a Twitter User's Gender, you can either create one by hand or in a semi-supervised fashion, where you create your training set using external rules (like those real names).
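As a rough sketch of the semi-supervised route, the code below labels users by a first-name lookup, keeps only the users whose name yields a label, and trains a small multinomial Naive Bayes model on their tweets. The name table, the user records, and all function names are hypothetical, and the placeholder words stand in for real tweet text. A real name lookup would need a much larger names resource.

```python
import math
from collections import Counter, defaultdict

# Hypothetical lookup table; a real one would be far larger.
NAME_TO_LABEL = {"alice": "F", "maria": "F", "bob": "M", "john": "M"}

def weak_label(real_name):
    """Return 'M', 'F', or None based on the first token of the profile name."""
    parts = real_name.lower().split()
    if not parts:
        return None
    return NAME_TO_LABEL.get(parts[0])

def build_training_set(users):
    """users: iterable of (real_name, [tweet, ...]). Unlabelled users are dropped."""
    training = []
    for real_name, tweets in users:
        label = weak_label(real_name)
        if label is not None:
            training.extend((tweet, label) for tweet in tweets)
    return training

def train_naive_bayes(training):
    """Collect per-class word counts and document counts."""
    word_counts = defaultdict(Counter)
    doc_counts = Counter()
    for text, label in training:
        doc_counts[label] += 1
        word_counts[label].update(text.lower().split())
    vocab = {w for counts in word_counts.values() for w in counts}
    return word_counts, doc_counts, vocab

def classify(model, text):
    """Pick the class with the highest log-probability (add-one smoothing)."""
    word_counts, doc_counts, vocab = model
    total_docs = sum(doc_counts.values())
    best_label, best_score = None, -math.inf
    for label in sorted(doc_counts):
        total_words = sum(word_counts[label].values())
        score = math.log(doc_counts[label] / total_docs)
        for word in text.lower().split():
            if word in vocab:  # words never seen in training are skipped
                score += math.log(
                    (word_counts[label][word] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical users with placeholder tweets.
USERS = [
    ("Alice Example", ["alpha beta", "alpha gamma"]),
    ("Bob Example", ["omega sigma", "omega gamma"]),
    ("xX_anon_Xx", ["alpha omega"]),  # name gives no label, so this user is dropped
]

training = build_training_set(USERS)
model = train_naive_bayes(training)
print(classify(model, "alpha beta"))
print(classify(model, "omega sigma"))
```

Keep in mind that labels produced this way are only as good as the name rule, so any errors in the lookup are learned by the classifier as well.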

The easiest way is to train your classifier on existing tweet data that already has gender labels. I would suggest: http://clic.cimec.unitn.it/amac/twitter_ngram/

Another resource is a blog gender dataset: http://www.cs.uic.edu/~liub/FBS/blog-gender-dataset.rar


#3

You can have a look at my Python gender detection project: https://github.com/muatik/genderizer

It tries to detect an author's gender from their name and/or a sample of their text (for example, tweets).


#4

You might also want to take a look at this REST API which returns gender based on first name: http://www.thomas-bayer.com/restnames/


#5

genderComputer is a Python script by @Bogdan Vasilescu that tries to infer a person's gender from their name (mostly first name) and location (country). The tool combines information from different countries with information about diminutives, l33t-speak and data from gender.c, an open source C program for name-based gender inference.


#6

The chance that k-means, or any other clustering method, will distinguish gender is close to 0, so this is not a good suggestion. Clustering will just find some separation in the data. The gender of the speaker is a very subtle signal that has to be carefully trained or engineered for; clustering won't work.

#7

@lejlot Agreed, probably a longshot given the nature of the data.

#8

Okay, much to my joy, I've found some files that might be useful to me. I edited my original post above to include them. They contain terms and user IDs with an estimated male or female label.

#9

Thank you very much for finding this dataset for me! I sincerely appreciate it. I will definitely take a look at this.

#10

Could you provide the source and a few more details (or point to) regarding the blog gender dataset? I would like to use it as part of a research project but I need to know more about it. When/How it was collected etc.

#11

Never mind, I found the source. For anyone interested, it is described and used here: cs.uic.edu/~liub/publications/EMNLP-2010-blog-gender.pdf

#12

Cool concept, but after testing it, detection by name is fine, while using just text seems pretty inaccurate. 'I am a mother of two children' gives 'None'; 'I am a mother of three children' gives 'male'; 'I am a mother of five children' gives 'female'; 'I love shopping at the mall' gives 'None'; and even 'I am a female doctor' gives 'male'.

#13

Yes, I do not claim its accuracy is at the desired level, mostly because it needs to be trained with English texts: github.com/muatik/genderizer/issues/1

#14

Sounds good. Could you tell me briefly how I can train your package with an existing microblog data set?

#15

You can use this script: github.com/muatik/genderizer/blob/master/genderizer/data/…