STYLOMETRY – Software
Pioneering Greek software, developed by a Greek researcher and his colleagues, can analyze texts by anonymous authors on social media and correctly identify their sex, age, and psychological personality traits.
Even for short texts such as tweets, the software, which “reads” Greek, English, Spanish and Italian, automatically recognizes the writer’s gender with accuracy above 90%. Given more social network posts or longer texts (e.g. 5,000 words), accuracy can reach 100%. For estimating age and personality traits, accuracy is between 40% and 70%.
In an interview with the Athens-Macedonian News Agency, the creator of the software, Mr. George Mikros, professor of Computational & Quantitative Linguistics at the University of Athens, chair of its Department of Italian Language & Literature, and vice president of the International Quantitative Linguistics Association, stresses that the linguistic mechanism has a different biological basis in men and women, and that women process language better.
As he says, women use both hemispheres to produce speech, while men use only the left. Women also outperform men in language tests, and their speech is usually more formal.
The different ways in which the two sexes express themselves can be recognized by the artificial intelligence algorithms in the software developed by Mr. Mikros, which can analyze an anonymous text and conclude whether it was written by a man or a woman. The software can even draw conclusions about the author’s age and basic personality traits.
Software of this kind belongs to the field of Stylometry, which combines techniques from linguistics and information technology. Mr. Mikros started developing this program in 2007 and is still perfecting it, in collaboration with researchers in the US.
“The possible practical applications are many,” he says: in criminology (e.g. to identify the author of a terrorist proclamation or an anonymous threatening letter), in literature (e.g. to establish the authorship of old texts), in detecting any kind of plagiarism (e.g. in student or other work), in investigating the dynamics of public opinion on the internet (e.g. through sentiment analysis of social network posts about a politician or a company), in education, and so on.
G. Mikros is also an adjunct professor at the Department of Applied Linguistics at the University of Massachusetts in Boston. Since 1992 he has been a research associate of the Institute for Language and Speech Processing at the Research Center “Athena” (where he has contributed to the development of language technology software), and this year he was appointed director of the undergraduate Program of Studies “Spanish Language and Culture” at the Hellenic Open University.
On Friday, October 21 (at 19:00), he will speak at the Herakleidon Museum in Thissio, at an event of the group “Thales + Friends”, on “How differently do men and women write? Predicting the sex of the author in social media.”
Here’s the interview:
Q: Is there really a different biological basis for the language mechanism in women and men?
A: Yes, women do use both hemispheres of the brain during speech production, while men use only the left one. Women also show a range of anatomical differences in parts of the brain compared with men.
All these differences work to the benefit of female language use, as they allow the cerebral hemispheres to work together during language production and enable faster, higher-quality processing of linguistic data.
Q: So is it true that women outperform men in language processing?
A: Yes. All the studies carried out in education have shown that women have a small but consistent advantage over men in language tests. This advantage has been confirmed over time and also across cultures, as it seems to hold regardless of the speaker’s nationality and cultural background.
Women also consistently use more prestigious language forms in their speech and prefer the socially prestigious language code, in contrast to men, who often adopt language elements of lower social status. Finally, women worldwide have lower rates of language development disorders, as well as faster and more effective recovery of their language functions after a stroke.
Q: Are there basic differences between men and women in how they express themselves, and can an algorithm really “catch” these differences?
A: Men and women have fundamentally different ways of linguistic expression. The differences range from vocabulary and structural choices to a multitude of linguistic features whose use is subconscious.
They include, among others, word length, sentence length, the frequency of certain character sequences, and parts of speech. These are some of the features that the artificial intelligence algorithms use to build statistical models of male and female language use, which are then applied to analyze a text of unknown authorship and predict the sex of its author.
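As a rough illustration of the kind of features mentioned in this answer, the following Python sketch computes average word length, average sentence length and the relative frequency of the most common character sequences for a text. It is only a minimal example under stated assumptions, not Mr. Mikros’s software: the function name `stylometric_features`, the choice of three-character sequences and the simple punctuation-based sentence splitting are all hypothetical. Part-of-speech features are left out because they would require a language-specific tagger.

```python
import re
from collections import Counter


def stylometric_features(text, n=3, top_k=5):
    """Compute a few simple stylometric features (illustrative sketch only).

    Assumptions: words are runs of letters/digits, sentences end at
    . ! ? or ;, and character sequences are overlapping n-grams.
    """
    words = re.findall(r"\w+", text)
    sentences = [s for s in re.split(r"[.!?;]+", text) if s.strip()]

    # Average word length in characters.
    avg_word_len = sum(len(w) for w in words) / len(words) if words else 0.0
    # Average sentence length in words.
    avg_sent_len = len(words) / len(sentences) if sentences else 0.0

    # Relative frequency of the top_k most common character n-grams.
    lowered = text.lower()
    ngrams = Counter(lowered[i:i + n] for i in range(len(lowered) - n + 1))
    total = sum(ngrams.values())
    top_ngrams = {g: c / total for g, c in ngrams.most_common(top_k)}

    return {
        "avg_word_len": avg_word_len,
        "avg_sent_len": avg_sent_len,
        "char_ngram_freq": top_ngrams,
    }


if __name__ == "__main__":
    print(stylometric_features("I am here. You are there."))
```

In a full system, feature vectors like these would be computed for many texts of known authorship and passed to a machine learning classifier; that training step is not shown here.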
Q: How effective is your software at recognizing sex, age or personality?
A: The rates of correct recognition of the author’s sex exceed 90%, even when the texts used are extremely short, such as tweets that do not exceed 140 characters. The accuracy for the author’s age category ranges from 40% to 70%, depending on the language features used and the texts given to train the algorithm. The accuracy for the author’s personality is at levels comparable to those for age group, although in recent years the algorithms have become more accurate and those rates are increasing.
Q: What might be the practical applications of such software?
A: Such software can be used to identify the authorship of anonymous documents in a wide range of circumstances. Texts of forensic interest can be analyzed by such software to draw useful conclusions about the identity of the author and about characteristics such as gender, age and personality.
Another important application is investigating the authorship of texts of historical and literary interest. An indicative example is a recent study, completed in 2012, which attributed anonymous 19th-century translations to Papadiamantis.
Another area of analysis where this software is used is determining the reading difficulty of texts and automatically classifying them into difficulty levels, according to the educational level of the students they are intended for.
Finally, this software can be used to explore the sentiment of a text and to determine whether the author has a positive or negative attitude towards the subject being discussed. A real use case is the automatic evaluation of a restaurant review to determine whether it is positive or negative.
Q: When will the software be completed and ready for use?
A: The software is a laboratory prototype. Development started in 2007 and is still ongoing.
The wide range of potential applications is leading us to rebuild it gradually, so that it can be used by organizations or individuals who could benefit from its analytical capabilities. In this context, we collaborate closely with research groups in the US in order to produce software that can analyze texts in many languages and is at the same time user-friendly.
Q: Give us an overview of what Computational Stylistics is and what it aims to do.
A: Computational Stylistics is an interdisciplinary field that explores how texts are written and how this is linked to the identity of the author or to other characteristics such as gender, age and psychological traits.
The field requires intensive collaboration between disciplines such as Linguistics, Natural Language Processing, Literary Analysis, Statistics, Information Retrieval and Artificial Intelligence, in particular Machine Learning.
Automatic authorship attribution has made significant progress over the last decade, both in the reliability of its methods and in the efficiency and sensitivity of the techniques that have been developed.
What must be particularly emphasized is that, like any tool, it can be dangerous in unskilled hands. In stylometric analysis, attributing authorship in critical real-world problems (e.g. those of forensic interest) therefore requires strict standards that minimize experimental error.