Employees of the Machine Learning Laboratory at ITMO University work not only on theory but also on applied projects. Some of these projects inspire members of the scientific and professional community around the world and transform business and the digital space. One such team is the Media Research Group, led by Professor Alexander Farseev. Today he talks about the research and projects of his team.
At the Media Research Group, which is part of ITMO's Machine Learning Lab, we work in several research areas. They involve using artificial intelligence systems to analyze social network data and to generate synthetic multimedia content. Moreover, all our projects find practical application in one way or another. One example is profiling in social networks.
Profiling means analyzing users' data. The purpose is to figure out who they are, what they like to do, and what personality type they have. Profiling is used in social, marketing, political and other research.
In 2017, the news surrounding Donald Trump drew a lot of attention to our profiling algorithms. Our algorithms decided Trump was single based on Twitter data, despite the fact that he was married. This news was reported widely, and even The Independent published an article about our research. Many people were surprised by the conclusion about Trump's marital status, but it did, in my opinion, serve to portray the "real face" of the former president.
It should be noted that the algorithm's accuracy was greater than 80%, indicating that the model was valid. The thing is that Trump's psychographic behavior did not match his demographics. If you didn't know who wrote Trump's tweets, you would never suspect that the author is an elderly married man who holds a powerful political position.
Chances are, like our algorithm, you'd think it was someone much younger.
Researchers' perceptions of the market situation do not always reflect the real state of affairs. In the marketing experts' universe, for instance, only 35–40-year-old women purchase goods for children. In fact, aunts, uncles, and fathers buy such items as well. And mothers may enjoy basketball, for example. However, marketing gurus frequently ignore this. Machine learning algorithms help to formulate and test such hypotheses more accurately.
Depending on the goals and the research model chosen, we consider age, location, subscribers, published videos and photographs, post texts, and other data throughout the profiling process. One of the very first questions that arises while developing a machine learning model is how to incorporate all the relevant data in a balanced way. For this reason, we create algorithms for what is known as "multimodal" machine learning. They can work with data from a variety of sources and datasets. This approach helps to build a complete picture of your consumers and produce accurate profiles.
In a number of studies, we used the MBTI (Myers-Briggs Type Indicator) scale to predict the characteristics of social network users. In one of them, we decided to focus on predicting users' marital status, because this characteristic greatly affects people's interests and behavior. We used the NUS-MSS dataset, which contains multimodal data from three social networks (Twitter, Foursquare, and Instagram) as well as accurate records of individuals' marital status from three regions (Singapore, New York, and London). We divided NUS-MSS users into married and unmarried groups to build a predictive model with quantitative values, and then used feature selection algorithms to identify attributes associated with marital status. We applied feature selection techniques to the two groups that resulted from extrapolating the findings. The table below shows the average accuracy of the model's forecasting ability for three locations.
Our experience shows that combining data from two sources can in some cases improve prediction accuracy by 17%. The approach considers not just an individual user's behavior, but also the behavior of other users who are similar to them. Similarity is determined by whether users fall into the same clusters, which are identified on the basis of data from several social networks. You can read about spectral clustering, which is a key concept in this study, in our article. If you're interested in digging deeper, check out a Java implementation of such clustering.
This is just the tip of the iceberg when it comes to the possibilities of AI systems in the analysis of data from social networks. Some AI cloud systems (such as Social Bakers or SoMin.ai, which I founded) are able to go far beyond personal profiling and use what is called psychographic analysis. It consists of revealing hidden personality traits that determine our daily decisions in literally every aspect of life.
Marketing experts spend dozens of hours preparing multiple variations of content. After all, they must "strike" the right audience, reflect the company brand, and make the content appealing to customers. The content also needs to be adjusted for different channels, which adds to the time commitment. This is the challenge that our second research topic addresses: with the support of machine learning technologies, marketing professionals can focus on creativity and strategic decisions while automated systems generate the content.
Content generation is possible with generative adversarial networks (GANs). Their architecture consists of two main parts: a generator and a discriminator. The first is in charge of creating synthetic content, while the second is in charge of determining whether the content in front of it is real or not. At each iteration, the generator takes the discriminator's results into account. If the discriminator cannot tell the difference between a synthetic image and a real photo, it means that the generator is producing realistic synthetic images.
For the digital marketing sector, as well as other professions and fields of activity, GANs are the technology of the future. We also use GANs in our commercial projects. For example, we used one of the architecture variations when designing the world's first AI-powered influencer for PUMA Asia Pacific. We named this character Maya. She takes selfies and lives her usual virtual life. To create her, millions of faces were matched from various sources, including Instagram. This made it possible to visualize several versions of the face, which was the first step in creating a virtual blogger.
However, generative adversarial networks alone are not enough here. Due to the commercial nature of the project, I am unable to reveal all the technical details. I would, however, like to highlight a tool that has proven valuable in this project as well as in others involving profiling. This is Hill Climbing search, a strategy for finding the best solution by gradually modifying one aspect of the current solution at a time. It is used as an optimization strategy for non-convex ensemble models. We often use Hill Climbing when we need to select the parameters of machine learning algorithms and there is no way to iterate over all combinations, for example because each training pass is too expensive. With Hill Climbing this problem is solved in far fewer passes, which speeds up the training process.
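To make the idea concrete, here is a minimal sketch of Hill Climbing over a small parameter grid. It is not the code used in the project: the two parameters (`depth`, `lr_index`), the 20 x 20 grid, and the toy `validation_score` function, which stands in for an expensive training pass, are all assumptions made for illustration.

```python
def validation_score(params):
    # Hypothetical stand-in for an expensive training run:
    # a toy function whose single peak is at depth=7, lr_index=12.
    depth, lr_index = params
    return -(depth - 7) ** 2 - (lr_index - 12) ** 2


def neighbors(params, low=0, high=19):
    """Yield candidates that differ from params in exactly one coordinate by +/-1."""
    for i in range(len(params)):
        for delta in (-1, 1):
            candidate = list(params)
            candidate[i] += delta
            if low <= candidate[i] <= high:
                yield tuple(candidate)


def hill_climb(score, start):
    """Move to the best-scoring neighbor until no neighbor improves the score."""
    current, current_score = start, score(start)
    evaluations = 1  # counts calls to score(), i.e. "training passes"
    while True:
        scored = [(score(c), c) for c in neighbors(current)]
        evaluations += len(scored)
        best_score, best = max(scored)
        if best_score <= current_score:
            return current, current_score, evaluations
        current, current_score = best, best_score


best, best_score, evaluations = hill_climb(validation_score, (0, 0))
print(best, best_score, evaluations)
```

On this toy function the search reaches the peak while scoring only a fraction of the 400 grid points that an exhaustive sweep would require. On a real, non-convex objective the same loop may stop at a local optimum instead.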
It's also crucial to be able to use Hill Climbing with Random Restart, a minor modification of the algorithm. The idea is to restart Hill Climbing many times from different random starting points for the parameters, which increases our chances of finding a global minimum rather than a local one, even for non-convex optimization problems. This is a highly useful heuristic for quickly selecting parameter values that are likely to be close to optimal. The technique's code implementation can be found here.
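The effect of random restarts can be illustrated with a short sketch. It is independent of the implementation mentioned above; the one-dimensional `loss` function with one local and one global minimum, the integer search range 0..99, and the function names are assumptions chosen to keep the example small.

```python
import random


def loss(x):
    # Hypothetical non-convex loss on the integers 0..99:
    # a local minimum at x=20 (loss 10) and the global minimum at x=80 (loss 0).
    return min(abs(x - 20) + 10, abs(x - 80))


def hill_climb(x):
    """Step by +/-1 towards lower loss until no neighbor is better."""
    while True:
        candidates = [n for n in (x - 1, x + 1) if 0 <= n <= 99]
        best = min(candidates, key=loss)
        if loss(best) >= loss(x):
            return x
        x = best


def random_restart(restarts, seed=0):
    """Run hill climbing from several random starting points and keep the best result."""
    rng = random.Random(seed)
    results = [hill_climb(rng.randint(0, 99)) for _ in range(restarts)]
    return min(results, key=loss)


print(hill_climb(10))      # a single run starting near the local minimum
print(random_restart(30))  # best result over 30 random starting points
```

A single run that starts at `x=10` gets trapped in the local minimum at 20, while any run that starts to the right of the ridge between the two basins reaches 80. The more restarts there are, the more likely it is that at least one of them starts in the basin of the global minimum.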
In particular, the Hill Climbing technique was used in one of our first social media user profiling projects. This project is covered in the article "Harvesting multiple sources for user profile learning: a big data study". There we perform data fusion by modeling the sources as a linear combination of the predictions of machine learning models trained on each source separately, the so-called Late Fusion Ensemble. Obviously, we will not attain optimal results by simply combining the sources with equal weights. After all, text data from Twitter, for example, can be more useful than text data from Foursquare (which is designed for exchanging geodata points). This is where approaches like Hill Climbing come in: they identify suitable weights for each social network and data modality, producing good integrated model results efficiently and rapidly (without going through all combinations of sources).
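The following sketch shows how late fusion weights could be tuned with Hill Climbing. It is not the method from the paper: the validation predictions, the source names, the coarse integer weight grid, and the choice of mean squared error as the objective are all assumptions made for illustration.

```python
# Hypothetical validation data: P(married) predicted by one model per source.
labels = [1, 0, 1, 0, 1, 0]
source_preds = {
    "twitter_text":      [0.9, 0.1, 0.9, 0.1, 0.4, 0.6],
    "foursquare_venues": [0.5, 0.5, 0.6, 0.5, 0.5, 0.4],
    "instagram_images":  [0.6, 0.4, 0.5, 0.5, 0.9, 0.1],
}
sources = list(source_preds)


def fused_mse(weights):
    """Mean squared error of the weighted average of the per-source predictions."""
    total = sum(weights)
    error = 0.0
    for i, y in enumerate(labels):
        fused = sum(w * source_preds[s][i] for w, s in zip(weights, sources)) / total
        error += (y - fused) ** 2
    return error / len(labels)


def neighbors(weights, max_weight=10):
    """Change exactly one weight by +/-1, keeping at least one weight positive."""
    for i in range(len(weights)):
        for delta in (-1, 1):
            w = list(weights)
            w[i] += delta
            if 0 <= w[i] <= max_weight and sum(w) > 0:
                yield tuple(w)


def hill_climb_weights(start):
    """Move to the neighbor with the lowest error until no neighbor improves it."""
    current = tuple(start)
    while True:
        best = min(neighbors(current), key=fused_mse)
        if fused_mse(best) >= fused_mse(current):
            return current
        current = best


equal = (1, 1, 1)
best = hill_climb_weights(equal)
print(dict(zip(sources, best)), fused_mse(equal), fused_mse(best))
```

Starting from equal weights, the search on this toy data drops the weight of the uninformative Foursquare predictions to zero and lowers the fused error, without evaluating every weight combination. A finer weight grid or random restarts would give more nuanced weights at the cost of more evaluations.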
Profiling can be used in conjunction with synthetic content. The most appealing auto-generated advertisement is shown to each user based on their interests. Let's say a fast-food chain puts up a billboard promoting a new burger. We can build another hundred versions of the banner and see which ones the public prefers. In this way, user profiling and content creation naturally complement each other. In fact, SoMin.ai blends these two research areas into a useful marketing tool. Guided by the MBTI personality type, which is automatically determined by analyzing content from social media profiles, SoMin.ai generates new content based on the preferences of other users with the same personality type. This is how the structure of the SoMin.ai platform looks:
On the server side, we collect content from companies through native interactions with their libraries and upload it to the platform every twelve hours, as shown in the diagram. The final five phases are carried out at intervals ranging from 24 hours to 30 days:
A more complete description of how the platform works can be found in the article that my colleagues from the laboratory and I published at WSDM 2020.
Businesses understand the potential of these research areas, and the Media Research Group successfully unlocks it. I think that's why SoMin.ai became an OpenAI partner and my team got access to GPT-3 to develop social media advertising algorithms. Probably for the same reason, SoMin.ai won the prestigious Cool Vendors Award 2020 from Gartner. That's not all, though. Most recently, we launched a new project, SoPop.ai. This program analyzes bloggers' posts and how users react to them. It works in a similar way to SoMin.ai: it helps businesses discover blogs that can be used for promotion.
In addition, SoPop.ai has collaborated with Arrival Bank to take the platform to the next level by creating a digital bank for influencers. Bloggers and businesses in such an ecosystem will not only look for advertising opportunities, but also improve their content. In this scientific article, you'll learn about the technologies that the platform was built on.
What comes next? Robots on the streets, or virtual friends? Let's see what happens! One thing is certain: the machine learning laboratory will not be lacking in intriguing projects.
ORIGINALLY PUBLISHED IN HABR