Predicting fake profiles on Instagram

Mar 29, 2021

To read in Portuguese, click here

You have probably had to deal with fake Instagram profiles sending you messages, sharing links, or even adding you to random groups.

It is an annoyance that occurs on many social networks, not only Instagram, and it ends up compromising the user experience.

Naturally, the large companies behind these networks did not stand still: with the advancement of Machine Learning, they set out to build solutions to alleviate the problem. This led to Machine Learning models that detect fake profiles so they can be removed quickly, and that is what we will learn about today!

Our Data

The data used in this article is available on the Kaggle website and can be accessed here.

In summary, our variables are composed of numerical values and are as follows:

profile pic : does the profile have a picture? (0 for “doesn’t have” and 1 for “has”)
nums/length username : ratio of numeric characters in the username to its length
fullname words : number of words in the full name
nums/length fullname : ratio of numeric characters in the full name to its length
name==username : is the username the same as the registered full name? (1 for true and 0 for false)
description length : number of characters in the bio
external URL : whether the bio has an external URL
private : is the profile private? (1 for yes and 0 for no)
#posts : number of posts
#followers : number of followers
#follows : number of profiles the user follows
fake : target variable (0 for “is not fake” and 1 for “is fake”)
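
To make the ratio features more concrete, here is a minimal sketch of how the share of numeric characters in a username could be computed from the raw text. The helper name numeric_ratio and the sample usernames are hypothetical; the Kaggle data set already comes with these columns precomputed, and the exact rule its authors used may differ.

```python
def numeric_ratio(text: str) -> float:
    """Share of characters in `text` that are digits (0.0 for an empty string)."""
    if not text:
        return 0.0
    digits = sum(ch.isdigit() for ch in text)
    return digits / len(text)


# Made-up usernames, for illustration only
for username in ["maria_silva", "user12345678"]:
    print(username, round(numeric_ratio(username), 2))
```

A username made mostly of digits gets a ratio close to 1, while one without digits gets 0.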

With these variables we will create a Machine Learning model that tells us whether a given profile is fake or not!

Preliminary analysis

Before we get started, it’s important that we know a few things about our data set. Firstly, it is interesting to understand the proportion of our target variable, that is, the “fake” variable.

Note that we have the same number of 0’s and 1’s, that is, the classes are balanced! This matters because a balanced data set helps keep the model from becoming biased toward one class.
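
A quick way to check this kind of balance is to count how often each label appears. The sketch below uses a made-up list of labels in place of the real “fake” column, and the helper name class_proportions is hypothetical.

```python
from collections import Counter


def class_proportions(labels):
    """Return {label: share of rows with that label}."""
    counts = Counter(labels)
    total = len(labels)
    return {label: count / total for label, count in counts.items()}


# Toy labels standing in for the "fake" column (0 = real, 1 = fake)
toy_labels = [0, 1, 1, 0, 0, 1, 1, 0]
print(class_proportions(toy_labels))
```

If the data is loaded into a pandas DataFrame, calling value_counts(normalize=True) on the target column gives the same information.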

Machine Learning

It’s time to create our model. For this article I condensed a lot of what I did: in my notebook on Google Colaboratory I showed 3 possible algorithms for our model, but here I will use only one, the Random Forest Classifier (the darling of data scientists).

The Random Forest Classifier (RFC) is an excellent classifier, both for its versatility and its efficiency. Shall we?
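
The training code relies on the variables X_treino, X_teste, y_treino and y_teste, whose creation is not shown in this article. As a rough sketch, they could come from a stratified train/test split like the one below. The synthetic data from make_classification and the test_size value are assumptions made for illustration, not what the notebook necessarily does.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the Kaggle data: 11 numeric features, binary target.
# (Assumption: the real notebook loads the CSV file instead.)
X, y = make_classification(n_samples=600, n_features=11, random_state=28)

# Hold out 25% of the rows for testing; stratify=y keeps the fake/real
# ratio the same in both parts. test_size is an assumed value.
X_treino, X_teste, y_treino, y_teste = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=28
)

print(X_treino.shape, X_teste.shape)
```

Stratifying on the target is a natural choice here because it preserves the class balance seen in the preliminary analysis.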

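from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report

# Train a Random Forest with the tuned hyperparameters
modelo = RandomForestClassifier(n_estimators=120, random_state=28, max_depth=100, max_samples=350)
modelo.fit(X_treino, y_treino)

# Predict the labels of the test set
predicoes = modelo.predict(X_teste)

# Report the overall accuracy and the per-class metrics
print("The average accuracy rate of the model with tuning was:", accuracy_score(y_teste, predicoes))
print(classification_report(y_teste, predicoes))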

After executing these lines of code, that is, after training the model and making predictions with the data we have, we arrive at the following results:

We had an accuracy of 0.925, that is, our model gets 92.5% of the profiles right: about 92% of the cases where the profile was not fake and 93% of the cases where the profile was fake!
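
To see where numbers like these come from, the sketch below recomputes the accuracy and the per-class hit rate by hand, assuming the per-class percentages are read as recall. The labels are invented for illustration and are not the article's results; accuracy_score and classification_report from scikit-learn do this bookkeeping for us.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


def recall_for(label, y_true, y_pred):
    """Of the rows whose true class is `label`, the fraction predicted as `label`."""
    preds_for_label = [p for t, p in zip(y_true, y_pred) if t == label]
    return sum(p == label for p in preds_for_label) / len(preds_for_label)


# Invented labels (0 = real, 1 = fake), for illustration only
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 1, 0]

print("accuracy:", accuracy(y_true, y_pred))
print("hit rate for real profiles:", recall_for(0, y_true, y_pred))
print("hit rate for fake profiles:", recall_for(1, y_true, y_pred))
```

The accuracy counts every correct prediction, while each per-class rate looks only at the profiles that truly belong to that class.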

Conclusions

The results we obtained are promising, although no model is perfect. Imagine that your profile is real and, because of some small error, the model labels it as “fake”. That is not a pleasant situation for either the user or the platform. Still, 92.5% accuracy is not a bad value, especially when we compare it with the other results we had.

If you want to see the full analysis in the Google Colaboratory notebook, click here!

Don’t forget to check out my complete portfolio too, where you will find this and several other projects. To access it, click here!

Thanks for reading another article of mine, have a great day! :)

Written by Luís Miguel Alves