Problem Statement

Published 2 weeks ago by gafsad271988.

As a data scientist for the marketing department at Reddit, I need to find the most predictive keywords and/or phrases to accurately classify the dating advice and relationship advice subreddit pages, so we can use them to determine which ads should populate on each page. Because this is a classification problem, I'll use Logistic Regression and Bayes models. Misclassifications in this case will be fairly harmless, so I will use the accuracy score and a baseline of 63.3% to measure success. Using TfidfVectorizer, I'll get the feature importance to determine which words have the highest predictive power for the target variable. If successful, this model could also be used to target other pages that have a similar frequency of the same words and phrases.

Data Collection

See the dating-advice-scrape and relationship-advice-scrape notebooks for this part.

After converting all of the scrapes into DataFrames, I saved them as CSVs, which can be found in the dataset folder of the repo.
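As a rough illustration of that conversion step, the sketch below turns one Reddit listing response into DataFrame rows and writes them out as CSV. The sample payload is made up, and the helper name `listing_to_rows` and the choice of columns are assumptions; the actual scraping code lives in the notebooks mentioned above.

```python
import io

import pandas as pd

# Made-up sample shaped like a Reddit listing response (kind/data/children).
sample_listing = {
    "kind": "Listing",
    "data": {
        "after": None,
        "children": [
            {"kind": "t3", "data": {"subreddit": "dating_advice",
                                    "title": "First date tips",
                                    "selftext": "Where should we go?"}},
            {"kind": "t3", "data": {"subreddit": "relationship_advice",
                                    "title": "We argue a lot",
                                    "selftext": "Together for two years."}},
        ],
    },
}


def listing_to_rows(listing):
    """Hypothetical helper: keep only the fields used later in the write-up."""
    return [
        {
            "subreddit": child["data"]["subreddit"],
            "title": child["data"]["title"],
            "selftext": child["data"]["selftext"],
        }
        for child in listing["data"]["children"]
    ]


df = pd.DataFrame(listing_to_rows(sample_listing))

# Write to an in-memory buffer here; a real run would pass a file path instead.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
csv_text = buffer.getvalue()
```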

Data Cleaning and EDA

  • dropped rows with a null selftext column because those rows are useless to me.
  • combined the title and selftext columns into one new column, all_text
  • examined distributions of word counts for the title and selftext columns per post and compared the two subreddit pages.
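The cleaning steps above could look roughly like this in pandas. This is a minimal sketch on a made-up three-row frame; the `subreddit` column name and the word-count column names are assumptions, while `title`, `selftext`, and `all_text` follow the text.

```python
import pandas as pd

# Tiny made-up stand-in for the scraped posts.
posts = pd.DataFrame({
    "subreddit": ["dating_advice", "relationship_advice", "dating_advice"],
    "title": ["First date tips", "My partner and I argue a lot", "Link only"],
    "selftext": ["Where should we go for coffee?",
                 "We have been together two years.",
                 None],
})

# Drop rows whose selftext is null.
posts = posts.dropna(subset=["selftext"]).reset_index(drop=True)

# Combine title and selftext into one new column, all_text.
posts["all_text"] = posts["title"] + " " + posts["selftext"]

# Word counts per post, split on whitespace.
posts["title_word_count"] = posts["title"].str.split().str.len()
posts["selftext_word_count"] = posts["selftext"].str.split().str.len()

# Average word counts per subreddit, as a starting point for comparing them.
summary = posts.groupby("subreddit")[
    ["title_word_count", "selftext_word_count"]
].mean()
```

On real data, a histogram of each word-count column per subreddit would show the distributions rather than just the means.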

Preprocessing and Modeling

Found the baseline accuracy score of 0.633, which means that if I always pick the value that occurs most frequently, I'll be right 63.3% of the time.
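The baseline is simply the share of the majority class. A small sketch, using made-up label counts (19 versus 11) chosen only so that the ratio matches the 0.633 reported above; the real class counts are not given in this write-up.

```python
from collections import Counter


def baseline_accuracy(labels):
    """Accuracy of always predicting the most frequent label."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


# Hypothetical label counts, not the real dataset sizes.
labels = ["relationship_advice"] * 19 + ["dating_advice"] * 11

print(round(baseline_accuracy(labels), 3))  # 0.633
```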

First attempt: logistic regression model with default CountVectorizer parameters. Train score: 99 | test: 75 | cross val: 74.

Second attempt: tried CountVectorizer with stemmer preprocessing on the first set of scraped data. Pretty bad score with high variance. Train: 99% | test: 72%.

  • tried decreasing max features and the score got a lot worse
  • tried lemmatizer preprocessing instead and the test score went up to 74%

Just increasing the data and stratifying y in my train/test split increased my cvec test score to 81 and cross val to 80. Adding 2 parameters to my CountVectorizer helped a lot. A min_df of 3 and ngram_range of (1,2) increased my test score to 83.2 and cross val to 82.3. However, these scores disappeared.

I think Tfidf worked the best to decrease my overfitting due to variance because I customized the stop words to take out only the ones that were really too frequent to be predictive. This was a success; however, with more time I probably could have tweaked them a bit more to increase all scores. Looking at both single words and words in groups of two (bigrams) was the best param that gridsearch suggested; however, all of my top most predictive words ended up being unigrams. My original list of features had plenty of gibberish words and typos. Setting the minimum number of times a word was required to show up to 2 helped get rid of these. Gridsearch also suggested a 90% max df rate, which helped to eliminate oversaturated words as well. Finally, setting max features to 5000 cut down my columns to about a quarter of what they were, to focus on just the most commonly used words of what was left.

Conclusion and Recommendations

Even though I would like to have higher train and test scores, I was able to successfully lower the variance, and there are definitely a few words that have high predictive power, so I think the model is ready to launch a test. If advertising engagement increases, the same keywords could be used to find other potentially lucrative pages. I found it interesting that taking out the overly used words helped with overfitting, but brought the accuracy score down. I think there is probably still room to play around with the parameters of the TfidfVectorizer to see whether different stop words produce a different or


Used Reddit's API, the requests library, and BeautifulSoup to scrape posts from two subreddits, Dating Advice and Relationship Advice, and trained a binary classification model to predict which subreddit a given post originated from.