Social Media Research
Python, Excel, HTML
In 2020 I was hired as a research assistant by a University of South Carolina Marketing professor to help her gather the data she and her graduate students needed for their research. They wanted to look at brands that market products specifically towards people of color (POC) and test people's sentiment towards those brands and products. They attempted to do this by looking at high-traffic social media posts by these companies and analyzing the sentiment in the comments. This is where they hit a roadblock:
How do they quickly record multiple data points per comment for thousands of comments?
The six data points that they wanted to record from each comment were:
Username
Profile Picture
Number of Likes
Date Posted
Link to Profile
Link to Comment
If we assume it takes an average of about 10 seconds to copy each data point into an Excel table, that's one minute per comment. With thousands of comments that's easily over 75 hours of work per post (4,500 comments at one minute each is 75 hours), and any additional post would require multiple weeks of work, since undergraduate research assistants can only be paid for 15 hours of work per week at USC. My job, therefore, was to create a tool that could reduce this time, and that I did.
Disclaimer: Most data points and URLs will have a black box over their content so as not to display too much of the research, and to stay on the good side of any social media Terms of Service agreements. I have also used a small subset of data as an example.
I started by importing the packages necessary to scrape the data and then manipulate it: bs4, pandas, and selenium, along with the standard library's random and datetime modules and selenium.webdriver (for options to increase performance).
Next I loaded in a list of options that affect how Selenium runs the Chrome Webdriver. All of these options were meant to reduce the memory utilization of the bot, as the social media sites weren't optimized for loading large amounts of comments at once, which could crash the Webdriver or load infinitely if I didn't take some precautions.
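Since the original options list isn't shown here, a rough sketch of that setup might look like the following. The specific flags are illustrative stand-ins, not the exact list I used:

```python
# Flags chosen to keep Chrome's memory footprint down; the exact set
# used in the project may have differed.
CHROME_FLAGS = [
    "--headless",            # no visible browser window
    "--disable-gpu",         # skip GPU compositing
    "--disable-extensions",  # fewer background processes
    "--blink-settings=imagesEnabled=false",  # don't download images
]

def make_driver():
    """Build a Chrome Webdriver with the low-memory flags applied."""
    from selenium import webdriver
    from selenium.webdriver.chrome.options import Options

    opts = Options()
    for flag in CHROME_FLAGS:
        opts.add_argument(flag)
    return webdriver.Chrome(options=opts)
```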
This is where the scraping actually took place. If you're interested in what each line does, take a look at the comments to the right or below each line.
I first had to comb the HTML to find where to tell the Webdriver to enter my username and password, as well as which buttons to press. I also entered how many blocks of comments I wanted the Webdriver to load (~12 comments per block, so total comments ÷ 12 = number of blocks). Once all of the code was written it was very hands off: I just had to wait for it to load all of the blocks, and then it automatically downloaded the HTML and saved it to a BeautifulSoup object.
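As a rough sketch of that flow (the selectors here are placeholders, since I can't show the real site's HTML):

```python
def scrape_comments(url, username, password, n_blocks):
    """Hypothetical sketch of the login-and-load loop.
    All element selectors are made-up stand-ins for the real site's HTML."""
    import time
    from bs4 import BeautifulSoup
    from selenium import webdriver

    driver = webdriver.Chrome()
    driver.get(url)

    # Log in using the fields and buttons found by combing the HTML
    driver.find_element("name", "username").send_keys(username)
    driver.find_element("name", "password").send_keys(password)
    driver.find_element("css selector", "button[type=submit]").click()

    # Each click loads one block of ~12 comments
    for _ in range(n_blocks):
        driver.find_element("css selector", "button.load-more").click()
        time.sleep(2)  # give the block time to render

    # Download the fully loaded page into a BeautifulSoup object
    soup = BeautifulSoup(driver.page_source, "html.parser")
    driver.quit()
    return soup
```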
After the BeautifulSoup object was created I could start manipulating the data. I'll be doing this process for each data point, so I'll just describe the basic process once. Each piece of content that you see on a website has HTML in the backend that tells the site which content to load and where. You can try it on any website: just right click anywhere on the page and click "Inspect Element". You can even try it on this site! Now you can see the HTML that gives this site its content and the CSS that makes it look pretty (if any real front-end engineers read this, please excuse my rough explanation; I'm not a front-end engineer).
Now, each element that repeats on a website, such as the basic data points for each comment that I mentioned earlier, will share what we'll call an "address": a way to easily and automatically find pieces of content in a sea of HTML text.
So in the code above I use that address that I talked about to parse the HTML and find all of the usernames, then I use a loop to run through these usernames and add each of them individually to a list. Easy (we'll get to some harder ones soon).
I then deleted the first two entries as they were both the name of the account that made the post. After that I make a few empty lists to hold data (which I actually repeat for some of them later on, no harm in declaring an empty list twice).
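On a toy snippet of HTML (with a made-up class name standing in for the real "address"), the process looks like this:

```python
from bs4 import BeautifulSoup

# Stand-in HTML; the real address (tag and class) differed per site
html = """
<span class="username">brand_account</span>
<span class="username">brand_account</span>
<span class="username">user_one</span>
<span class="username">user_two</span>
"""
soup = BeautifulSoup(html, "html.parser")

# Find every element at that address and loop through them,
# adding each username to a list
usernames = []
for tag in soup.find_all("span", class_="username"):
    usernames.append(tag.get_text())

# The first two entries are the account that made the post, so drop them
usernames = usernames[2:]
print(usernames)  # -> ['user_one', 'user_two']
```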
Next I did relatively the same process to add the profile pics for each user.
Same process, but this time with an added problem where there was a blank string between every comment. It's pictured below on our small subset of data. Let's solve it.
Here I just used modulus to delete every second item (all of the blank strings). This is what it looked like after! I also deleted the last entry, as it's just the language currently in use (occasionally the address grabs some extra elements).
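Here's the idea on some stand-in data:

```python
# The raw list alternates real comments with blank strings
raw = ["Great product!", "", "Love this", "", "Where can I buy?", "", "English"]

# Keep only even indices (index % 2 == 0), which drops every second item
comments = [c for i, c in enumerate(raw) if i % 2 == 0]

# Drop the last entry, which is just the page's current language
comments = comments[:-1]
print(comments)  # -> ['Great product!', 'Love this', 'Where can I buy?']
```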
Next I recorded the number of likes with the same process as above. This also proved to be a big challenge. You'll see why.
The comments explain it quite well, so I'd recommend looking there, but basically every comment had two entries for the like count: the number of likes and the string "Reply". But if there were 0 likes on the comment, there would only be the string "Reply", with no 0 for the like count. You see the problem?
Here is my solution. Please read the comments to see what each step does:
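Sketched on made-up data, the fix works like this:

```python
# Raw scrape: each comment contributes its like count (if it has one)
# followed by "Reply". A comment with zero likes contributes only "Reply".
raw = ["12 likes", "Reply", "Reply", "3 likes", "Reply"]

likes = []
for i, item in enumerate(raw):
    if item != "Reply":
        continue  # like counts are handled when we reach their "Reply"
    # A "Reply" preceded by a non-"Reply" entry belongs to that like count;
    # a "Reply" with no count before it means the comment had zero likes.
    if i > 0 and raw[i - 1] != "Reply":
        likes.append(int(raw[i - 1].split()[0]))
    else:
        likes.append(0)

print(likes)  # -> [12, 0, 3]
```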
After all of that we have a good looking list where each number corresponds to one comment. No "Reply"s and 0's for the comments that had no number. Nothing better than a good looking list.
Finally, we can move onto another easy parse. Here I parsed the HTML for the Date the comment was posted, and then deleted the first date as it was the post date.
Same process for the link to the comment, but this time I needed to add the web address before the link (to make it a full URL):
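For example, with a placeholder base address standing in for the real site:

```python
# Placeholder domain; the real base address is blacked out above
BASE = "https://www.example-social.com"

# Scraped links are relative paths, so prepend the base to make full URLs
relative_links = ["/p/abc123/c/1", "/p/abc123/c/2"]
comment_links = [BASE + link for link in relative_links]
print(comment_links[0])  # -> https://www.example-social.com/p/abc123/c/1
```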
I then repeated the process for the link to the profile of the user who posted the comment (this lets the researchers approximate the commenter's race, making the research more informative). This was also the last data point! Now we can see what it looks like when it's all combined into a Pandas DataFrame.
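With toy stand-in lists in place of the real scraped data, the final assembly is just:

```python
import pandas as pd

# Toy lists standing in for the six scraped data points
data = {
    "Username": ["user_one", "user_two"],
    "Profile Picture": ["/pic/1.jpg", "/pic/2.jpg"],
    "Number of Likes": [12, 0],
    "Date Posted": ["2020-06-01", "2020-06-02"],
    "Link to Profile": ["/u/user_one", "/u/user_two"],
    "Link to Comment": ["/c/1", "/c/2"],
}

# One row per comment, one column per data point
df = pd.DataFrame(data)
print(df.shape)  # -> (2, 6)
```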
All ready for analysis!
In the end I think the tool came out pretty well. Because of how the websites were coded, the program had a limit of a few thousand comments per post, which was normally enough to capture 50-100% of the comments on each post (more than enough for the analysis). Once a few thousand comments were loaded, the amount of memory the page used was astronomical, and load times between comment batches went from a few seconds to 20+ minutes. The exponential curve was steep.
If I were to do a project like this again, I would like to use this type of code on a site that splits its comments across multiple pages (which wouldn't run into the memory issue), or try running it on a computer more powerful than my laptop. This was one of my favorite jobs and is something I'd love to do more of. The future is automation!