PyScopus: An example of author disambiguation

Scopus sometimes mixes up people with similar names. I recently came up with a fairly simple method for cleaning an author's publication profile, although it still needs some manual work.

If you can think of a better way, please do let me know!

import pyscopus
pyscopus.__version__

'1.0.1'

from pyscopus import Scopus
key = 'YOUR_OWN_APIKEY'
scopus = Scopus(key)

Wrapper functions for disambiguation

import random
import time

import pandas as pd
import requests
from bs4 import BeautifulSoup as Soup

BASE_URL = "https://api.elsevier.com/content/abstract/scopus_id/"

def _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
    ## retrieve the abstract record for this scopus id
    r = requests.get(BASE_URL + str(sid), params={'apikey': apikey})
    soup = Soup(r.content, 'lxml')
    authors_tag = soup.find('authors')
    if authors_tag is None:
        ## no author list in the response (e.g., a failed request): cannot verify
        return False

    ## go through the author list to find the focal author, by matching author id
    for au in authors_tag.find_all('author'):
        if au.get('auid') == author_id:
            ## found it
            break
    else:
        ## the focal author is not listed on this paper
        return False

    ## check the affiliation id: note that an author may have a list of affiliations
    this_affil_id_list = [affil_tag.get('id') for affil_tag in au.find_all('affiliation')]
    ## keep the paper if any affiliation id overlaps with the known ones
    return len(set(author_affil_id_list).intersection(this_affil_id_list)) > 0

def check_pub_validity(scopus_obj, author_id, author_affil_id_list, apikey):
    ## first find all publications listed under this author id
    pub_df = scopus_obj.search_author_publication(author_id)
    ## do this for all non-null scopus ids
    pub_df = pub_df[~pub_df.scopus_id.isnull()]
    ## list to save all eligible scopus ids
    eligible_scopus_id_list = list()
    for i, sid in enumerate(pub_df.scopus_id.values):
        if (i+1)%5==0:
            ## pause for 0.3 to 1.3 seconds after every fifth request
            time.sleep(random.random()+.3)
        if _check_pub_validity(sid, author_id, author_affil_id_list, apikey):
            ## if true, save it
            eligible_scopus_id_list.append(sid)
    ## finally, get a subset of the original pub_df
    filtered_pub_df = pub_df[pub_df.scopus_id.isin(eligible_scopus_id_list)]
    return filtered_pub_df
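
To see what the per-paper check looks at without calling the API, here is a minimal offline sketch. It assumes a heavily simplified, made-up XML fragment (real Scopus responses are namespaced and carry many more fields), uses the standard library parser instead of BeautifulSoup, and the helper name affiliation_ids_for and all ids in it are hypothetical:

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified fragment. Only the author/auid and
# affiliation/id structure mirrors what the wrapper above reads.
SAMPLE_XML = """
<abstract>
  <authors>
    <author auid="AU-1" seq="1">
      <affiliation id="AFF-1"/>
    </author>
    <author auid="AU-2" seq="2">
      <affiliation id="AFF-2"/>
      <affiliation id="AFF-3"/>
    </author>
  </authors>
</abstract>
"""

def affiliation_ids_for(xml_text, author_id):
    """Return the affiliation ids listed for author_id, or [] if the author is absent."""
    root = ET.fromstring(xml_text.strip())
    for author in root.iter('author'):
        if author.get('auid') == author_id:
            return [aff.get('id') for aff in author.iter('affiliation')]
    return []

known_ids = {'AFF-3', 'AFF-9'}  # made-up "known" affiliations of the focal author
found = affiliation_ids_for(SAMPLE_XML, 'AU-2')
print(found, bool(known_ids & set(found)))
```

An author who is missing from the paper yields an empty list, so the overlap test fails and the paper would be discarded.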

When I was collecting data for my own research, I found that Dr. Vivek K. Singh has a very noisy profile in Scopus. Let's use this as an example.

The basic idea is to match author-affiliation pairs:

For each paper found in the mixed profile:
- Find the focal author (in this case, Dr. Singh)
- Look at his/her affiliation on that paper
  - Keep the paper if the affiliation is one where he/she has actually worked
  - If not, discard the paper
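
The steps above can be sketched on plain Python data, with no API calls. This is only an illustration of the decision rule: the function name keep_paper, the record layout, and all ids are made up for the example.

```python
def keep_paper(authors, author_id, known_affil_ids):
    """True if author_id is on the paper with at least one known affiliation."""
    for author in authors:
        if author['auid'] == author_id:
            return bool(set(author['affil_ids']) & set(known_affil_ids))
    return False  # the focal author is not on the paper at all

# Hypothetical mixed profile: paper id -> list of authors with affiliation ids
papers = {
    'paper-1': [{'auid': 'A1', 'affil_ids': ['AFF-1']},
                {'auid': 'A2', 'affil_ids': ['AFF-9']}],
    'paper-2': [{'auid': 'A1', 'affil_ids': ['AFF-7']}],
    'paper-3': [{'auid': 'A3', 'affil_ids': ['AFF-1']}],
}

kept = [pid for pid, authors in papers.items()
        if keep_paper(authors, 'A1', ['AFF-1', 'AFF-2'])]
print(kept)  # ['paper-1']
```

Only paper-1 survives: paper-2 lists the focal author under an unknown affiliation, and paper-3 has a known affiliation but belongs to a different author.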

For Dr. Singh, I obtained his affiliation ids manually through the Scopus affiliation search. With those in hand, create a dictionary containing his name (first/last), an affiliation name, and a list of affiliation ids. The author and affiliation names are used to search for the author; the list of affiliation ids is used to clean the papers:

UC Irvine 60007278
MIT 60022195
Rutgers 60030623

d = {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers',
     'affil_id_list': ['60030623', '60022195', '60007278']
    }
d

{'authfirst': 'Vivek',
 'authlastname': 'Singh',
 'affiliation': 'Rutgers',
 'affil_id_list': ['60030623', '60022195', '60007278']}

query = "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(d['authlastname'], d['authfirst'], d['affiliation'])
author_search_df = scopus.search_author(query)
author_search_df
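
If you need to do this for several people, the query string can be built by a small helper. A minimal sketch follows; build_author_query is a hypothetical name, not part of PyScopus, and it only reproduces the string format used above:

```python
def build_author_query(authlastname, authfirst, affiliation):
    """Compose an author-search query string in the same form as above."""
    return "AUTHLASTNAME({}) and AUTHFIRST({}) and AFFIL({})".format(
        authlastname, authfirst, affiliation)

# One dictionary per person, using the same keys as the dictionary d
people = [
    {'authfirst': 'Vivek', 'authlastname': 'Singh', 'affiliation': 'Rutgers'},
]
queries = [build_author_query(p['authlastname'], p['authfirst'], p['affiliation'])
           for p in people]
print(queries[0])  # AUTHLASTNAME(Singh) and AUTHFIRST(Vivek) and AFFIL(Rutgers)
```

Each resulting string can then be passed to scopus.search_author in the same way as the single query.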

An author search sometimes returns several profiles for the same person. In this case we get only one, and it is clearly very noisy.

In the following step, I use the helper functions defined above to screen each paper listed under this author_id:

author_id = '7404651152'
author_id, d['affil_id_list']

('7404651152', ['60030623', '60022195', '60007278'])

The filtering process may take a while, depending on how many documents are mixed up.

filterd_pub_df = check_pub_validity(scopus, author_id, d['affil_id_list'], key)
filterd_pub_df.shape[0], filterd_pub_df.scopus_id.unique().size, filterd_pub_df.scopus_id.isnull().sum()

(134, 134, 0)

The number of papers is greatly reduced. We can now check a random subset to see whether the filtered papers make sense for this author.

filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]

However, some noise may remain (e.g., papers published in optics/photonics venues). We can manually exclude those as well, by venue keyword:

## venue keywords that indicate noise (e.g., optics/photonics venues)
noise_keywords = ['optic', 'photonic', 'nano', 'quantum', 'sensor', 'cleo', 'materials', 'physics', 'chip']
noise_pattern = '|'.join(noise_keywords)
## case-insensitive match on the venue name; rows without a venue name are kept
is_noise = filterd_pub_df.publication_name.str.contains(noise_pattern, case=False, na=False)
filterd_pub_df = filterd_pub_df[~is_noise]
filterd_pub_df.shape

(59, 16)
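
One way to pick candidate keywords, instead of spotting them by eye, is to count how often each word appears across venue names. The sketch below is an assumption about how one might do that; venue_word_counts and its min_len parameter are hypothetical names, and the venue list is a small sample taken from the outputs in this notebook:

```python
import re
from collections import Counter

def venue_word_counts(venue_names, min_len=4):
    """Count how many venue names contain each lowercase word."""
    counts = Counter()
    for name in venue_names:
        # use a set so a word is counted once per venue name
        words = set(re.findall(r'[a-z]+', name.lower()))
        counts.update(w for w in words if len(w) >= min_len)
    return counts

venues = [
    'Optics Letters',
    'Optics InfoBase Conference Papers',
    'Applied Physics Letters',
    'IEEE Internet Computing',
]
counts = venue_word_counts(venues)
print(counts['optics'], counts['physics'], counts['computing'])  # 2 1 1
```

On the real data, the same function could be called on filterd_pub_df.publication_name.dropna(); frequent words outside the author's field are then candidates for the exclusion list, still subject to manual judgment.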

And let's check again:

filterd_pub_df.sample(n=20, replace=True)[['title', 'publication_name']]

The list is now much cleaner, and we can use it as the publication list for the focal author.

Output of the first random check (before the keyword exclusion):

	title	publication_name
118	Effects of high-energy irradiation on silicon ...	Optics InfoBase Conference Papers
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
141	Low-stress silicon nitride for mid-infrared mi...	Optics InfoBase Conference Papers
194	Mid-infrared silicon waveguide resonators with...	Materials Research Society Symposium Proceedings
67	Effects of high-energy irradiation on silicon ...	Optics InfoBase Conference Papers
194	Mid-infrared silicon waveguide resonators with...	Materials Research Society Symposium Proceedings
89	Preface	Geo-Intelligence and Visualization through Big...
179	Demonstration of high-Q mid-infrared chalcogen...	Optics Letters
147	Low-Stress silicon nitride platform for broadb...	Optics InfoBase Conference Papers
309	Situation based control for cyber-physical env...	Proceedings - IEEE Military Communications Con...
42	Towards measuring fine-grained diversity using...	Proceedings of the 11th International Conferen...
306	Motivating contributors in social media networks	1st ACM SIGMM International Workshop on Social...
146	Mid-infrared opto-nanofluidics for label-free ...	Optics InfoBase Conference Papers
54	Gradient Polymer Nanofoams for Encrypted Recor...	ACS Nano
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
336	Towards environment-to-environment (E2E) multi...	MM'08 - Proceedings of the 2008 ACM Internatio...
144	Low-stress silicon nitride platform for broadb...	Conference on Lasers and Electro-Optics Europe...
209	Anisotropic photoluminescence from Er-TeO<inf>...	CLEO: Science and Innovations, CLEO_SI 2012
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
71	On-chip mid-infrared gas detection using chalc...	Applied Physics Letters

Output of the second random check (after the keyword exclusion):

	title	publication_name
103	Assessing personality using demographic inform...	ACM International Conference Proceeding Series
44	Examining information search behaviors in smal...	Proceedings of the Association for Information...
44	Examining information search behaviors in smal...	Proceedings of the Association for Information...
18	Social bridges in urban purchase behavior	ACM Transactions on Intelligent Systems and Te...
62	If it looks like a spammer and behaves like a ...	International Journal of Information Security
152	EventShop: Recognizing situations in web data ...	WWW 2013 Companion - Proceedings of the 22nd I...
5	Are you altruistic? Your mobile phone could tell	2017 IEEE SmartWorld Ubiquitous Intelligence a...
103	Assessing personality using demographic inform...	ACM International Conference Proceeding Series
247	EventShop: From heterogeneous web streams to p...	Proceedings of the 4th Annual ACM Web Science ...
61	LTA 2016 - The first workshop on lifelogging t...	MM 2016 - Proceedings of the 2016 ACM Multimed...
34	Effect of gamma exposure on chalcogenide glass...	IEEE Radiation Effects Data Workshop
95	Physical-Cyber-Social Computing: Looking Back,...	IEEE Internet Computing
298	Structural analysis of the emerging event-web	Proceedings of the 19th International Conferen...
17	New Signals in Multimedia Systems and Applicat...	IEEE Multimedia
152	EventShop: Recognizing situations in web data ...	WWW 2013 Companion - Proceedings of the 22nd I...
317	Adversary aware surveillance systems	IEEE Transactions on Information Forensics and...
30	Toward harmonizing self-reported and logged so...	Conference on Human Factors in Computing Syste...
74	Predicting privacy attitudes using phone metadata	Lecture Notes in Computer Science (including s...
89	Preface	Geo-Intelligence and Visualization through Big...
64	Probing the interconnections between geo-explo...	UbiComp 2016 - Proceedings of the 2016 ACM Int...