The Meaning of Life Is in Growth: (E+K) Python for Social Scientists

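I found a blog post that summarizes Python nicely, so I read it and translated parts of it into Korean.

It feels like higher and higher quality blogs keep appearing.

Someday I'd like to put together a clean blog of my own, haha.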

---

Python for Social Scientists

This is a guest blog post by Nick Eubank, a Ph.D. Candidate in Political Economy at the Stanford Graduate School of Business.

This post was written by Nick Eubank, a Ph.D. candidate in political economy at the Stanford Graduate School of Business.

Python is an increasingly popular tool for data analysis in the social sciences. Empowered by a number of libraries that have reached maturity, R and Stata users are increasingly moving to Python in order to take advantage of the beauty, flexibility, and performance of Python without sacrificing the functionality these older programs have accumulated over the years.

Python is becoming more and more popular as a data analysis tool for social scientists. It keeps getting stronger thanks to many mature libraries, and R and Stata users are moving to Python to gain the strengths Python stands for (beauty, flexibility, and performance) without losing the strengths of the older tools.

But while Python has much to offer, existing Python resources are not always well-suited to the needs of social scientists. With that in mind, I’ve recently created a new resource, www.pythonforsocialscientists.org (PSS), tailored specifically to the goals and desires of the social scientist Python user.

However, while Python offers a lot, the existing resources are not always in a form that fits the needs of social scientists. With that in mind, I recently created a new site tailored to social scientists.

The site is not a new set of tutorials, however — there are more than enough Python tutorials in the world. Rather, the aim of the site is to curate and annotate existing resources, and to provide users guidance on what topics to focus on and which to skip.

This site is not a new kind of tutorial; there are already more than enough Python tutorials in the world. Instead, it gathers existing material and annotates it, so that users know which topics to focus on and which to skip.

Why a Site for Social Scientists?

Social scientists – and indeed, most data scientists – spend most of their time trying to wrestle individual, idiosyncratic datasets into the shape needed to run statistical analyses. This makes the way most social scientists use Python fundamentally different from how it is used by most software developers. Social scientists are primarily interested in writing relatively simple programs (scripts) that execute a series of commands (recoding variables, merging datasets, parsing text documents, etc.) to wrangle their data into a form they can analyze. And because they are usually writing their scripts for a specific, idiosyncratic application and set of data, they are generally not focused on writing code with lots of abstractions.

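Social scientists, and most data scientists too, spend much of their time transforming individual, idiosyncratic datasets so that statistical analysis becomes possible. This is why the way social scientists use Python differs from the way developers use it. Social scientists are mainly interested in writing relatively simple programs (scripts) that run a series of commands (recoding variables, parsing text documents, merging datasets, and so on) to get their data into a form they can analyze. And because they usually write their scripts for a specific, idiosyncratic application and dataset, they generally do not focus on abstraction when writing code.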

Social scientists, in other words, tend to be primarily interested in learning to use existing tools effectively, not develop new ones.

In other words, social scientists are interested in using the tools that already exist effectively, not in building new ones.

Because of this, social scientists learning Python tend to have different priorities in terms of skill development than software developers. Yet most tutorials online were written for developers or computer science students, so one of the aims of PSS is to provide social scientists with some guidance on the skills they should prioritize in their early training. In particular, PSS suggests:

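For this reason, social scientists learning Python end up with different priorities for skill development than software developers. But because most online tutorials are written for developers or computer science students, one goal of PSS is to guide social scientists on what to prioritize when learning Python. Specifically: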

Need immediately:

Data types: integers, floats, strings, booleans, lists, dictionaries, and sets (tuples are kinda optional)
Defining functions
Writing loops
Understanding mutable versus immutable data types
Methods for manipulating strings
Importing third party modules
Reading and interpreting errors

Things you’ll want to know at some point, but not necessarily immediately:

Advanced debugging utilities (like pdb)
File input / output (most libraries you’ll use have tools to simplify this for you)

Don’t need:

Defining or writing classes
Understanding Exceptions

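Need right away:

Data types: integer, float, string, boolean, list, dictionary, and set (tuple is optional)
Defining functions
Loops
Mutable and immutable data types (this seems to be talking about something like const)
Ways to work with strings
How to use third-party modules
How to interpret errors

Good to know, but not right away:

Advanced debugging (things like pdb)
Reading and writing files (most libraries make this easy to do)

Not needed:

How to write classes
How to handle exceptions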
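To make the "mutable versus immutable" item a bit more concrete, here is a minimal sketch using only built-in types. The variable names and the `add_respondent` helper are made up for illustration and do not come from the original post. Note that mutability is about whether an object can be changed in place, which is not quite the same thing as a const-style rule on variables.

```python
# Immutable types (str, int, tuple): "changing" a value creates a new object.
name = "survey"
upper_name = name.upper()      # returns a new string; `name` is untouched

# Mutable types (list, dict, set): changes happen in place and are visible
# through every variable that refers to the same object.
scores = [3, 5, 4]
alias = scores                 # no copy: both names point at one list
alias.append(2)                # `scores` sees this change too

independent = list(scores)     # explicit shallow copy
independent.append(99)         # does not affect `scores`


def add_respondent(counts, country):
    """Increment a tally in place; the caller's dict is modified."""
    counts[country] = counts.get(country, 0) + 1


tally = {}
for country in ["KR", "US", "KR"]:
    add_respondent(tally, country)

print(name, upper_name)
print(scores)
print(independent)
print(tally)
```

The practical takeaway for data scripts: when a function receives a list or dictionary, it can change the caller's data, so copy first if the original needs to stay as it was.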

Pandas

Today, most empirical social science remains organized around tabular data, meaning data that is presented with a different variable in each column and a different observation in each row. As a result, many social scientists using Python are a little confused when they don’t find a tabular data structure covered in their intro to Python tutorial. To address this confusion, PSS does its best to introduce users to the pandas library as fast as possible, providing links to tutorials and a few tips on gotchas to watch out for.

Today, most empirical social science works with data in tabular form (each column is a different variable and each row is a different observation). As a result, many social scientists using Python are a little confused when they don’t find a tabular data structure covered in their intro to Python tutorial. To clear up this confusion, PSS tries to introduce the pandas library as early as possible, with links to tutorials and a few tips. (I could not translate this paragraph properly.)

The pandas library replicates much of the functionality that social scientists are used to finding in Stata or R — data can be represented in a tabular format, column variables can be easily labeled, and columns of different types (like floats and strings) can be combined in the same dataset.

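The pandas library is built so that the features social scientists were used to in Stata or R carry over: data is a table, columns can easily be given names, and columns of different types (for example floats and strings) can sit in the same dataset.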
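As a rough illustration of what that looks like in practice, the sketch below builds a small table with string and float columns, recodes a variable, and merges in a second table. The data, column names, and the median-based recode are all invented for this example, and it assumes pandas is installed.

```python
import pandas as pd

# Hypothetical survey data: each row is one respondent, each column one variable.
respondents = pd.DataFrame({
    "id": [1, 2, 3, 4],
    "country": ["KR", "US", "KR", "US"],   # string column
    "income": [41.5, 58.0, 36.2, 61.3],    # float column
})

# Hypothetical country-level table to merge in.
countries = pd.DataFrame({
    "country": ["KR", "US"],
    "region": ["Asia", "North America"],
})

# Recode a variable: flag respondents above the sample median income.
median_income = respondents["income"].median()
respondents["high_income"] = respondents["income"] > median_income

# Merge the two datasets on a shared key, keeping every respondent row.
merged = respondents.merge(countries, on="country", how="left")

# Column types differ within one table; group means work on labeled columns.
print(merged.dtypes)
region_means = merged.groupby("region")["income"].mean()
print(region_means)
```

Stata or R users may recognize the pattern: `merge` plays a role similar to the merge commands in those tools, and `groupby` covers the usual "summarize by group" step.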

pandas is also the gateway to many other tools social scientists are likely to use, like graphing libraries (seaborn and ggplot2) and the statsmodels econometrics library.

pandas is also the gateway to the other tools social scientists often use, such as graphing libraries (seaborn, ggplot2) and the statsmodels library for econometrics.

Other Libraries by Research Area

While all social scientists who wish to work with Python will need to understand the core language and most will want to be familiar with pandas, the Python eco-system is full of application-specific libraries that will only be of use to a subset of users. With that in mind, PSS provides an overview of libraries to help researchers working in different topic areas, along with links to materials on optimal use, and guidance on relevant considerations:

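Every social scientist who wants to work in Python will need the core language, and most will want pandas, but the Python ecosystem is also full of libraries that matter only to particular fields. So PSS also tries to share the libraries that researchers in each area can use.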

Network Analysis: iGraph
Text Analysis: NLTK, and if needed coreNLP
Econometrics: statsmodels
Graphing: ggplot and seaborn
Big Data: dask and pyspark
Geo-Spatial Analysis: arcpy or geopandas
Making code faster: %prun in iPython (for profiling) and numba (for JIT compilation)
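For the last item, profiling is usually the first step before reaching for something like numba. Inside IPython the one-liner would be `%prun slow_sum_of_squares(200_000)`; the sketch below does roughly the same thing in a plain script with the standard library's `cProfile` and `pstats` modules. The function name and the workload are made up for illustration.

```python
import cProfile
import io
import pstats


def slow_sum_of_squares(n):
    """Deliberately naive loop so the profiler has something to measure."""
    total = 0
    for i in range(n):
        total += i * i
    return total


# Record every function call made between enable() and disable().
profiler = cProfile.Profile()
profiler.enable()
result = slow_sum_of_squares(200_000)
profiler.disable()

# Collect the report in a string, sorted by cumulative time, top 5 entries.
buffer = io.StringIO()
stats = pstats.Stats(profiler, stream=buffer).sort_stats("cumulative")
stats.print_stats(5)
report = buffer.getvalue()
print(report)
```

Once the report shows which function dominates the run time, that function is the candidate for a faster approach, whether that is a vectorized pandas operation or JIT compilation.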

Want to Get Involved?

This site is young, so we are anxious for as much input as possible on content and design. If you have experience in this area you want to share please drop me an email or comment on Github.

This site is young, so it needs more input on its content and overall design. If you have experience in this area and want to share it, please send me an email or leave a comment on Github.

Reference
Original post: https://realpython.com/blog/python/python-for-social-scientists/
Python for Social Scientists: http://www.pythonforsocialscientists.org/

Wednesday, March 9, 2016